Knative Scale to Zero: Running Low-Traffic Services Efficiently
Knative Serving provides a simple way to run HTTP workloads on Kubernetes that automatically scale based on traffic. One of the most useful capabilities is scale to zero. When a service receives no traffic, the pods disappear completely. When a request arrives, Knative starts the workload again and routes the request to it.
Motivation
Recently, I discussed this problem with a friend who hosts many small client services on a public cloud platform. These services are used infrequently, some only a few times per month, yet they incur continuous hosting charges. He wanted to reduce costs without sacrificing flexibility. Moving everything back to on-premises hardware seemed like an option, but it would sacrifice the ability to scale globally or easily migrate back to the cloud. Knative with scale-to-zero offers a middle ground: run a small platform at home, host services that consume resources only when actually invoked, and maintain the option to move workloads back to public cloud infrastructure if needs change. For anyone managing many lightweight internal services, especially those used infrequently, this pattern can dramatically reduce infrastructure costs while preserving operational flexibility.
This post explores the concept and mechanics of Knative’s scale-to-zero behavior. It uses NGINX only as a simple example container. A key finding from setting this up is that Knative services must be addressed by their fully qualified DNS name to work reliably within the cluster.
Knative must already be installed in the cluster. If you need installation instructions, refer to the official documentation at https://knative.dev/docs/install/
How Knative Scale-to-Zero Works
When traffic reaches a Knative service, it passes through several platform components before reaching your container. This is how Knative manages autoscaling and cold starts.
flowchart TD
A[Client request] --> B[Service DNS]
B --> C[Knative route]
C --> D{Scaled to zero?}
D -- Yes --> E[Activator]
D -- No --> F[Revision pod]
E --> F
If the service has scaled down, the Activator temporarily receives the request while Knative starts a new pod. Once the pod is ready, traffic flows directly to it.
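How long the Activator stays in the request path after scale-up can be tuned per revision with the target-burst-capacity annotation. The sketch below shows only the relevant fields; the value is illustrative and should be adjusted to your workload.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
spec:
  template:
    metadata:
      annotations:
        # "0" removes the Activator from the data path once pods are running;
        # "-1" keeps it in the path permanently as a request buffer.
        autoscaling.knative.dev/target-burst-capacity: "0"
```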
Inside the revision pod, Knative injects the queue-proxy container alongside your application.
flowchart TD
subgraph RevisionPod["Revision Pod"]
A[queue-proxy]
B[user container]
A --> B
end
The queue-proxy is responsible for request buffering, concurrency control and metrics collection used by the Knative autoscaler.
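Concurrency behavior can be configured per revision. As a sketch (the numbers are illustrative, not recommendations), a soft autoscaling target and a hard per-pod limit look like this:

```yaml
spec:
  template:
    metadata:
      annotations:
        # Soft target: the autoscaler aims for ~10 in-flight requests per pod
        autoscaling.knative.dev/target: "10"
    spec:
      # Hard limit: queue-proxy forwards at most 20 concurrent requests
      # to the user container; excess requests are queued
      containerConcurrency: 20
      containers:
        - image: nginx:stable
```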
A minimal Knative service example
Below is a simple example service. The container image itself is not important. NGINX is used here only because it is a convenient lightweight HTTP server.
In this example, the service is marked cluster-local (reachable only from inside the cluster), but Knative services can also be public-facing if needed.
apiVersion: v1
kind: Namespace
metadata:
  name: knative-demo
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: knative-demo
  labels:
    networking.knative.dev/visibility: cluster-local
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "3"
        autoscaling.knative.dev/window: "30s"
    spec:
      containers:
        - image: nginx:stable
          ports:
            - containerPort: 80
Apply the manifest:
kubectl apply -f service.yaml
Knative will automatically create the underlying revision and deployment. You never define the Deployment yourself.
Invoking the service
Knative services are addressed by their DNS name. For cluster-local services, use the internal DNS:
example-service.knative-demo.svc.cluster.local
Requests can come from any pod in the cluster. A simple way to test this is to run a temporary container and call the service.
kubectl run curl --rm -it --image=curlimages/curl -- \
curl http://example-service.knative-demo.svc.cluster.local
The first request may take noticeably longer because the service has to scale up from zero (a cold start).
Observing scale to zero
You can watch the revision pods while sending traffic.
kubectl get pods -n knative-demo -w
The behaviour typically looks like this:
stateDiagram-v2
[*] --> Idle
Idle --> ZeroPods: service ready
ZeroPods --> TrafficArrives: request received
TrafficArrives --> PodStarting: autoscaler scales up
PodStarting --> Running: pod ready
Running --> TrafficStops: no traffic
TrafficStops --> ZeroPods: grace period expires
This allows clusters to host many small services without keeping them permanently running.
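The grace period before idle pods are removed is a cluster-wide setting in the config-autoscaler ConfigMap in the knative-serving namespace. The default is 30s; the value below is illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "true"       # the default; set "false" to disable
  scale-to-zero-grace-period: "60s"  # illustrative; default is 30s
```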
Note on Istio and Knative
Knative always injects the queue-proxy container into revision pods to handle request queuing, concurrency control, and metrics collection. If Istio is installed in the cluster with sidecar injection enabled for your namespace, Istio will add a second proxy container (istio-proxy) for traffic management and observability. This results in two proxies in the request path: traffic flows through istio-proxy first, then queue-proxy, before reaching your application. Both proxies serve important functions and work together without conflict.