This project evaluates a production inference stack built on top of existing OSS projects. The stack follows the llm-d reference architecture; credit goes to the llm-d contributors for that architecture and for several core components, such as the EPP. In this stack, KAITO is the inference engine, and we focus on evaluating request-routing and autoscaling performance. We run the vLLM simulator so that the entire stack can be evaluated on CPUs only.
- Istio Gateway — Entry point for all inference requests. Routes client requests (e.g., `POST /v1/chat/completions`) through the stack.
- llm-gateway-auth — ext_authz API-key authorization filter. Validates the `Authorization: Bearer <token>` header against an `APIKey` custom resource resolved from the request's `Host` subdomain (`<namespace>.gw.example.com`) before any routing or model dispatch happens. Ships two components — `apikey-operator` (reconciles `APIKey` CRs into per-namespace Secrets) and `apikey-authz` (the ext_authz dataplane).
- Body-Based Router (BBR) — Parses the request body to extract the model name and injects the `X-Gateway-Model-Name` header, enabling model-level routing.
- llm-d-inference-scheduler (EPP) — Per-model Endpoint Picker (image `mcr.microsoft.com/oss/v2/llm-d/llm-d-inference-scheduler`). Performs KV-cache-aware routing by injecting the `x-gateway-destination-endpoint` header, directing requests to the optimal inference pod.
- Kaito InferenceSet — Manages groups of vLLM inference pods. Multiple InferenceSets (e.g., Model-A, Model-B) can run different models simultaneously.
- vLLM Inference Pods (`llm-d-inference-sim`) — Serve model inference requests. On CPU-only E2E clusters, the real vLLM container is replaced by a shadow pod running `llm-d-inference-sim` (image `ghcr.io/llm-d/llm-d-inference-sim`), a lightweight vLLM-compatible simulator that exposes the same OpenAI API and `vllm:*` Prometheus metrics. See `pkg/gpu-node-mocker/README.md` for the original-pod ↔ shadow-pod mechanism.
- keda-kaito-scaler — Metric-based autoscaler built on KEDA that scales vLLM inference pods up and down based on workload metrics.
- Mocked GPU Nodes / CPU Nodes — Infrastructure layer providing compute resources for inference workloads. The `gpu-node-mocker` controller (E2E-only) fakes GPU nodes on CPU-only clusters and runs the `llm-d-inference-sim` shadow pods on real CPU nodes.
- Client → Istio Gateway. The client sends `POST /v1/chat/completions` to `<namespace>.gw.example.com` with a bearer token (see the example request after this list).
- Gateway → ext-proc filters. `llm-gateway-auth` validates the token; BBR parses the body and injects `X-Gateway-Model-Name`.
- Gateway → EPP. The per-deployment `HTTPRoute` matches the model name and calls `llm-d-inference-scheduler`, which returns the target pod via `x-gateway-destination-endpoint`.
- Gateway → vLLM Pod. Envoy forwards the request directly to the chosen inference pod; the response streams back along the reverse path.
- Unmatched models. The namespace's catch-all `HTTPRoute` returns an OpenAI-compatible `404 model_not_found` from the cluster-shared `default/model-not-found` Service.
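Concretely, a request through this path might look like the following sketch; the hostname, API key, and model name are placeholder values:

```bash
# Placeholder host, token, and model name; substitute your own.
curl -s http://my-namespace.gw.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "model-a", "messages": [{"role": "user", "content": "Hello"}]}'
# llm-gateway-auth validates the token, BBR copies "model-a" into
# X-Gateway-Model-Name, the EPP picks the target pod, and Envoy forwards
# the request to it.
```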
- vLLM pods → metrics. Each pod exposes `vllm:*` Prometheus metrics (queue depth, KV-cache utilisation, request rate).
- keda-kaito-scaler → KEDA. The external scaler aggregates per-`InferenceSet` pod metrics and returns a single summed metric value.
- KEDA → HPA → InferenceSet. KEDA exposes that value through the external metrics API; the HPA computes the desired replica count from it and patches the `InferenceSet`, and the KAITO controller adds or removes vLLM pods. A sketch of this wiring follows the list.
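A minimal sketch of that wiring, assuming keda-kaito-scaler is consumed as a standard KEDA `external` trigger; the scaler address, metadata keys, and threshold below are illustrative assumptions, not values from the chart:

```yaml
# Sketch only: scalerAddress and the metadata keys are assumptions; consult
# the keda-kaito-scaler chart for the real trigger configuration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-a-autoscale
  namespace: my-namespace
spec:
  scaleTargetRef:
    apiVersion: kaito.sh/v1alpha1   # the HPA patches the InferenceSet itself
    kind: InferenceSet
    name: model-a
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: external
      metadata:
        scalerAddress: keda-kaito-scaler.keda.svc:9090  # hypothetical address
        metricName: vllm-queue-depth                    # hypothetical key
        threshold: "10"                                 # hypothetical value
```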
Install the stack in three steps. Step 1 is one-time per cluster; steps 2 and 3 are repeated per workload namespace and per model.
Installed by `hack/e2e/scripts/install-components.sh` (or its production equivalent). These components live across multiple namespaces and are shared by every model deployment:
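Step 1 then reduces to running that script; the sketch below assumes it takes no required arguments (check the script itself for prerequisites):

```bash
# One-time per cluster. Assumes kubectl, helm, and istioctl are on PATH and
# KUBECONFIG points at the target cluster.
./hack/e2e/scripts/install-components.sh
```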
| Component | Namespace | Version (`versions.env`) | Install method | Role |
|---|---|---|---|---|
| KAITO workspace controller | `kaito-system` | latest chart, image `nightly-latest` | helm | Reconciles `InferenceSet` and provisions inference pods. |
| `gpu-node-mocker` (E2E-only) | `kaito-system` | repo HEAD (`SHADOW_CONTROLLER_IMAGE`) | helm | Creates fake GPU nodes + shadow pods on CPU-only clusters. |
| Gateway API CRDs | cluster-scoped | `GATEWAY_API_VERSION` (v1.2.0) | kubectl | Required for `Gateway`, `HTTPRoute`, `ReferenceGrant`. |
| Istio control plane (`istiod`) | `istio-system` | `ISTIO_VERSION` (1.29.2) | istioctl | Implements the Gateway dataplane (Envoy) and the ext_proc filter chain. |
| GAIE CRDs | cluster-scoped | latest | kubectl | `InferencePool`, `InferenceObjective`. |
| BBR (Body-Based Router) | `istio-system` | `BBR_VERSION` (v1.3.1) | helm | Installed in Istio's `rootNamespace` so its `EnvoyFilter` applies cluster-wide; injects `X-Gateway-Model-Name`. |
| `llm-gateway-auth` (kaito-project/llm-gateway-auth) | `llm-gateway-auth` | `LLM_GATEWAY_AUTH_VERSION` | helm | API-key ext_authz for the inference gateway. Installs the `APIKey` CRD, the `apikey-operator` (reconciles `APIKey` → per-namespace Secret), and the `apikey-authz` ext_authz dataplane wired into Istio via `MeshConfig` + `AuthorizationPolicy`. |
| KEDA + KEDA Kaito Scaler (kaito-project/keda-kaito-scaler, optional) | `keda` | `KEDA_VERSION` (v2.19.0), `KEDA_KAITO_SCALER_VERSION` (v0.4.1) | helm | Workload-metric autoscaling. |
| `model-not-found` (Deployment + ConfigMap + Service) | `default` | repo HEAD (`hack/e2e/manifests/model-not-found.yaml`) | kubectl | Cluster-shared nginx-backed Service that returns OpenAI-compatible `404 model_not_found` JSON. Referenced cross-namespace by every workload namespace's catch-all `HTTPRoute` (authorised via a `ReferenceGrant` rendered by `charts/modelharness`). |
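To illustrate the catch-all behaviour, a request for a model with no `HTTPRoute` might return something like the sketch below; the exact JSON body is whatever `hack/e2e/manifests/model-not-found.yaml` configures, so treat this payload as an assumption:

```bash
curl -i http://my-namespace.gw.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "no-such-model", "messages": []}'
# HTTP/1.1 404 Not Found
#
# Illustrative OpenAI-style error body (the real payload lives in the ConfigMap):
# {"error": {"message": "The model 'no-such-model' does not exist",
#            "type": "invalid_request_error", "param": "model",
#            "code": "model_not_found"}}
```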
Provisioned by the `charts/modelharness` Helm chart. One Helm release per workload namespace owns every per-namespace shared resource — the Istio Gateway that fronts the namespace, the catch-all `HTTPRoute` (forwards unknown-model requests to the cluster-shared `default/model-not-found` Service), the `ReferenceGrant` authorising that cross-namespace backendRef, and — when API-key auth is enabled — the per-namespace `AuthorizationPolicy` and `APIKey` CR that wire that Gateway into the cluster-wide `apikey-ext-authz` CUSTOM provider. A namespace may host one or more model deployments, all of which share its Gateway:
| Resource | Where | Version | Source | Role |
|---|---|---|---|---|
| `Gateway` (gateway.networking.k8s.io/v1) | Per namespace | API `v1` | `charts/modelharness` | Public entry point; `gatewayClassName: istio`, HTTP/80. |
| Catch-all `HTTPRoute` `model-not-found-route` | Per namespace | API `v1` | `charts/modelharness` | Forwards unmatched paths on the namespace's Gateway to the cluster-shared `default/model-not-found` Service via a cross-namespace backendRef. |
| `ReferenceGrant` `allow-model-not-found-from-<ns>` | `default` | API `v1beta1` | `charts/modelharness` | Authorises the per-namespace catch-all `HTTPRoute` to reference `default/model-not-found` across namespaces. |
| `AuthorizationPolicy` `apikey-gateway-ext-authz` (auth-enabled) | Per namespace | `security.istio.io/v1` | `charts/modelharness` (`auth.enabled`) | Wires the per-namespace Gateway pod into the cluster-wide `apikey-ext-authz` CUSTOM provider (registered in `MeshConfig` by `llm-gateway-auth`). |
| `APIKey` `default` (auth-enabled) | Per namespace | `apikeys.kaito.sh/v1alpha1` | `charts/modelharness` (`auth.enabled`) | Triggers the `apikey-operator` to reconcile a Secret (`llm-api-key`) holding the bearer token clients send. |
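For orientation, the Gateway this chart renders would look roughly like the minimal sketch below; the resource name and hostname are illustrative, and only `gatewayClassName: istio` and HTTP/80 come from the table above:

```yaml
# Sketch only: metadata.name and the listener hostname are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway          # hypothetical name
  namespace: my-namespace
spec:
  gatewayClassName: istio          # from the chart
  listeners:
    - name: http
      protocol: HTTP
      port: 80                     # HTTP/80 entry point
      hostname: my-namespace.gw.example.com
```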
In the E2E suite the chart is installed and uninstalled by `EnsureNamespace` / `DeleteNamespace` (called from `InstallCase` / `UninstallCase` in `cases.go`). `helm uninstall modelharness` cleans up the cross-namespace `ReferenceGrant` automatically. The two auth-related resources are skipped when `auth.enabled=false`.
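Outside the E2E suite, the equivalent manual install and teardown might look like this; the namespace is a placeholder and `auth.enabled` is shown only to indicate where the flag goes:

```bash
# Step 2, per workload namespace. One release owns the Gateway, catch-all
# HTTPRoute, ReferenceGrant, and (when enabled) the two auth resources.
helm install modelharness charts/modelharness \
  --namespace my-namespace --create-namespace \
  --set auth.enabled=true

# Teardown also removes the cross-namespace ReferenceGrant in `default`:
helm uninstall modelharness --namespace my-namespace
```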
Provisioned by the `charts/modeldeployment` Helm chart. One Helm release per model deployment, parented to the namespace's Gateway:
| Resource | Version (chart-rendered) | Install method | Role |
|---|---|---|---|
| `InferenceSet` (kaito.sh/v1alpha1) | `v1alpha1` | helm | Reconciled by KAITO; renders inference pods running vLLM. |
| `InferencePool` (inference.networking.k8s.io/v1) | `v1` | helm | Selects the inference pods backing this deployment. |
| EPP `Deployment` + `Service` + RBAC + `ConfigMap` | `apps/v1`, `v1`, `rbac/v1` | helm | Endpoint Picker (`llm-d-inference-scheduler`) for KV-cache-aware routing. |
| `HTTPRoute` (gateway.networking.k8s.io/v1) | `v1` | helm | Matches `X-Gateway-Model-Name == <name>` on the namespace's Gateway and forwards to the `InferencePool`. |
The chart's `name` value is the per-deployment routing key; `model` is the underlying KAITO preset. See the `charts/modeldeployment` chart README for the full value schema and install examples.
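A sketch of step 3 under those values; the release name, namespace, and preset are placeholders, and the chart may require further values documented in its README:

```bash
# One release per model deployment, parented to the namespace's Gateway.
# `name` is the routing key matched by the HTTPRoute; `model` is the
# underlying KAITO preset (placeholder below).
helm install model-a charts/modeldeployment \
  --namespace my-namespace \
  --set name=model-a \
  --set model=<kaito-preset-name>
```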
A flat index of the CRD-backed resources Production Stack creates, grouped by the controller or chart that owns them. Kubernetes-native objects (`Deployment`, `Service`, `ConfigMap`, `ServiceAccount`, `Role` / `RoleBinding`, `Pod`, `Node`, …) are intentionally omitted: they are implementation details of the charts above.
| Resource (Kind) | Group / Version | Source | Purpose |
|---|---|---|---|
| `Workspace` | `kaito.sh/v1alpha1` | KAITO | Aggregates inference workloads (used indirectly via `InferenceSet`). |
| `InferenceSet` | `kaito.sh/v1alpha1` | KAITO | Declares one model deployment; KAITO renders inference pods. |
| `ReferenceGrant` | `gateway.networking.k8s.io/v1beta1` | Kubernetes Gateway API | Authorises each workload namespace's catch-all `HTTPRoute` to reference the cluster-shared `default/model-not-found` Service. |
| `InferencePool` | `inference.networking.k8s.io/v1` | Gateway API Inference Extension (GAIE) | GAIE pool selecting the inference pods backing a deployment. |
| `InferenceObjective` | `inference.networking.k8s.io/v1` | Gateway API Inference Extension (GAIE) | API object defining objective contracts; CRD only — not authored by this stack. |
| `APIKey` | `apikeys.kaito.sh/v1alpha1` | kaito-project/llm-gateway-auth | Declares an API key for a gateway namespace; the `apikey-operator` reconciles it into a Secret (`llm-api-key` by default) consumed by the `apikey-authz` ext_authz filter. |
| `Gateway` | `gateway.networking.k8s.io/v1` | Kubernetes Gateway API | Per-namespace public entry point; `gatewayClassName: istio`, HTTP/80. |
| `HTTPRoute` | `gateway.networking.k8s.io/v1` | Kubernetes Gateway API | Model-specific routes match `X-Gateway-Model-Name == <name>` → `InferencePool`; the per-namespace catch-all routes unmatched paths to the cluster-shared `default/model-not-found` Service for an OpenAI-compatible 404. |
| `EnvoyFilter` | `networking.istio.io/v1alpha3` | Istio | Rendered by BBR in Istio's `rootNamespace`; injects ext_proc into every Istio Gateway. |
| `AuthorizationPolicy` | `security.istio.io/v1` | Istio (rendered by `llm-gateway-auth`) | Targets the inference-gateway Pod and routes ext_authz to the `apikey-authz` provider so every request must carry a valid `APIKey`-derived bearer token. |
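For example, enabling auth in a namespace starts from an `APIKey` CR like the minimal sketch below; the group/version/kind, the name `default`, and the reconciled Secret name come from this page, while any spec fields are deliberately omitted because the real schema lives in kaito-project/llm-gateway-auth:

```yaml
# Minimal sketch: spec intentionally omitted; see the llm-gateway-auth CRD
# for the real schema.
apiVersion: apikeys.kaito.sh/v1alpha1
kind: APIKey
metadata:
  name: default            # the name rendered by charts/modelharness
  namespace: my-namespace
# The apikey-operator reconciles this CR into a Secret named llm-api-key;
# clients send its value as "Authorization: Bearer <token>".
```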
The E2E suite under `test/e2e/` exercises the full stack (Gateway → `llm-gateway-auth` (ext_authz) → BBR → EPP (`llm-d-inference-scheduler`) → vLLM shadow pod (`llm-d-inference-sim`)) against a live AKS cluster. Tests run as parallel `Ordered` Ginkgo `Describe`s, one per case namespace. See `test/e2e/README.md` for the full framework guide, the helper API, and the "Adding a new e2e test" workflow.
Production Stack is licensed under the Apache License 2.0.
