A production-style performance engineering toolkit for measuring and monitoring LLM inference latency, throughput, and concurrency behaviour — running entirely on local hardware via Ollama.
Built to demonstrate how performance engineering discipline applies to AI/ML inference systems: the same rigour you'd bring to a web service benchmark (percentile tracking, SLO gates, CI integration, dashboards) applied to the unique characteristics of autoregressive token generation.
LLM inference has a fundamentally different performance profile from a typical REST API:
| Concern | Traditional API | LLM Inference |
|---|---|---|
| Latency shape | Single response time | TTFT + streaming generation |
| Bottleneck | I/O, DB, network | GPU memory bandwidth, KV-cache |
| Concurrency model | Stateless horizontal scale | Batching, attention mechanisms |
| SLO design | p99 end-to-end | TTFT SLO + throughput floor |
Understanding this distinction is what separates a performance engineer who can work on AI infra from one who cannot.
TTFT (Time to First Token) is the metric that maps directly to user-perceived latency in streaming chat interfaces. A 15-second generation with a 200ms TTFT feels responsive. A 2-second generation with a 1500ms TTFT feels broken.
Token throughput (tokens/s) is the primary capacity metric — it determines how many concurrent users a given model deployment can serve within your quality-of-service budget.
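As a rough illustration of how these two metrics fall out of a single streamed response, the sketch below walks a sequence of timestamped NDJSON lines. It assumes each line is a JSON object with `response` and `done` keys (the shape of Ollama's streaming generate endpoint), and it treats each non-empty chunk as one token, which is only an approximation. The function name `summarise_stream` and the sample data are hypothetical, not taken from the project.

```python
import json
from typing import Iterable, Optional, Tuple


def summarise_stream(start_s: float, chunks: Iterable[Tuple[float, str]]) -> dict:
    """Derive TTFT, end-to-end latency and a rough token rate from one stream.

    `chunks` yields (arrival_time_s, ndjson_line) pairs. In a real client the
    timestamp would be taken (e.g. with time.perf_counter()) as each line arrives.
    """
    ttft_ms: Optional[float] = None
    end_s = start_s
    n_chunks = 0
    for arrived_s, line in chunks:
        if not line.strip():
            continue  # tolerate blank lines in the stream
        msg = json.loads(line)
        if msg.get("response"):
            if ttft_ms is None:
                # First chunk carrying generated text marks time to first token.
                ttft_ms = (arrived_s - start_s) * 1000.0
            n_chunks += 1  # assumption: one chunk is roughly one token
        end_s = arrived_s
    total_s = end_s - start_s
    return {
        "ttft_ms": ttft_ms,
        "latency_ms": total_s * 1000.0,
        "chunks": n_chunks,
        "tokens_per_s": n_chunks / total_s if total_s > 0 else 0.0,
    }


# Synthetic stream: request sent at t=0.0s, three lines arrive afterwards.
sample = [
    (0.2, '{"response": "Hel", "done": false}'),
    (0.7, '{"response": "lo", "done": false}'),
    (1.2, '{"response": "", "done": true}'),
]
summary = summarise_stream(0.0, sample)
print(summary)
```

Because the function only sees timestamps and lines, it can be exercised without a running server; the real runner would feed it lines as they are read off the HTTP response.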
┌──────────────────────────────────────────────────────┐
│                    main.py (CLI)                     │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐  │
│  │  runner.py   │  │  metrics.py  │  │reporter.py │  │
│  │  asyncio     │  │  percentile  │  │ JSON/HTML  │  │
│  │  aiohttp     │  │  aggregation │  │ Chart.js   │  │
│  └──────┬───────┘  └──────────────┘  └────────────┘  │
└─────────┼────────────────────────────────────────────┘
          │ NDJSON streaming (HTTP/1.1)
          ▼
    ┌───────────┐        ┌────────────┐      ┌─────────┐
    │  Ollama   │───────▶│ Prometheus │─────▶│ Grafana │
    │  :11434   │        │   :9090    │      │  :3000  │
    └───────────┘        └────────────┘      └─────────┘
Key design decisions:
- asyncio + aiohttp for concurrent load generation: the same mental model as k6 virtual users, but in Python and without a separate binary
- Streaming NDJSON parsing to capture TTFT accurately without buffering the full response
- Semaphore-bounded concurrency so we model exactly N simultaneous users rather than flooding the server with unbounded coroutines
- Warm-up phase excluded from measurements to avoid cold-start bias such as first-request model loading and cold caches (the same principle as discarding warm-up samples in a JMeter run)
- Pure computation layer (metrics.py) with no I/O so it's fully unit-testable without a running server
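The semaphore-bounded pattern from the list above can be sketched with the standard library alone. This is a minimal illustration, not the project's runner: `run_bounded` is a hypothetical name, and `asyncio.sleep` stands in for the aiohttp request so the example runs without a server.

```python
import asyncio


async def run_bounded(n_requests: int, concurrency: int, delay_s: float = 0.01) -> int:
    """Run n_requests fake requests with at most `concurrency` in flight.

    Returns the peak number of simultaneously active requests that was observed.
    """
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def one_request() -> None:
        nonlocal in_flight, peak
        async with sem:  # blocks while `concurrency` requests are already active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(delay_s)  # stand-in for the streamed HTTP call
            in_flight -= 1

    # All tasks are created up front; the semaphore decides how many proceed.
    await asyncio.gather(*(one_request() for _ in range(n_requests)))
    return peak


peak = asyncio.run(run_bounded(n_requests=20, concurrency=5))
print(f"peak in-flight requests: {peak}")
```

Creating every task immediately and gating on the semaphore keeps the offered load pinned at N for the whole level, which is what makes the per-level percentiles comparable.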
| Metric | Unit | Why It Matters |
|---|---|---|
| TTFT | ms | User-perceived response start latency |
| End-to-end latency | ms | Full generation cost; capacity planning |
| Token throughput | tokens/s | Primary capacity metric for serving |
| p50 / p95 / p99 latency | ms | Tail latency reveals worst-case user experience |
| Requests per second | req/s | Wall-clock throughput of the inference server |
| Error rate | % | Timeout/failure rate under load |
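To make the aggregation concrete, here is a standard-library sketch of percentile and error-rate computation. The project itself lists NumPy for this; the `percentile` function below uses the same linear-interpolation rule as NumPy's default, written out by hand so the example has no dependencies. `Sample` is a hypothetical stand-in for the project's RequestResult, whose real fields are not shown in this README.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:  # hypothetical stand-in for RequestResult
    latency_ms: float
    ok: bool


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def aggregate(samples: List[Sample]) -> dict:
    """Summarise one concurrency level; failed requests only count toward error rate."""
    latencies = [s.latency_ms for s in samples if s.ok]
    failed = sum(1 for s in samples if not s.ok)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate_pct": 100.0 * failed / len(samples),
    }


# Synthetic data: five successful requests and one failure.
samples = [Sample(ms, True) for ms in (100, 200, 300, 400, 500)] + [Sample(0.0, False)]
stats = aggregate(samples)
print(stats)
```

Keeping failed requests out of the latency percentiles but in the error rate avoids timeouts either hiding or inflating the tail.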
- Python 3.11+
- Ollama installed and running
# 1. Clone and install Python dependencies
git clone https://github.com/yourusername/llm-bench
cd llm-bench
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 2. Pull the model (one-time, ~1.3 GB)
bash scripts/setup_ollama.sh
# 3. Run the benchmark
python main.py

Results are written to results/benchmark_latest.json and results/report_latest.html.
docker compose up -d
# Grafana → http://localhost:3000 (admin / llmbench)
# Prometheus → http://localhost:9090
# Ollama → http://localhost:11434

# Use a different config file
python main.py --config config/benchmark_config.yaml
# Override concurrency levels
python main.py --concurrency 1 --concurrency 10 --concurrency 50
# CI mode: exit 1 if any SLO is violated
python main.py --ci
# Override output directory
python main.py --output-dir /tmp/bench-results

All tunables live in config/benchmark_config.yaml:
ollama:
  model: "llama3.2:1b"  # swap to tinyllama for faster CI runs
  timeout_seconds: 120
benchmark:
  concurrency_levels: [1, 5, 10, 20]  # the "load steps", like k6's stages
  requests_per_level: 20
  warmup_requests: 3  # excluded from results
slo:
  p95_latency_ms: 30000  # CI fails if p95 exceeds this
  p99_latency_ms: 60000
  min_throughput_tps: 2

Results below are from llama3.2:1b on a MacBook Pro M2 (CPU only, no GPU offload):
| Concurrency | p50 Latency | p95 Latency | p99 Latency | TTFT p50 | Throughput |
|---|---|---|---|---|---|
| 1 | 8,200ms | 11,400ms | 12,100ms | 180ms | 14.2 tps |
| 5 | 19,800ms | 28,600ms | 31,200ms | 420ms | 12.8 tps |
| 10 | 38,400ms | 52,000ms | 58,100ms | 810ms | 11.1 tps |
| 20 | 71,200ms | 95,300ms | 104,000ms | 1,600ms | 9.4 tps |
How to read this:
- p50 latency scales roughly linearly with concurrency — this is expected for a CPU-bound single-process server with no request batching (Ollama's default mode). A GPU-backed vLLM deployment with continuous batching would show much flatter scaling.
- TTFT increases with concurrency because queued requests wait longer before the model starts generating their response — this is the prefill queue building up.
- Throughput drops under high concurrency (14.2 → 9.4 tps) because context-switching overhead increases and memory bandwidth is shared across more simultaneous KV-caches.
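- The p99/p95 ratio widens somewhat as concurrency rises (about 1.06 at c=1 versus 1.09 to 1.12 at c=5 and above), which is consistent with a queuing system approaching saturation. In simple queuing models such as M/M/1, time in the system grows in proportion to 1/(1 - ρ) as utilisation ρ approaches 1, so tail latency blows up disproportionately once utilisation gets high (a common rule of thumb puts the knee around 70%). Little's Law (L = λW) then ties that growing latency to the number of requests in flight.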
What a PE would do next: Plot the knee of the curve to find the concurrency level where p95 starts diverging from p50. That's your practical concurrency ceiling for this SLO.
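As a starting point for that analysis, the sketch below takes the p50 and p95 columns from the results table and computes two things: the p95/p50 ratio at each level, and the highest concurrency that still meets a given p95 SLO. The function names are hypothetical, and the 30,000 ms threshold is the p95_latency_ms value from the example config.

```python
from typing import Dict, List, Optional, Tuple

# (concurrency, p50_ms, p95_ms) rows copied from the results table.
ROWS: List[Tuple[int, int, int]] = [
    (1, 8200, 11400),
    (5, 19800, 28600),
    (10, 38400, 52000),
    (20, 71200, 95300),
]


def tail_ratios(rows: List[Tuple[int, int, int]]) -> Dict[int, float]:
    """p95/p50 per concurrency level; a rising ratio signals tail divergence."""
    return {c: round(p95 / p50, 2) for c, p50, p95 in rows}


def concurrency_ceiling(rows: List[Tuple[int, int, int]], p95_slo_ms: float) -> Optional[int]:
    """Highest tested concurrency whose p95 stays within the SLO, or None."""
    passing = [c for c, _p50, p95 in rows if p95 <= p95_slo_ms]
    return max(passing) if passing else None


ratios = tail_ratios(ROWS)
ceiling = concurrency_ceiling(ROWS, p95_slo_ms=30000)
print(ratios)
print(f"ceiling for a 30s p95 SLO: c={ceiling}")
```

On this particular dataset the p95/p50 ratio stays in a narrow band (roughly 1.34 to 1.44), so the absolute p95 threshold, rather than a visible divergence, is what sets the ceiling at c=5. A finer sweep between 5 and 10 would locate it more precisely.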
Every push to main runs a lightweight benchmark against tinyllama and fails the build if p95 latency exceeds the configured threshold. This prevents inference regressions from landing in production — the same gate pattern you'd use for API performance budgets.
GitHub Push
│
▼
Unit Tests (no Ollama) ──fail──▶ ✗ Block merge
│
▼ pass
Benchmark (tinyllama, c=[1,5])
│
├─ p95 > 60s? ──▶ ✗ Fail build + upload results artifact
│
└─ p95 ≤ 60s? ──▶ ✓ Upload results artifact
Benchmark results are uploaded as workflow artifacts and (for PRs) posted as a comment.
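The gate logic itself is small. The following is a hedged sketch of how an SLO check could map to a process exit code; it is not the project's implementation. The SLO keys match the example config, while the metric key names (`p95_latency_ms`, `p99_latency_ms`, `throughput_tps`) and the function names are assumptions made for illustration. The sample metrics are the c=5 and c=10 rows from the results table.

```python
from typing import Dict, List


def check_slos(metrics: Dict[str, float], slo: Dict[str, float]) -> List[str]:
    """Return a human-readable message for every violated SLO (empty if none)."""
    violations = []
    if metrics["p95_latency_ms"] > slo["p95_latency_ms"]:
        violations.append(
            f"p95 {metrics['p95_latency_ms']}ms exceeds {slo['p95_latency_ms']}ms"
        )
    if metrics["p99_latency_ms"] > slo["p99_latency_ms"]:
        violations.append(
            f"p99 {metrics['p99_latency_ms']}ms exceeds {slo['p99_latency_ms']}ms"
        )
    if metrics["throughput_tps"] < slo["min_throughput_tps"]:
        violations.append(
            f"throughput {metrics['throughput_tps']} tps below {slo['min_throughput_tps']} tps"
        )
    return violations


def exit_code(levels: List[Dict[str, float]], slo: Dict[str, float]) -> int:
    """1 if any concurrency level violates any SLO, else 0 (the CI convention)."""
    return 1 if any(check_slos(m, slo) for m in levels) else 0


SLO = {"p95_latency_ms": 30000, "p99_latency_ms": 60000, "min_throughput_tps": 2}
c5 = {"p95_latency_ms": 28600, "p99_latency_ms": 31200, "throughput_tps": 12.8}
c10 = {"p95_latency_ms": 52000, "p99_latency_ms": 58100, "throughput_tps": 11.1}

print(check_slos(c5, SLO))
print(check_slos(c10, SLO))
```

A CLI entry point would typically finish with `sys.exit(exit_code(levels, slo))` when run in CI mode, so that the workflow step fails on any violation.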
llm-bench/
├── benchmarks/
│ ├── metrics.py # Pure computation: percentiles, aggregation, SLO checks
│ ├── runner.py # Async load generator (NDJSON stream parsing, concurrency sweep)
│ └── reporter.py # JSON + HTML report generation
├── config/
│ ├── benchmark_config.yaml # All tunables in one place
│ └── prometheus.yml # Prometheus scrape config
├── dashboards/
│ └── grafana/llm_bench.json # Importable Grafana dashboard
├── tests/
│ └── test_metrics.py # Unit tests (no network required)
├── .github/workflows/
│ └── benchmark-ci.yml # CI pipeline with SLO gate
├── scripts/setup_ollama.sh # One-shot environment setup
├── docker-compose.yml # Ollama + Prometheus + Grafana stack
├── main.py # CLI entry point
└── requirements.txt
Add a new metric: Add a field to RequestResult in metrics.py, populate it in runner.py, include it in AggregatedMetrics, and surface it in reporter.py.
Test a different model: Change ollama.model in the config. Ollama supports Mistral, Phi-3, Gemma, and many others.
Push metrics to Prometheus: The prometheus-client library is already installed. Wrap run_benchmark() in a push-gateway call after each run.
Add p-value statistical comparison between runs: Load two benchmark_latest.json files and use scipy.stats.mannwhitneyu to test whether a model update produced a statistically significant latency change.
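For readers who want to see what the Mann-Whitney comparison computes, here is a dependency-free sketch of the two-sided test using the normal approximation, without tie or continuity correction. It is a simplified stand-in for `scipy.stats.mannwhitneyu`, which should be preferred for real analysis; the function name and the latency samples are synthetic and purely illustrative.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U, two-sided p) via the normal approximation.

    Simplified: tied values get average ranks, but the variance is not
    tie-corrected and no continuity correction is applied.
    """
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    i, n = 0, len(combined)
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u, p


# Synthetic latency samples (ms) for a "before" and "after" run.
before = [float(x) for x in range(100, 120)]
after = [float(x) for x in range(200, 220)]
u_stat, p_value = mann_whitney_u(before, after)
print(f"U={u_stat}, p={p_value:.2e}")
```

A rank-based test suits latency data because it makes no normality assumption, and latency distributions are usually right-skewed.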
| Layer | Tool | Rationale |
|---|---|---|
| LLM server | Ollama | Local, free, supports 50+ models |
| Load generation | Python asyncio + aiohttp | Native async concurrency, no binary dependency |
| Metrics math | NumPy | Vectorised percentile computation |
| CLI | Click + Rich | Professional terminal UX |
| Reporting | Jinja2 + Chart.js | Zero-build-step HTML dashboards |
| Observability | Prometheus + Grafana | Industry-standard stack |
| CI | GitHub Actions | Free tier, workflow artifacts, PR comments |