LLM Inference Benchmarking Tool

A production-style performance engineering toolkit for measuring and monitoring LLM inference latency, throughput, and concurrency behaviour — running entirely on local hardware via Ollama.

Built to demonstrate how performance engineering discipline applies to AI/ML inference systems: the same rigour you'd bring to a web service benchmark (percentile tracking, SLO gates, CI integration, dashboards) applied to the unique characteristics of autoregressive token generation.

Why This Matters

LLM inference has a fundamentally different performance profile from a typical REST API:

Concern	Traditional API	LLM Inference
Latency shape	Single response time	TTFT + streaming generation
Bottleneck	I/O, DB, network	GPU memory bandwidth, KV-cache
Concurrency model	Stateless horizontal scale	Request batching, KV-cache memory pressure
SLO design	p99 end-to-end	TTFT SLO + throughput floor

Understanding this distinction is what separates a performance engineer who can work on AI infra from one who cannot.

TTFT (Time to First Token) is the metric that maps directly to user-perceived latency in streaming chat interfaces. A 15-second generation with a 200ms TTFT feels responsive. A 2-second generation with a 1500ms TTFT feels broken.

Token throughput (tokens/s) is the primary capacity metric — it determines how many concurrent users a given model deployment can serve within your quality-of-service budget.
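
The two metrics above can be derived from a streamed response roughly as follows. This is a minimal sketch, not the toolkit's runner.py: `measure_stream` and its `clock` parameter are hypothetical names, and the `response`, `done`, and `eval_count` fields are assumed to match the NDJSON chunks that Ollama's streaming generate endpoint emits.

```python
import json
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    lines: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume NDJSON chunks one at a time, timing the first token and the total."""
    # In a real client, `start` would be taken just before the request is sent.
    start = clock()
    first_token_at: Optional[float] = None
    tokens = 0
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines between chunks
        chunk = json.loads(raw)
        if chunk.get("response") and first_token_at is None:
            first_token_at = clock()  # first chunk that carries generated text
        if chunk.get("done"):
            # Assumed field: the final chunk reports the generated token count.
            tokens = chunk.get("eval_count", tokens)
            break
    total_s = clock() - start
    return {
        "ttft_ms": None if first_token_at is None else (first_token_at - start) * 1000,
        "latency_ms": total_s * 1000,
        "tokens_per_s": tokens / total_s if total_s > 0 else 0.0,
    }
```

Because chunks are processed as they arrive, TTFT is recorded at the moment the first text shows up rather than after the whole response has been buffered. Injecting `clock` keeps the function testable without a running server.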

Architecture

┌──────────────────────────────────────────────────────┐
│                  main.py (CLI)                        │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐ │
│  │  runner.py   │  │  metrics.py  │  │reporter.py │ │
│  │  asyncio     │  │  percentile  │  │ JSON/HTML  │ │
│  │  aiohttp     │  │  aggregation │  │ Chart.js   │ │
│  └──────┬───────┘  └──────────────┘  └────────────┘ │
└─────────┼────────────────────────────────────────────┘
          │  NDJSON streaming (HTTP/1.1)
          ▼
    ┌───────────┐        ┌────────────┐      ┌─────────┐
    │  Ollama   │───────▶│ Prometheus │─────▶│ Grafana │
    │  :11434   │        │   :9090    │      │  :3000  │
    └───────────┘        └────────────┘      └─────────┘

Key design decisions:

asyncio + aiohttp for concurrent load generation — same mental model as k6 virtual users, but in Python without a separate binary
Streaming NDJSON parsing to capture TTFT accurately without buffering the full response
Semaphore-bounded concurrency so each load step models at most N simultaneous users rather than flooding the server with an unbounded number of asyncio tasks
Warm-up phase excluded from measurements to avoid cold-start bias such as first-request model loading and cold caches (the same idea as discarding the warm-up period in a JMeter test)
Pure computation layer (metrics.py) with no I/O so it's fully unit-testable without a running server
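
The semaphore and warm-up decisions in the list above can be sketched as follows. This is an illustration under assumptions rather than the code in runner.py: `run_level` and the `send` callback are hypothetical names, and `send` stands in for whatever coroutine issues one request and returns its latency.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_level(
    send: Callable[[int], Awaitable[float]],
    concurrency: int,
    requests: int,
    warmup: int = 0,
) -> List[float]:
    """Run `requests` measured calls with at most `concurrency` in flight."""
    # Warm-up calls run first, one at a time, and their results are discarded.
    for i in range(warmup):
        await send(-1 - i)

    limit = asyncio.Semaphore(concurrency)

    async def one(i: int) -> float:
        async with limit:  # waits while `concurrency` calls are already running
            return await send(i)

    # Every task is created up front; the semaphore gates how many proceed.
    return list(await asyncio.gather(*(one(i) for i in range(requests))))
```

`asyncio.gather` returns results in submission order, so the returned list lines up with request indices even though completion order may differ.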

Metrics Collected

Metric	Unit	Why It Matters
TTFT	ms	User-perceived response start latency
End-to-end latency	ms	Full generation cost; capacity planning
Token throughput	tokens/s	Primary capacity metric for serving
p50 / p95 / p99 latency	ms	Tail latency reveals worst-case user experience
Requests per second	req/s	Wall-clock throughput of the inference server
Error rate	%	Timeout/failure rate under load
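
To make the percentile rows concrete, here is a small standard-library sketch of the aggregation step. The project itself lists NumPy for this; the `percentile` and `summarise` helpers below are hypothetical and simply spell out the linear-interpolation definition, which is also the default method of `numpy.percentile`.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Linear-interpolation percentile for p in [0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0  # fractional index into the sorted data
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarise(latencies_ms: Sequence[float], failures: int = 0) -> Dict[str, float]:
    """Aggregate successful-request latencies plus a failure count."""
    total = len(latencies_ms) + failures
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "error_rate_pct": 100.0 * failures / total,
    }
```

With only 20 requests per level, p99 is interpolated between the two slowest samples, so it is best read as an indication of the tail rather than a precise estimate.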

Quick Start

Prerequisites

Python 3.11+
Ollama installed and running

# 1. Clone and install Python dependencies
git clone https://github.com/yourusername/llm-bench
cd llm-bench
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Pull the model (one-time, ~1.3 GB)
bash scripts/setup_ollama.sh

# 3. Run the benchmark
python main.py

Results are written to results/benchmark_latest.json and results/report_latest.html.

Docker Compose (full observability stack)

docker compose up -d
# Grafana → http://localhost:3000  (admin / llmbench)
# Prometheus → http://localhost:9090
# Ollama → http://localhost:11434

CLI Options

# Use a different config file
python main.py --config config/benchmark_config.yaml

# Override concurrency levels
python main.py --concurrency 1 --concurrency 10 --concurrency 50

# CI mode: exit 1 if any SLO is violated
python main.py --ci

# Override output directory
python main.py --output-dir /tmp/bench-results

Configuration

All tunables live in config/benchmark_config.yaml:

ollama:
  model: "llama3.2:1b"       # swap to tinyllama for faster CI runs
  timeout_seconds: 120

benchmark:
  concurrency_levels: [1, 5, 10, 20]   # the "load steps" — like k6's stages
  requests_per_level: 20
  warmup_requests: 3                    # excluded from results

slo:
  p95_latency_ms: 30000     # CI fails if p95 exceeds this
  p99_latency_ms: 60000
  min_throughput_tps: 2

Sample Results

Results below are from llama3.2:1b on a MacBook Pro M2 (CPU only, no GPU offload):

Concurrency	p50 Latency	p95 Latency	p99 Latency	TTFT p50	Throughput
1	8,200ms	11,400ms	12,100ms	180ms	14.2 tps
5	19,800ms	28,600ms	31,200ms	420ms	12.8 tps
10	38,400ms	52,000ms	58,100ms	810ms	11.1 tps
20	71,200ms	95,300ms	104,000ms	1,600ms	9.4 tps

How to read this:

p50 latency scales roughly linearly with concurrency — this is expected for a CPU-bound single-process server with no request batching (Ollama's default mode). A GPU-backed vLLM deployment with continuous batching would show much flatter scaling.
TTFT increases with concurrency because queued requests wait longer before the model starts generating their response — this is the prefill queue building up.
Throughput drops under high concurrency (14.2 → 9.4 tps), most likely because context-switching overhead increases and memory bandwidth is shared across more simultaneous KV-caches.
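The p99/p95 ratio is wider at concurrency 5 and above (about 1.09 to 1.12) than at concurrency 1 (about 1.06), which is consistent with a queuing system approaching saturation. Queueing theory predicts this shape: in a simple M/M/1 model, mean response time scales as 1/(1 − ρ), so latency climbs disproportionately once utilisation ρ passes roughly 70%. Little's Law (L = λW) then ties that latency to the number of requests in flight.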

What a PE would do next: Plot the knee of the curve to find the concurrency level where p95 starts diverging from p50. That's your practical concurrency ceiling for this SLO.
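
One simple numeric reading of "practical concurrency ceiling" is the highest tested level whose p95 still meets the SLO. The sketch below applies that to the sample table, using the p95_latency_ms: 30000 value from the configuration section. `concurrency_ceiling` is a hypothetical helper, not part of the toolkit.

```python
from typing import Optional, Sequence, Tuple

# (concurrency, p95 latency in ms) copied from the Sample Results table.
SAMPLE_P95 = [(1, 11_400), (5, 28_600), (10, 52_000), (20, 95_300)]


def concurrency_ceiling(
    rows: Sequence[Tuple[int, float]], p95_slo_ms: float
) -> Optional[int]:
    """Return the highest tested concurrency before the first SLO breach."""
    ceiling = None
    for concurrency, p95_ms in sorted(rows):
        if p95_ms > p95_slo_ms:
            break  # stop at the first level that breaches the SLO
        ceiling = concurrency
    return ceiling


print(concurrency_ceiling(SAMPLE_P95, p95_slo_ms=30_000))  # prints 5
```

The result is only as fine-grained as the tested load steps: the true ceiling here lies somewhere between 5 and 10, so adding intermediate concurrency levels would narrow it down.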

CI/CD Integration

Every push to main runs a lightweight benchmark against tinyllama and fails the build if p95 latency exceeds the configured threshold. This catches inference latency regressions before they reach production, using the same gate pattern you'd use for API performance budgets.

GitHub Push
    │
    ▼
Unit Tests (no Ollama) ──fail──▶ ✗ Block merge
    │
    ▼ pass
Benchmark (tinyllama, c=[1,5])
    │
    ├─ p95 > 60s? ──▶ ✗ Fail build + upload results artifact
    │
    └─ p95 ≤ 60s? ──▶ ✓ Upload results artifact

Benchmark results are uploaded as workflow artifacts and (for PRs) posted as a comment.
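
The gate itself can be expressed as a pure function, which is what makes it unit-testable without Ollama. This is a sketch under assumptions: the SLO keys mirror the slo block in the configuration file, while `check_slos`, `exit_code`, and the metric key names are hypothetical.

```python
from typing import Dict, List


def check_slos(metrics: Dict[str, float], slo: Dict[str, float]) -> List[str]:
    """Return one message per violated SLO; an empty list means the run passed."""
    violations = []
    if metrics["p95_ms"] > slo["p95_latency_ms"]:
        violations.append(
            f"p95 {metrics['p95_ms']:.0f} ms exceeds {slo['p95_latency_ms']:.0f} ms"
        )
    if metrics["p99_ms"] > slo["p99_latency_ms"]:
        violations.append(
            f"p99 {metrics['p99_ms']:.0f} ms exceeds {slo['p99_latency_ms']:.0f} ms"
        )
    if metrics["throughput_tps"] < slo["min_throughput_tps"]:
        violations.append(
            f"throughput {metrics['throughput_tps']:.1f} tps is below "
            f"{slo['min_throughput_tps']:.1f} tps"
        )
    return violations


def exit_code(violations: List[str], ci_mode: bool) -> int:
    """Only CI mode turns violations into a failing process exit status."""
    return 1 if (ci_mode and violations) else 0
```

Returning messages instead of a bare boolean lets the same function feed both the exit status and the text of a PR comment.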

Project Structure

llm-bench/
├── benchmarks/
│   ├── metrics.py      # Pure computation: percentiles, aggregation, SLO checks
│   ├── runner.py       # Async load generator (NDJSON stream parsing, concurrency sweep)
│   └── reporter.py     # JSON + HTML report generation
├── config/
│   ├── benchmark_config.yaml   # All tunables in one place
│   └── prometheus.yml          # Prometheus scrape config
├── dashboards/
│   └── grafana/llm_bench.json  # Importable Grafana dashboard
├── tests/
│   └── test_metrics.py         # Unit tests (no network required)
├── .github/workflows/
│   └── benchmark-ci.yml        # CI pipeline with SLO gate
├── scripts/setup_ollama.sh     # One-shot environment setup
├── docker-compose.yml          # Ollama + Prometheus + Grafana stack
├── main.py                     # CLI entry point
└── requirements.txt

Extending This Tool

Add a new metric: Add a field to RequestResult in metrics.py, populate it in runner.py, include it in AggregatedMetrics, and surface it in reporter.py.

Test a different model: Change ollama.model in the config. Ollama supports Mistral, Phi-3, Gemma, and many others.

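Push metrics to Prometheus: The prometheus-client library is already installed. After run_benchmark() returns, push the aggregated metrics to a Prometheus Pushgateway, for example with prometheus_client.push_to_gateway. This assumes a Pushgateway is running alongside Prometheus.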

Add p-value statistical comparison between runs: Load the JSON results from two runs (for example, before and after a model update) and use scipy.stats.mannwhitneyu to test whether the change produced a statistically significant latency difference.
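
For intuition about what that test computes, here is a standard-library sketch of the Mann-Whitney U statistic with a two-sided p-value from the textbook normal approximation. It is not SciPy's implementation: it applies no tie or continuity correction, so its p-values will differ somewhat, and scipy.stats.mannwhitneyu remains the practical choice. `mann_whitney_u` is a hypothetical helper that takes two plain lists of latencies.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for sample a, approximate two-sided p-value)."""
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # U counts the pairs in which a's value exceeds b's; ties count as half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean = n1 * n2 / 2.0  # expected U if both samples share a distribution
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u, p
```

A rank-based test suits latency data because it makes no normality assumption and is not dominated by a few extreme tail samples. The normal approximation is rough for very small samples, so treat borderline p-values with caution.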

Tech Stack

Layer	Tool	Rationale
LLM server	Ollama	Local, free, supports 50+ models
Load generation	Python asyncio + aiohttp	Native async concurrency, no binary dependency
Metrics math	NumPy	Vectorised percentile computation
CLI	Click + Rich	Professional terminal UX
Reporting	Jinja2 + Chart.js	Zero-build-step HTML dashboards
Observability	Prometheus + Grafana	Industry-standard stack
CI	GitHub Actions	Free tier, workflow artifacts, PR comments
