Amritasha/llm-bench

LLM Inference Benchmarking Tool

A production-style performance engineering toolkit for measuring and monitoring LLM inference latency, throughput, and concurrency behaviour — running entirely on local hardware via Ollama.

Built to demonstrate how performance engineering discipline applies to AI/ML inference systems: the same rigour you'd bring to a web service benchmark (percentile tracking, SLO gates, CI integration, dashboards) applied to the unique characteristics of autoregressive token generation.


Why This Matters

LLM inference has a fundamentally different performance profile from a typical REST API:

| Concern | Traditional API | LLM Inference |
|---|---|---|
| Latency shape | Single response time | TTFT + streaming generation |
| Bottleneck | I/O, DB, network | GPU memory bandwidth, KV-cache |
| Concurrency model | Stateless horizontal scale | Batching, attention mechanisms |
| SLO design | p99 end-to-end | TTFT SLO + throughput floor |

Understanding this distinction is what separates a performance engineer who can work on AI infra from one who cannot.

TTFT (Time to First Token) is the metric that maps directly to user-perceived latency in streaming chat interfaces. A 15-second generation with a 200ms TTFT feels responsive. A 2-second generation with a 1500ms TTFT feels broken.

Token throughput (tokens/s) is the primary capacity metric — it determines how many concurrent users a given model deployment can serve within your quality-of-service budget.
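To make the two metrics concrete, here is a minimal Python sketch of how they can be derived from three timestamps per request. The `RequestTiming` class and its field names are illustrative assumptions, not the tool's actual data model, and the token counts are invented for the example. Measuring throughput over the decode phase only (after the first token) is one of several reasonable conventions.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request record; timestamps are seconds on a monotonic clock."""
    sent_at: float
    first_token_at: float
    finished_at: float
    tokens_generated: int


def ttft_ms(t: RequestTiming) -> float:
    """Time to first token: how long the user waits before text starts appearing."""
    return (t.first_token_at - t.sent_at) * 1000.0


def end_to_end_ms(t: RequestTiming) -> float:
    """Full request cost, from send to final token."""
    return (t.finished_at - t.sent_at) * 1000.0


def tokens_per_second(t: RequestTiming) -> float:
    """Decode-phase throughput (assumption: time before the first token is excluded)."""
    decode_seconds = t.finished_at - t.first_token_at
    return t.tokens_generated / decode_seconds if decode_seconds > 0 else 0.0


# The two scenarios described above; token counts are made up for illustration.
responsive = RequestTiming(sent_at=0.0, first_token_at=0.2, finished_at=15.0, tokens_generated=200)
sluggish = RequestTiming(sent_at=0.0, first_token_at=1.5, finished_at=2.0, tokens_generated=25)

for name, timing in (("responsive", responsive), ("sluggish", sluggish)):
    print(f"{name}: TTFT={ttft_ms(timing):.0f} ms, "
          f"end-to-end={end_to_end_ms(timing):.0f} ms, "
          f"throughput={tokens_per_second(timing):.1f} tok/s")
```

The point of keeping the three timestamps separate is that TTFT and throughput can move independently: a request can finish quickly overall and still feel slow if most of its time is spent before the first token.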


Architecture

┌──────────────────────────────────────────────────────┐
│                  main.py (CLI)                        │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐ │
│  │  runner.py   │  │  metrics.py  │  │reporter.py │ │
│  │  asyncio     │  │  percentile  │  │ JSON/HTML  │ │
│  │  aiohttp     │  │  aggregation │  │ Chart.js   │ │
│  └──────┬───────┘  └──────────────┘  └────────────┘ │
└─────────┼────────────────────────────────────────────┘
          │  NDJSON streaming (HTTP/1.1)
          ▼
    ┌───────────┐        ┌────────────┐      ┌─────────┐
    │  Ollama   │───────▶│ Prometheus │─────▶│ Grafana │
    │  :11434   │        │   :9090    │      │  :3000  │
    └───────────┘        └────────────┘      └─────────┘
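
The NDJSON stream in the diagram can be turned into timing data roughly as follows. This is a sketch, not the code in runner.py: it assumes the stream follows the shape of Ollama's `/api/generate` responses (one JSON object per line with a `response` text fragment, a `done` flag, and an `eval_count` on the final object), and `parse_ndjson_stream` is a hypothetical helper name. Arrival times are passed in explicitly so the logic can be exercised without a server; in a real runner they would come from a monotonic clock read as each line arrives.

```python
import json


def parse_ndjson_stream(timed_lines, sent_at):
    """Derive TTFT and end-to-end latency from (arrival_time_seconds, raw_line) pairs."""
    first_token_at = None
    finished_at = None
    pieces = []
    eval_count = None

    for arrived_at, raw_line in timed_lines:
        raw_line = raw_line.strip()
        if not raw_line:
            continue  # tolerate blank lines between objects
        chunk = json.loads(raw_line)
        piece = chunk.get("response", "")
        if piece and first_token_at is None:
            first_token_at = arrived_at  # first non-empty fragment marks TTFT
        pieces.append(piece)
        if chunk.get("done"):
            finished_at = arrived_at
            eval_count = chunk.get("eval_count")
            break

    def to_ms(timestamp):
        return None if timestamp is None else (timestamp - sent_at) * 1000.0

    return {
        "text": "".join(pieces),
        "ttft_ms": to_ms(first_token_at),
        "latency_ms": to_ms(finished_at),
        "eval_count": eval_count,
    }


# Simulated stream: three lines arriving 0.18 s, 0.25 s and 1.0 s after the request.
stream = [
    (0.18, '{"response": "Hel", "done": false}'),
    (0.25, '{"response": "lo", "done": false}'),
    (1.00, '{"response": "", "done": true, "eval_count": 2}'),
]
result = parse_ndjson_stream(stream, sent_at=0.0)
print(result)
```

Processing line by line is what makes the TTFT measurement honest: the timestamp is taken when the first fragment arrives, not after the whole body has been buffered.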

Key design decisions:

  • asyncio + aiohttp for concurrent load generation — same mental model as k6 virtual users, but in Python without a separate binary
  • Streaming NDJSON parsing to capture TTFT accurately without buffering the full response
  • Semaphore-bounded concurrency so we model exactly N simultaneous users rather than flooding the server with unbounded coroutines
  • Warm-up phase excluded from measurements to avoid cold-start bias such as first-request model loading and cold caches (the same reason results from JMeter's ramp-up period are usually discarded)
  • Pure computation layer (metrics.py) with no I/O so it's fully unit-testable without a running server
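
The semaphore and warm-up decisions can be sketched with the standard library alone. The function and class names below (`run_level`, `InFlightTracker`) are hypothetical and the request is faked with `asyncio.sleep`, so this illustrates the pattern under those assumptions rather than reproducing runner.py.

```python
import asyncio


async def run_level(request_fn, concurrency, total_requests, warmup_requests=0):
    """Run total_requests calls with at most `concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(index):
        async with semaphore:  # blocks once `concurrency` requests are active
            return await request_fn(index)

    # Warm-up requests run first and their results are deliberately discarded.
    for i in range(warmup_requests):
        await request_fn(-1 - i)

    # gather() preserves submission order in the returned list.
    return await asyncio.gather(*(bounded(i) for i in range(total_requests)))


class InFlightTracker:
    """Fake request that records the peak number of simultaneous calls."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    async def fake_request(self, index):
        self.current += 1
        self.peak = max(self.peak, self.current)
        await asyncio.sleep(0.01)  # stands in for the HTTP round trip
        self.current -= 1
        return index


tracker = InFlightTracker()
results = asyncio.run(run_level(tracker.fake_request, concurrency=3,
                                total_requests=10, warmup_requests=2))
print(f"completed={len(results)} peak_in_flight={tracker.peak}")
```

All ten tasks are created up front, but the semaphore lets only three proceed at a time, which is what keeps the measured concurrency level equal to the configured one.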

Metrics Collected

| Metric | Unit | Why It Matters |
|---|---|---|
| TTFT | ms | User-perceived response start latency |
| End-to-end latency | ms | Full generation cost; capacity planning |
| Token throughput | tokens/s | Primary capacity metric for serving |
| p50 / p95 / p99 latency | ms | Tail latency reveals worst-case user experience |
| Requests per second | req/s | Wall-clock throughput of the inference server |
| Error rate | % | Timeout/failure rate under load |
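
As a rough illustration of how these aggregates can be computed, the sketch below implements linearly interpolated percentiles in pure Python (the same rule as NumPy's default percentile method) plus request rate and error rate. The tool itself uses NumPy; the function names, dictionary keys, and the choice to count only successful requests towards req/s are assumptions made for this example.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarise(latencies_ms, failed_requests, wall_clock_seconds):
    """Aggregate one concurrency level; latencies_ms holds successful requests only."""
    total = len(latencies_ms) + failed_requests
    return {
        "p50_latency_ms": percentile(latencies_ms, 50),
        "p95_latency_ms": percentile(latencies_ms, 95),
        "p99_latency_ms": percentile(latencies_ms, 99),
        "requests_per_second": len(latencies_ms) / wall_clock_seconds,
        "error_rate_pct": 100.0 * failed_requests / total,
    }


# Made-up latencies for illustration only.
summary = summarise([100, 200, 300, 400, 500], failed_requests=0, wall_clock_seconds=10.0)
print(summary)
```

With small samples the upper percentiles are interpolated between the slowest few requests: at 20 requests per level, p99 falls between the two largest values, so it behaves almost like the maximum.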

Quick Start

Prerequisites

  • Python 3.11+
  • Ollama installed and running

# 1. Clone and install Python dependencies
git clone https://github.com/yourusername/llm-bench
cd llm-bench
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Pull the model (one-time, ~1.3 GB)
bash scripts/setup_ollama.sh

# 3. Run the benchmark
python main.py

Results are written to results/benchmark_latest.json and results/report_latest.html.

Docker Compose (full observability stack)

docker compose up -d
# Grafana → http://localhost:3000  (admin / llmbench)
# Prometheus → http://localhost:9090
# Ollama → http://localhost:11434

CLI Options

# Use a different config file
python main.py --config config/benchmark_config.yaml

# Override concurrency levels
python main.py --concurrency 1 --concurrency 10 --concurrency 50

# CI mode: exit 1 if any SLO is violated
python main.py --ci

# Override output directory
python main.py --output-dir /tmp/bench-results

Configuration

All tunables live in config/benchmark_config.yaml:

ollama:
  model: "llama3.2:1b"       # swap to tinyllama for faster CI runs
  timeout_seconds: 120

benchmark:
  concurrency_levels: [1, 5, 10, 20]   # the "load steps" — like k6's stages
  requests_per_level: 20
  warmup_requests: 3                    # excluded from results

slo:
  p95_latency_ms: 30000     # CI fails if p95 exceeds this
  p99_latency_ms: 60000
  min_throughput_tps: 2
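
A minimal sketch of how the `slo` block could drive the `--ci` exit code. The SLO keys match the configuration above; the metric keys (`p95_latency_ms`, `p99_latency_ms`, `throughput_tps`) and the function names are assumptions for illustration, not the tool's actual interface.

```python
def check_slos(metrics, slo):
    """Return a list of human-readable SLO violations (empty means all SLOs pass)."""
    violations = []
    if metrics["p95_latency_ms"] > slo["p95_latency_ms"]:
        violations.append(
            f"p95 latency {metrics['p95_latency_ms']} ms exceeds {slo['p95_latency_ms']} ms")
    if metrics["p99_latency_ms"] > slo["p99_latency_ms"]:
        violations.append(
            f"p99 latency {metrics['p99_latency_ms']} ms exceeds {slo['p99_latency_ms']} ms")
    if metrics["throughput_tps"] < slo["min_throughput_tps"]:
        violations.append(
            f"throughput {metrics['throughput_tps']} tps is below {slo['min_throughput_tps']} tps")
    return violations


def exit_code(violations, ci_mode):
    """In CI mode any violation fails the process; otherwise always succeed."""
    return 1 if (ci_mode and violations) else 0


# SLO values copied from the configuration above.
SLO = {"p95_latency_ms": 30000, "p99_latency_ms": 60000, "min_throughput_tps": 2}

# Figures from the concurrency-10 row of the Sample Results table.
level_10 = {"p95_latency_ms": 52000, "p99_latency_ms": 58100, "throughput_tps": 11.1}

violations = check_slos(level_10, SLO)
for line in violations:
    print("SLO violation:", line)
print("exit code in CI mode:", exit_code(violations, ci_mode=True))
```

Returning the violations as data, and deciding the exit code separately, keeps the check itself free of side effects and easy to unit-test.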

Sample Results

Results below are from llama3.2:1b on a MacBook Pro M2 (CPU only, no GPU offload):

| Concurrency | p50 Latency | p95 Latency | p99 Latency | TTFT p50 | Throughput |
|---|---|---|---|---|---|
| 1 | 8,200 ms | 11,400 ms | 12,100 ms | 180 ms | 14.2 tps |
| 5 | 19,800 ms | 28,600 ms | 31,200 ms | 420 ms | 12.8 tps |
| 10 | 38,400 ms | 52,000 ms | 58,100 ms | 810 ms | 11.1 tps |
| 20 | 71,200 ms | 95,300 ms | 104,000 ms | 1,600 ms | 9.4 tps |

How to read this:

  • p50 latency scales roughly linearly with concurrency — this is expected for a CPU-bound single-process server with no request batching (Ollama's default mode). A GPU-backed vLLM deployment with continuous batching would show much flatter scaling.
  • TTFT increases with concurrency because queued requests wait longer before the model starts generating their response — this is the prefill queue building up.
  • Throughput drops under high concurrency (14.2 → 9.4 tps), most likely because context-switching overhead increases and memory bandwidth is shared across more simultaneous KV-caches.
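  • The p99/p95 ratio widens under load, from about 1.06 at concurrency 1 to roughly 1.09 to 1.12 at concurrency 5 and above. This is the signature of a queuing system approaching saturation: queueing delay grows non-linearly with utilisation (in an M/M/1 model, mean waiting time scales with ρ/(1 − ρ)), so tail latency blows up disproportionately once utilisation passes roughly 70%.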

What a performance engineer would do next: Plot p50 and p95 against concurrency and find the knee of the curve, the concurrency level where p95 starts diverging from p50. That's your practical concurrency ceiling for this SLO.
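
One simple way to turn the table into a ceiling is sketched below, assuming the 30,000 ms p95 SLO from the configuration. The rows are copied from the Sample Results table; the function names are hypothetical.

```python
# (concurrency, p50 ms, p95 ms) rows copied from the Sample Results table.
SAMPLE_ROWS = [
    (1, 8200, 11400),
    (5, 19800, 28600),
    (10, 38400, 52000),
    (20, 71200, 95300),
]


def tail_ratio_by_level(rows):
    """p95/p50 per concurrency level; a rising ratio signals the tail pulling away."""
    return {concurrency: p95 / p50 for concurrency, p50, p95 in rows}


def concurrency_ceiling(rows, p95_slo_ms):
    """Highest concurrency level reached before p95 first breaches the SLO."""
    ceiling = None
    for concurrency, _p50, p95 in sorted(rows):
        if p95 > p95_slo_ms:
            break
        ceiling = concurrency
    return ceiling


for level, ratio in tail_ratio_by_level(SAMPLE_ROWS).items():
    print(f"concurrency {level:>2}: p95/p50 = {ratio:.2f}")
print("ceiling for a 30 s p95 SLO:", concurrency_ceiling(SAMPLE_ROWS, 30000))
```

For the sample rows the ceiling comes out at 5. The p95/p50 ratio stays between about 1.34 and 1.44 across the sweep, so in this data it is the SLO threshold, rather than a sharp visible knee, that sets the limit. A finer-grained sweep between 5 and 10 would narrow it down further.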


CI/CD Integration

Every push to main runs a lightweight benchmark against tinyllama and fails the build if p95 latency exceeds the configured threshold. This prevents inference regressions from landing in production — the same gate pattern you'd use for API performance budgets.

GitHub Push
    │
    ▼
Unit Tests (no Ollama) ──fail──▶ ✗ Block merge
    │
    ▼ pass
Benchmark (tinyllama, c=[1,5])
    │
    ├─ p95 > 60s? ──▶ ✗ Fail build + upload results artifact
    │
    └─ p95 ≤ 60s? ──▶ ✓ Upload results artifact

Benchmark results are uploaded as workflow artifacts and (for PRs) posted as a comment.


Project Structure

llm-bench/
├── benchmarks/
│   ├── metrics.py      # Pure computation: percentiles, aggregation, SLO checks
│   ├── runner.py       # Async load generator (NDJSON stream parsing, concurrency sweep)
│   └── reporter.py     # JSON + HTML report generation
├── config/
│   ├── benchmark_config.yaml   # All tunables in one place
│   └── prometheus.yml          # Prometheus scrape config
├── dashboards/
│   └── grafana/llm_bench.json  # Importable Grafana dashboard
├── tests/
│   └── test_metrics.py         # Unit tests (no network required)
├── .github/workflows/
│   └── benchmark-ci.yml        # CI pipeline with SLO gate
├── scripts/setup_ollama.sh     # One-shot environment setup
├── docker-compose.yml          # Ollama + Prometheus + Grafana stack
├── main.py                     # CLI entry point
└── requirements.txt

Extending This Tool

Add a new metric: Add a field to RequestResult in metrics.py, populate it in runner.py, include it in AggregatedMetrics, and surface it in reporter.py.

Test a different model: Change ollama.model in the config. Ollama supports Mistral, Phi-3, Gemma, and many others.

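Push metrics to Prometheus: The prometheus-client library is already installed. After each run_benchmark() call, push the aggregated results to a Prometheus Pushgateway. Note that docker-compose.yml is described above as an Ollama + Prometheus + Grafana stack, so a Pushgateway service would have to be added for this.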

Add a statistical comparison between runs: Load the benchmark JSON files from two runs (for example, saved copies of benchmark_latest.json from before and after a change) and use scipy.stats.mannwhitneyu to test whether a model update produced a statistically significant latency change.
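
A sketch of that comparison, under two assumptions: the per-request latencies from each run are already loaded into plain lists (the JSON schema is not shown in this README, so the loading step is left out), and the latency values below are made-up illustration data. To stay dependency-free the sketch implements the Mann-Whitney U test directly, using a normal approximation with tie correction and no continuity correction. `scipy.stats.mannwhitneyu` is the better choice in practice, especially for small samples.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test; returns (U for sample_a, approximate p-value)."""
    n1, n2 = len(sample_a), len(sample_b)
    n = n1 + n2
    # Tag each value with its sample (0 = a, 1 = b) and sort by value.
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])

    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # tied values share ranks i+1 .. j
        group_size = j - i
        tie_term += group_size ** 3 - group_size
        rank_sum_a += average_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j

    u_a = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    variance_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance_u == 0:
        return u_a, 1.0  # every value identical: no evidence of a difference
    z = (u_a - mean_u) / math.sqrt(variance_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return u_a, p_value


# Made-up end-to-end latencies (ms) standing in for two benchmark runs.
baseline = [8100, 8200, 8300, 8250, 8150, 8400, 8350, 8220]
candidate = [9100, 9300, 9250, 9400, 9150, 9500, 9350, 9220]

u_stat, p = mann_whitney_u(baseline, candidate)
print(f"U={u_stat:.1f}, p={p:.4f}")
print("significant at 5%" if p < 0.05 else "no significant difference at 5%")
```

A rank-based test suits latency data because it makes no normality assumption, and latency distributions are typically right-skewed with long tails.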


Tech Stack

| Layer | Tool | Rationale |
|---|---|---|
| LLM server | Ollama | Local, free, supports 50+ models |
| Load generation | Python asyncio + aiohttp | Native async concurrency, no binary dependency |
| Metrics math | NumPy | Vectorised percentile computation |
| CLI | Click + Rich | Professional terminal UX |
| Reporting | Jinja2 + Chart.js | Zero-build-step HTML dashboards |
| Observability | Prometheus + Grafana | Industry-standard stack |
| CI | GitHub Actions | Free tier, workflow artifacts, PR comments |
