Chronik Stream

An event-native data platform that unifies streaming, analytics, search, and semantic retrieval in a single binary.

Write events once. Query them immediately — with SQL, full-text search, vector similarity, or Kafka consumers. No ETL. No glue. No pipeline tax.

The Problem

Modern real-time architectures require stitching together many systems:

Producer -> Kafka -> Flink/Spark -> Data Warehouse -> BI Tools
                 |-> Kafka Connect -> Elasticsearch -> Search API
                 |-> Custom ETL -> Vector DB -> AI/Agent API
                 |-> S3 Sink -> Athena/Trino -> Ad-hoc Queries

Each arrow is a failure point, a latency penalty, and an operational burden. Data is copied, reformatted, and re-indexed across systems that each serve one purpose. By the time an event is searchable, queryable, and semantically retrievable, it has passed through 4-8 systems with minutes to hours of delay.

The Chronik Approach

Chronik treats the event log as the source of truth and derives all query capabilities directly from it — at ingestion time, in the same process:

Producer -> Chronik -> Done.
               |
               |  (internally; indexes are built asynchronously after the ack)
               |-> WAL (durability, Kafka consumers)
               |-> Arrow/Parquet (SQL via DataFusion)
               |-> Tantivy (full-text search, BM25)
               |-> HNSW (vector embeddings, semantic search)

Events become queryable across all modalities shortly after ingestion: about a second for SQL on the hot buffer, and roughly 30-90 seconds for full-text and vector search. There is no ETL pipeline between the log and the query engine; they are the same system.

┌─────────────────────────────────────────────────────────────┐
│                        Chronik Stream                        │
│                                                              │
│  Kafka Protocol (9092)          Unified API (6092)           │
│  ├─ Produce / Fetch             ├─ /_sql    (DataFusion)     │
│  ├─ Consumer Groups             ├─ /_search (BM25/Tantivy)   │
│  ├─ Transactions                ├─ /_vector (HNSW)           │
│  └─ 24 Kafka APIs               ├─ /_query  (Orchestrator)   │
│                                  └─ /admin   (Cluster ops)   │
│                                                              │
│  One binary. One log. Multiple query modalities.             │
└─────────────────────────────────────────────────────────────┘

This replaces what traditionally requires Kafka, ClickHouse, Elasticsearch, and a vector database with one Rust binary that speaks the Kafka wire protocol.

Per-Topic Capability Model

Features are orthogonal and enabled per topic. A topic with no features enabled is a pure Kafka topic. Each capability adds a background indexing path — nothing changes in the write path.

# Pure Kafka: no extras, just streaming
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic logs

# SQL analytics: enables /_sql queries
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic orders \
  --config columnar.enabled=true

# Full-text search: enables /_search queries
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic documents \
  --config searchable.enabled=true

# Vector search: enables /_vector semantic queries
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic support-tickets \
  --config vector.enabled=true \
  --config vector.embedding.provider=openai

# All capabilities on a single topic
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic events \
  --config columnar.enabled=true \
  --config searchable.enabled=true \
  --config vector.enabled=true

Enable what you need, pay only for what you use. A topic with nothing enabled has zero indexing overhead.
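The same capability flags can also be assembled programmatically. The sketch below builds the config map and a JSON body in the shape this README uses for `POST /topics` in the SQL API section. The helper names `topic_config` and `create_topic_body` are hypothetical, and passing the search and vector keys through the REST endpoint is an assumption, since the README only shows `columnar.enabled` there.

```python
import json

def topic_config(columnar=False, searchable=False, vector=False,
                 embedding_provider=None):
    """Build the per-topic config map; an empty map means a plain Kafka topic."""
    config = {}
    if columnar:
        config["columnar.enabled"] = "true"
    if searchable:
        config["searchable.enabled"] = "true"
    if vector:
        config["vector.enabled"] = "true"
        if embedding_provider is not None:
            config["vector.embedding.provider"] = embedding_provider
    return config

def create_topic_body(name, partitions, **capabilities):
    """JSON body with the name/partitions/config fields shown for POST /topics."""
    return json.dumps({
        "name": name,
        "partitions": partitions,
        "config": topic_config(**capabilities),
    })

print(create_topic_body("events", 3, columnar=True, searchable=True, vector=True))
```

Because the flags are independent, each one only adds its own key; nothing is implied by any other flag.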

What Gets Enabled

Capability	Config	Query Endpoint	Engine	Latency
Streaming	always on	Kafka Fetch (9092)	WAL	< 1ms
SQL	`columnar.enabled=true`	`POST /_sql`	DataFusion + Arrow/Parquet	0-200ms
Search	`searchable.enabled=true`	`POST /_search`	Tantivy (BM25)	1-500ms
Vector	`vector.enabled=true`	`POST /_vector/:topic/search`	HNSW + embeddings	5-100ms
Orchestrator	automatic	`POST /_query`	RRF multi-backend fusion	10-500ms

How It Works Internally

All indexing happens asynchronously after the produce ack. The write path is unaffected:

Producer -> WAL (fsync) -> Produce Ack      (~2-10ms, same as Kafka)
               |
               v  (background, async)
          WalIndexer
               |
    ┌──────────┼──────────┐
    v          v          v
 Tantivy    Parquet    Embeddings
 (search)   (SQL)      (vector)

Query availability after ingestion:

Kafka Fetch: immediate
SQL (hot buffer): ~1 second
Full-text search: 30-60 seconds
Vector search: 30-90 seconds (includes embedding API call)

Architecture Comparison

What Chronik Replaces

Capability	Traditional Stack	Chronik
Event streaming	Apache Kafka	Built-in (Kafka protocol)
Stream processing	Flink / Spark Streaming	DataFusion SQL at query time
Data warehouse	ClickHouse / BigQuery	Arrow hot buffer + Parquet cold storage
Full-text search	Elasticsearch / OpenSearch	Tantivy (BM25); NDCG@10 of 0.5927 on WANDS vs. ~0.50-0.56 published for Elasticsearch BM25
Vector search	Pinecone / Weaviate / Qdrant	HNSW with pluggable embeddings
ETL pipelines	Kafka Connect / Debezium / dbt	Eliminated — indexing is built-in
Object storage	S3 sink connector	Native S3/GCS/Azure tiered storage

What Chronik Does Not Replace

Chronik is not a general-purpose OLAP database, a graph database, or a batch processing framework. It is purpose-built for event data: append-only logs where records arrive in time order and are queried by content, time range, or semantic similarity.

Benchmarks and Validation

Validated on a 3-node bare metal cluster (Dell Xeon E5-2667v4, 32 cores, 256GB RAM, 1GbE) running MicroK8s.

Streaming Throughput

Configuration	Messages/sec	Throughput	Errors
Cluster, 12 ingestors, 256B	837,284 msg/s	245 MB/s	0.00%
Cluster, 36 ingestors, 1KB	420,609 msg/s	411 MB/s	0.00%
Standalone, acks=1	309,000 msg/s	—	0.00%

Zero data loss across 1B+ messages. See BARE_METAL_PERFORMANCE.md.

Search Quality (WANDS Benchmark — v2.4.0)

System	NDCG@10	Notes
Elasticsearch BM25	~0.50-0.56	Published baselines
Chronik BM25 (Tantivy)	0.5927	+6-18% vs Elasticsearch
Elasticsearch Hybrid	~0.58-0.65	Published baselines
Chronik Hybrid (BM25 + Vector, RRF)	0.6095	Within ES hybrid range

Tested against the WANDS dataset (42,994 products, 480 queries). See docs/DATASET_VALIDATION_REPORT.md.
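For readers unfamiliar with the metric, NDCG@10 for a single query can be computed as sketched below. This uses the linear-gain form, `rel / log2(rank + 1)`; the README does not say which gain variant or label-to-number mapping the validation used, so treat both as assumptions. The reported figure would be the mean over all queries.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: rel_i / log2(i + 1), with ranks i from 1."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked_relevances: labels of the returned documents, in result order.
    all_relevances:    labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels: 2 = exact match, 1 = partial, 0 = irrelevant.
judged = [2, 1, 0]
print(ndcg_at_k([2, 1, 0], judged))   # best possible order
print(ndcg_at_k([0, 1, 2], judged))   # worst possible order
```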

Scale Validation (Amazon Reviews — v2.4.0)

Dataset	Records	Text Search p50	SQL p50	Errors
Amazon Appliances	2.1M	2.8ms @ 1000 VUs	43ms	0.00%
Amazon Grocery	14.3M	8.7ms @ 1000 VUs	360ms	0.00%
Amazon Electronics	43.9M	112ms @ 1000 VUs	185ms	0.00%

Use Cases

Operational Analytics

Replace Kafka + Flink + ClickHouse. Produce events via Kafka protocol. Query with SQL. No pipeline.

# Produce events from your application (any Kafka client).
# kafka-python shown; assumes `producer` was created with a JSON value_serializer.
producer.send('orders', {'user_id': 42, 'amount': 99.50, 'city': 'Berlin'})

# Query via SQL (recent rows are served from the hot buffer after about a second)
curl -X POST http://localhost:6092/_sql \
  -d '{"query": "SELECT city, COUNT(*), AVG(amount) FROM orders GROUP BY city"}'

Real-Time Search

Replace Kafka + Kafka Connect + Elasticsearch. Events become searchable about 30-60 seconds after ingestion.

# Full-text search across all messages
curl -X POST http://localhost:6092/_search \
  -d '{"index": "support-tickets", "query": {"match": {"_all": "login timeout error"}}}'
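Because search indexing is asynchronous (see "Query availability after ingestion" earlier in this README), a script that produces a record and searches for it right away may see no hits at first. A small polling helper is one way to handle that. `wait_until` is a hypothetical helper, and the 90-second default simply mirrors the upper bound quoted earlier.

```python
import time

def wait_until(check, timeout_s=90.0, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns a truthy value or timeout_s elapses.

    Returns the truthy value, or None on timeout. clock and sleep are
    injectable so the helper can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            return None
        sleep(interval_s)

# Simulated search that only returns a hit on the third attempt.
attempts = iter([[], [], ["doc-17"]])
print(wait_until(lambda: next(attempts), interval_s=0.0))
```

In real use, `check` would wrap a call to `/_search` and return the list of hits, which is empty until the record has been indexed.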

AI and Agent Memory

Replace Kafka + custom ETL + Pinecone/Weaviate. Events are embedded and indexed for semantic retrieval automatically.

# Semantic search — "what happened recently that's similar to this?"
curl -X POST http://localhost:6092/_vector/incidents/search \
  -d '{"query": "database connection pool exhausted", "k": 10}'

# Hybrid search — combine keyword precision with semantic recall
curl -X POST http://localhost:6092/_vector/incidents/hybrid \
  -d '{"query": "OOM kill postgres", "k": 10, "vector_weight": 0.7}'

Multi-Modal Retrieval

Use the query orchestrator to search across modalities and topics in a single request:

curl -X POST http://localhost:6092/_query \
  -d '{
    "query": "payment failures in EU",
    "topics": ["orders", "support-tickets", "incidents"],
    "backends": ["search", "vector"],
    "k": 20
  }'

Results are fused via Reciprocal Rank Fusion (RRF) and returned as a single ranked list.
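As an illustration of the fusion step, a generic Reciprocal Rank Fusion can be written in a few lines: each backend contributes `1 / (k + rank)` for every document it returns, and documents are sorted by the summed score. The constant `k = 60` is the value commonly used in the literature; the README does not state which constant or tie-breaking rule Chronik uses, so both are assumptions here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, limit=None):
    """Fuse ranked lists of document IDs; each list adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:limit] if limit is not None else fused

bm25_hits = ["t-12", "t-7", "t-3"]      # hypothetical /_search results
vector_hits = ["t-7", "t-9", "t-12"]    # hypothetical /_vector results
for doc_id, score in reciprocal_rank_fusion([bm25_hits, vector_hits]):
    print(doc_id, round(score, 5))
```

Only ranks are used, not raw scores, so BM25 scores and vector distances never need to be normalized against each other.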

Quick Start

Docker (Recommended)

# Single node — just works
docker run -d -p 9092:9092 -p 6092:6092 \
  -e CHRONIK_ADVERTISED_ADDR=localhost \
  ghcr.io/lspecian/chronik-stream:latest start

# With persistent storage
docker run -d --name chronik \
  -p 9092:9092 -p 6092:6092 \
  -v chronik-data:/data \
  -e CHRONIK_ADVERTISED_ADDR=localhost \
  ghcr.io/lspecian/chronik-stream:latest start

IMPORTANT: CHRONIK_ADVERTISED_ADDR is required when binding to 0.0.0.0 or running in Docker. Without it, clients receive 0.0.0.0:9092 and fail to connect.

Binary

# Linux x86_64
curl -L https://github.com/lspecian/chronik-stream/releases/latest/download/chronik-server-linux-amd64.tar.gz | tar xz
./chronik-server start

# macOS (Apple Silicon)
curl -L https://github.com/lspecian/chronik-stream/releases/latest/download/chronik-server-darwin-arm64.tar.gz | tar xz
./chronik-server start

From Source

git clone https://github.com/lspecian/chronik-stream.git
cd chronik-stream
cargo build --release --bin chronik-server
./target/release/chronik-server start

Test with a Kafka Client

from kafka import KafkaProducer, KafkaConsumer

# Send one message and block until it has been delivered
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    api_version=(0, 10, 0)
)
producer.send('test-topic', b'Hello Chronik!')
producer.flush()

# Read the topic from the beginning; stop after 5 s without new messages
consumer = KafkaConsumer(
    'test-topic',
    bootstrap_servers='localhost:9092',
    api_version=(0, 10, 0),
    auto_offset_reset='earliest',
    consumer_timeout_ms=5000
)
for message in consumer:
    print(f"Received: {message.value}")

SQL API

Query Kafka topics directly with SQL. No external database required.

# Create a columnar-enabled topic
curl -X POST http://localhost:6092/topics \
  -H "Content-Type: application/json" \
  -d '{"name": "orders", "partitions": 3, "config": {"columnar.enabled": "true"}}'

# Produce data via any Kafka client, then query immediately
curl -X POST http://localhost:6092/_sql \
  -d '{"query": "SELECT city, COUNT(*) as cnt, AVG(amount) FROM orders GROUP BY city ORDER BY cnt DESC LIMIT 10"}'

# Recent data (0-1ms from in-memory hot buffer)
curl -X POST http://localhost:6092/_sql \
  -d '{"query": "SELECT * FROM orders_hot WHERE amount > 100 LIMIT 10"}'

# Historical data (Parquet files)
curl -X POST http://localhost:6092/_sql \
  -d '{"query": "SELECT * FROM orders_cold WHERE _partition = 0 LIMIT 10"}'

JSON message values are automatically inferred into typed columns (Int64, Float64, Boolean, Utf8). Query with standard SQL — no UDFs or schema definitions needed.

Tables per topic:

Table	Source	Latency
`{topic}_hot`	In-memory Arrow buffer (recent data)	0-1ms
`{topic}_cold`	Parquet files (historical data)	1-200ms
`{topic}`	Unified view (hot + cold)	1-200ms

See docs/COLUMNAR_STORAGE_GUIDE.md for full SQL reference.
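The curl calls above translate directly to Python's standard library. The sketch below builds a `POST /_sql` request and picks between the three table names. The helpers `table_name`, `build_sql_request`, and `run_sql` are hypothetical, the `Content-Type` header is added as a precaution, and the response is returned as decoded JSON without assuming its shape, which this README does not document.

```python
import json
import urllib.request

BASE_URL = "http://localhost:6092"   # Unified API port from this README

def table_name(topic, tier=None):
    """Return `{topic}`, `{topic}_hot`, or `{topic}_cold`."""
    if tier is None:
        return topic
    if tier not in ("hot", "cold"):
        raise ValueError("tier must be 'hot', 'cold', or None")
    return f"{topic}_{tier}"

def build_sql_request(query, base_url=BASE_URL):
    """Build (but do not send) a POST /_sql request with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_sql(query, base_url=BASE_URL, timeout_s=10):
    """Send the query to a running server and return the decoded JSON."""
    request = build_sql_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))

hot = table_name("orders", "hot")
request = build_sql_request(f"SELECT * FROM {hot} WHERE amount > 100 LIMIT 10")
print(request.get_method(), request.full_url)
```

Calling `run_sql(...)` requires a running server; building the request does not, which keeps the example checkable offline.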

Cluster Mode

Chronik supports multi-node clusters with Raft consensus, automatic replication, and zero-downtime operations. Minimum 3 nodes for quorum.

# Start a 3-node cluster (one terminal per node)
./chronik-server start --config cluster-node1.toml
./chronik-server start --config cluster-node2.toml
./chronik-server start --config cluster-node3.toml

# cluster-node1.toml
node_id = 1
replication_factor = 3
min_insync_replicas = 2

[[peers]]
id = 1
kafka = "node1:9092"
wal = "node1:9291"
raft = "node1:5001"

[[peers]]
id = 2
kafka = "node2:9092"
wal = "node2:9291"
raft = "node2:5001"

[[peers]]
id = 3
kafka = "node3:9092"
wal = "node3:9291"
raft = "node3:5001"
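Only the node 1 file is shown above. Assuming the other nodes' files differ only in `node_id` (an assumption, not something this README states), the per-node files can be generated rather than hand-edited. `render_node_config` and `majority` are hypothetical helpers; the ports are the ones from the example.

```python
def majority(node_count):
    """Smallest number of nodes that forms a Raft majority."""
    return node_count // 2 + 1

def render_node_config(node_id, hosts, replication_factor=3,
                       min_insync_replicas=2):
    """Render one node's TOML in the layout shown above.

    hosts[i] becomes the peer with id i + 1.
    """
    if not 1 <= node_id <= len(hosts):
        raise ValueError("node_id must match one of the peers")
    lines = [
        f"node_id = {node_id}",
        f"replication_factor = {replication_factor}",
        f"min_insync_replicas = {min_insync_replicas}",
    ]
    for peer_id, host in enumerate(hosts, start=1):
        lines += [
            "",
            "[[peers]]",
            f"id = {peer_id}",
            f'kafka = "{host}:9092"',
            f'wal = "{host}:9291"',
            f'raft = "{host}:5001"',
        ]
    return "\n".join(lines) + "\n"

hosts = ["node1", "node2", "node3"]
print(render_node_config(2, hosts))
print(f"majority of {len(hosts)} nodes: {majority(len(hosts))}")
```

With three nodes the majority is two, so the cluster can keep a quorum with one node down.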

Cluster management:

# Add a node (zero downtime)
./chronik-server cluster add-node 4 --kafka node4:9092 --wal node4:9291 --raft node4:5001 --config cluster.toml

# Check status
./chronik-server cluster status --config cluster.toml

# Remove a node gracefully
./chronik-server cluster remove-node 4 --config cluster.toml

See docs/RUNNING_A_CLUSTER.md for the complete cluster guide.

Kafka Compatibility

Chronik implements 24 Kafka wire protocol APIs and is tested with real clients:

API	Versions	API	Versions
Produce	v0-v9	ApiVersions	v0-v3
Fetch	v0-v13	CreateTopics	v0-v7
ListOffsets	v0-v7	DeleteTopics	v0-v6
Metadata	v0-v12	CreatePartitions	v0-v3
OffsetCommit	v0-v8	InitProducerId	v0-v4
OffsetFetch	v0-v8	AddPartitionsToTxn	v0-v3
FindCoordinator	v0-v4	AddOffsetsToTxn	v0-v3
JoinGroup	v0-v9	EndTxn	v0-v3
SyncGroup	v0-v5	TxnOffsetCommit	v0-v3
Heartbeat	v0-v4	SaslHandshake	v0-v1
LeaveGroup	v0-v5	SaslAuthenticate	v0-v2
DescribeGroups	v0-v5	ListGroups	v0-v4

Tested clients: kafka-python, confluent-kafka, ksqlDB, Apache Flink.

Chronik also provides a Confluent-compatible Schema Registry (Avro, JSON Schema, Protobuf) on the admin API port. See docs/SCHEMA_REGISTRY.md.
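The version ranges in the table matter because Kafka clients generally pick, per API, the highest version that both sides support. The sketch below models that selection for a few rows of the table; it is a simplified illustration of the idea, not the code of any particular client, and `negotiate` is a hypothetical name.

```python
# (min, max) versions for a few APIs, copied from the table above.
BROKER_VERSIONS = {
    "Produce": (0, 9),
    "Fetch": (0, 13),
    "Metadata": (0, 12),
    "ApiVersions": (0, 3),
}

def negotiate(api, client_min, client_max, broker_versions=BROKER_VERSIONS):
    """Return the highest version both sides support, or None if none."""
    if api not in broker_versions:
        return None
    broker_min, broker_max = broker_versions[api]
    low = max(client_min, broker_min)
    high = min(client_max, broker_max)
    return high if low <= high else None

print(negotiate("Fetch", 4, 16))     # client newer than broker -> 13
print(negotiate("Produce", 3, 7))    # client older than broker -> 7
print(negotiate("Fetch", 14, 16))    # no overlap -> None
```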

Configuration Reference

Commands

chronik-server start [OPTIONS]          # Start server (auto-detects single-node or cluster)
chronik-server cluster <SUBCOMMAND>     # Manage cluster (add-node, remove-node, status)
chronik-server version                  # Display version info
chronik-server compact <SUBCOMMAND>     # Manage WAL compaction

Key Environment Variables

# Server
CHRONIK_DATA_DIR              # Data directory (default: ./data)
CHRONIK_ADVERTISED_ADDR       # Address advertised to Kafka clients (REQUIRED for Docker)
CHRONIK_UNIFIED_API_PORT      # Unified API port (default: 6092)
RUST_LOG                      # Log level: error, warn, info, debug, trace

# Performance
CHRONIK_WAL_PROFILE           # WAL batch interval: low (2ms), medium (10ms), high (50ms), ultra (100ms)
CHRONIK_PRODUCE_PROFILE       # Flush profile: low-latency, balanced, high-throughput, extreme

# SQL / Columnar
CHRONIK_HOT_BUFFER_ENABLED    # Enable in-memory SQL queries (default: true)
CHRONIK_HOT_BUFFER_MAX_RECORDS  # Max records per partition in hot buffer (default: 100000)

# Vector / Embeddings
OPENAI_API_KEY                # OpenAI API key for vector embeddings
CHRONIK_EMBEDDING_PROVIDER    # Embedding provider: openai, external

# Object Storage (tiered storage)
OBJECT_STORE_BACKEND          # Backend: s3, gcs, azure, local (default: local)
S3_BUCKET                     # S3 bucket name
S3_REGION                     # AWS region
S3_ENDPOINT                   # Custom endpoint (e.g., MinIO)

# Cluster
CHRONIK_ADMIN_API_KEY         # Admin API authentication (REQUIRED for production clusters)

Security

# SASL authentication (PLAIN, SCRAM-SHA-256, SCRAM-SHA-512)
CHRONIK_SASL_USERS='myuser:securepass123,app:app-secret' ./chronik-server start

# Disable default test users in production
CHRONIK_SASL_NO_DEFAULTS=1 ./chronik-server start

Admin API and Schema Registry support API key and HTTP Basic Auth respectively. See docs/ADMIN_API_SECURITY.md.
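On the client side, the matching settings for kafka-python can be assembled as below. The sketch also parses the `user:pass,user:pass` format of `CHRONIK_SASL_USERS`, splitting on the first colon; how the server itself parses that value is not documented here, so the parser is an assumption. `SASL_PLAINTEXT` assumes a listener without TLS. Both function names are hypothetical.

```python
def parse_sasl_users(value):
    """Parse 'user:pass,user2:pass2' into a dict (split on the first colon)."""
    users = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        username, separator, password = entry.partition(":")
        if not separator or not username or not password:
            raise ValueError("each entry must look like user:password")
        users[username] = password
    return users

def client_sasl_kwargs(username, password, mechanism="SCRAM-SHA-256"):
    """Keyword arguments for kafka-python's KafkaProducer / KafkaConsumer."""
    if mechanism not in ("PLAIN", "SCRAM-SHA-256", "SCRAM-SHA-512"):
        raise ValueError(f"unsupported mechanism: {mechanism}")
    return {
        "bootstrap_servers": "localhost:9092",
        "security_protocol": "SASL_PLAINTEXT",   # assumption: no TLS listener
        "sasl_mechanism": mechanism,
        "sasl_plain_username": username,
        "sasl_plain_password": password,
    }

users = parse_sasl_users("myuser:securepass123,app:app-secret")
print(sorted(users))
print(client_sasl_kwargs("myuser", users["myuser"])["sasl_mechanism"])
```

With kafka-python installed, the result can be passed straight through, for example `KafkaProducer(**client_sasl_kwargs("myuser", "securepass123"))`.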

Performance Tuning

WAL Profiles

The WAL auto-detects system resources and selects an appropriate batch interval. Override when needed:

CHRONIK_WAL_PROFILE=low        # Containers, small VMs (2ms batch)
CHRONIK_WAL_PROFILE=medium     # Typical servers (10ms batch)
CHRONIK_WAL_PROFILE=high       # Dedicated servers (50ms batch)
CHRONIK_WAL_PROFILE=ultra      # Maximum throughput (100ms batch)

Producer Flush Profiles

Profile	Flush Interval	Buffer	Use Case
`low-latency`	10ms	16MB	Real-time analytics, instant messaging
`balanced`	100ms	32MB	General-purpose workloads
`high-throughput`	500ms	128MB	Data pipelines, ETL, batch processing
`extreme`	2000ms	512MB	Bulk ingestion, data migrations
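One simple way to choose among these profiles is to start from the flush delay a workload can tolerate. The sketch below picks the profile with the longest flush interval that still fits a given budget, on the assumption (suggested by the use-case column, not stated outright) that longer intervals favor throughput. `pick_produce_profile` is a hypothetical helper; the numbers are copied from the table.

```python
# name -> (flush interval in ms, buffer in MB), copied from the table above.
PRODUCE_PROFILES = {
    "low-latency": (10, 16),
    "balanced": (100, 32),
    "high-throughput": (500, 128),
    "extreme": (2000, 512),
}

def pick_produce_profile(max_flush_delay_ms):
    """Return the profile with the longest flush interval within the budget."""
    eligible = [(interval, name)
                for name, (interval, _buffer_mb) in PRODUCE_PROFILES.items()
                if interval <= max_flush_delay_ms]
    if not eligible:
        raise ValueError("no profile flushes that often")
    return max(eligible)[1]

print(pick_produce_profile(250))   # balanced
```

The result is the value to set in `CHRONIK_PRODUCE_PROFILE`.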

Project Structure

chronik-stream/
├── crates/
│   ├── chronik-server/       # Unified server binary
│   ├── chronik-protocol/     # Kafka wire protocol (24 APIs)
│   ├── chronik-wal/          # Write-Ahead Log, group commit, recovery
│   ├── chronik-storage/      # Storage abstraction, segment I/O, WalIndexer
│   ├── chronik-columnar/     # Arrow/Parquet, DataFusion SQL, HNSW vector search
│   ├── chronik-embeddings/   # Embedding providers (OpenAI, external)
│   ├── chronik-search/       # Tantivy full-text search
│   ├── chronik-query/        # Query processing
│   ├── chronik-common/       # Shared types, metadata traits
│   ├── chronik-raft/         # Raft consensus
│   ├── chronik-raft-bridge/  # Raft integration bridge
│   ├── chronik-operator/     # Kubernetes operator (ChronikCluster CRD)
│   ├── chronik-auth/         # SASL, TLS, ACLs
│   ├── chronik-monitoring/   # Prometheus metrics
│   ├── chronik-config/       # Configuration management
│   ├── chronik-backup/       # Backup and disaster recovery
│   └── chronik-bench/        # Benchmarking tool
├── tests/                    # Integration tests
├── docs/                     # Guides and references
└── .github/workflows/        # CI/CD

Project Status

Chronik Stream is in active development. The core streaming, SQL, search, and vector capabilities are validated at scale (43.9M records, 3-node cluster). The Kafka protocol layer has been tested with real clients across production workloads.

Current version: v2.4.1 — See CHANGELOG.md for release history.

Roadmap priorities:

Cross-encoder reranking for search quality (Phase 10)
Query governance and policy controls (Phase 11)
Production hardening and ecosystem integrations

See docs/ROADMAP_RELEVANCE_ENGINE.md for the full roadmap.

Documentation

Guide	Description
CHANGELOG.md	Release history
docs/RUNNING_A_CLUSTER.md	Cluster setup guide
docs/COLUMNAR_STORAGE_GUIDE.md	SQL and Parquet storage
docs/SEARCHABLE_TOPICS.md	Full-text search
docs/VECTOR_SEARCH_GUIDE.md	Vector search and embeddings
docs/SCHEMA_REGISTRY.md	Confluent-compatible Schema Registry
docs/KSQL_INTEGRATION_GUIDE.md	ksqlDB integration
docs/API_REFERENCE.md	Unified REST API reference
docs/DATASET_VALIDATION_REPORT.md	Scale and quality validation
docs/DISASTER_RECOVERY.md	Backup and recovery
docs/ADMIN_API_SECURITY.md	Admin API security
BASELINE_PERFORMANCE.md	Performance benchmarks
BARE_METAL_PERFORMANCE.md	Cluster throughput testing

Contributing

Contributions are welcome. Please feel free to submit a Pull Request.

License

Apache License 2.0. See LICENSE for details.
