
oMLX (forked)

LLM inference, optimized for your Mac
Continuous batching and tiered KV caching, managed directly from your menu bar.

Supporting Gemma 4!

This fork adds support for the Gemma 4 models, including tool calls and channel-based thinking parsing. Validated with Gemma-4-31B-it, using Pi as the testing harness.

Validation Testing:

Running pi --tools read -ne -ns --model gemma-4-31b-it -m "Call read for README.md"

Current Limitations:

Enabling thinking together with tools results in some form of generation corruption. The cause is still undetermined: the model produces rambling output such as "// a single tool call". Some runs do produce sensible output, but something is wrong with how the message template is applied.

Install · Quickstart · Features · Models · CLI Configuration · Benchmarks · oMLX.ai


oMLX Admin Dashboard

Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.

oMLX persists the KV cache across a hot in-memory tier and a cold SSD tier. Even when the context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code. That's why I built it.

Install

From Source

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .          # Core only
pip install -e ".[mcp]"   # With MCP (Model Context Protocol) support
python -m omlx.cli serve

Requires macOS 15.0+ (Sequoia), Python 3.10+, and Apple Silicon (M1/M2/M3/M4).

Features

Supports text LLMs, vision-language models (VLM), OCR models, embeddings, and rerankers on Apple Silicon.

Admin Dashboard

Web UI at /admin for real-time monitoring, model management, chat, benchmark, and per-model settings. Supports English, Korean, Japanese, and Chinese. All CDN dependencies are vendored for fully offline operation.

oMLX Admin Dashboard

Vision-Language Models

Run VLMs with the same continuous batching and tiered KV cache stack as text LLMs. Supports multi-image chat, base64/URL/file image inputs, and tool calling with vision context. OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR) are auto-detected with optimized prompts.
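
As a sketch of the image-input path, here is a request with a base64-encoded image through the OpenAI-compatible endpoint. The host, port, and model name are placeholders for your own setup, not defaults confirmed by this README:

# Sketch: multi-image-capable chat via the OpenAI-compatible endpoint.
# Assumes the server is reachable at localhost:8000 and a VLM named
# "qwen3.5-vl" is loaded; substitute your own host, port, and model id.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("diagram.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3.5-vl",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)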

Tiered KV Cache (Hot + Cold)

Block-based KV cache management inspired by vLLM, with prefix sharing and Copy-on-Write. The cache operates across two tiers, sketched in code below:

  • Hot tier (RAM): Frequently accessed blocks stay in memory for fast access.
  • Cold tier (SSD): When the hot cache fills up, blocks are offloaded to SSD in safetensors format. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.

oMLX Hot & Cold Cache
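
The lookup logic, reduced to a minimal sketch (this is an illustration of the two-tier idea, not oMLX's actual internals; block hashing, CoW, prefix sharing, and the safetensors serialization are all elided):

# Illustration of two-tier lookup: on a prefix hit, blocks come from RAM
# if resident, are restored from SSD otherwise, and only a full miss
# falls back to recomputation.
from collections import OrderedDict

class TieredBlockCache:
    def __init__(self, hot_capacity, ssd):
        self.hot = OrderedDict()       # block_hash -> KV block, in LRU order
        self.hot_capacity = hot_capacity
        self.ssd = ssd                 # cold tier: dict-like, disk-backed

    def get(self, block_hash):
        if block_hash in self.hot:                 # hot hit
            self.hot.move_to_end(block_hash)
            return self.hot[block_hash]
        if block_hash in self.ssd:                 # cold hit: restore from disk
            block = self.ssd[block_hash]
            self.put(block_hash, block)
            return block
        return None                                # miss: caller recomputes

    def put(self, block_hash, block):
        self.hot[block_hash] = block
        self.hot.move_to_end(block_hash)
        while len(self.hot) > self.hot_capacity:   # offload LRU block to SSD
            evicted_hash, evicted = self.hot.popitem(last=False)
            self.ssd[evicted_hash] = evicted       # write-back to cold tier

On a restart, the hot tier starts empty but the SSD tier persists, which is what makes cached prefixes reusable across server restarts.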

Continuous Batching

Handles concurrent requests through mlx-lm's BatchGenerator. Prefill and completion batch sizes are configurable.
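
Conceptually, the serving loop looks something like the sketch below. This illustrates continuous batching in general, not the BatchGenerator API; the scheduler, generator, and request objects are hypothetical stand-ins:

# Illustration of continuous batching: new requests join the running batch
# between decode steps instead of waiting for the whole batch to finish.
def serving_loop(scheduler, generator, prefill_batch_size, completion_batch_size):
    active = []
    while True:
        # Admit waiting requests between decode steps, up to the batch limits.
        while scheduler.has_waiting() and len(active) < completion_batch_size:
            room = completion_batch_size - len(active)
            batch = scheduler.pop_waiting(min(prefill_batch_size, room))
            generator.prefill(batch)           # run prompt prefill for newcomers
            active.extend(batch)
        if active:
            generator.step(active)             # one decode step for the whole batch
        # Finished requests leave immediately, freeing slots for the next
        # admission pass instead of waiting for the rest of the batch.
        active = [r for r in active if not r.finished]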

Claude Code Optimization

Context scaling support for running smaller-context models with Claude Code. Reported token counts are scaled so that auto-compact triggers at the right time, and SSE keep-alives prevent read timeouts during long prefill.
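
The effect of context scaling can be illustrated with rough arithmetic (the numbers and the exact formula here are illustrative assumptions, not the implementation):

# Illustrative only: scale usage reported to the client so its compaction
# heuristics fire relative to the local model's real window.
assumed_context = 200_000    # context the client budgets against
model_context = 32_768       # context the local model actually has
actual_used = 24_000         # tokens actually consumed so far

reported = actual_used * assumed_context // model_context
print(reported)              # 146484 -> client compacts well before 32K overflows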

Multi-Model Serving

Load LLMs, VLMs, embedding models, and rerankers within the same server. Models are managed through a combination of automatic and manual controls; a sketch of the eviction policy follows the list:

  • LRU eviction: Least-recently-used models are evicted automatically when memory runs low.
  • Manual load/unload: Interactive status badges in the admin panel let you load or unload models on demand.
  • Model pinning: Pin frequently used models to keep them always loaded.
  • Per-model TTL: Set an idle timeout per model to auto-unload after a period of inactivity.
  • Process memory enforcement: Total memory limit (default: system RAM - 8GB) prevents system-wide OOM.
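
How pinning and LRU eviction interact can be sketched as follows (an illustration of the policy, not the EnginePool implementation; the engine attributes are hypothetical):

# Illustration of the eviction policy: pinned models are never evicted;
# otherwise the least-recently-used model is unloaded until memory fits.
from collections import OrderedDict

class EnginePoolSketch:
    def __init__(self, memory_limit):
        self.models = OrderedDict()    # name -> engine, in LRU order
        self.pinned = set()
        self.memory_limit = memory_limit

    def touch(self, name):
        self.models.move_to_end(name)  # mark as most recently used

    def evict_until_fits(self, used_bytes):
        for name in list(self.models):            # oldest first
            if used_bytes <= self.memory_limit:
                break
            if name in self.pinned:
                continue                          # pinned models stay loaded
            engine = self.models.pop(name)
            used_bytes -= engine.memory_bytes     # hypothetical attribute
            engine.unload()
        return used_bytes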

Per-Model Settings

Configure sampling parameters, chat template kwargs, TTL, model alias, model type override, and more per model directly from the admin panel. Changes apply immediately without server restart.

  • Model alias: set a custom API-visible name. /v1/models returns the alias, and requests accept both the alias and directory name.
  • Model type override: manually set a model as LLM or VLM regardless of auto-detection.

oMLX Chat Template Kwargs
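
For instance, after setting an alias from the admin panel, /v1/models should report it (a sketch; host, port, and names are placeholders):

# Sketch: list models through the OpenAI-compatible API; aliased models
# show the alias as their id.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)   # e.g. "coder" instead of "Qwen3-Coder-Next-8bit"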

Built-in Chat

Chat directly with any loaded model from the admin panel. Supports conversation history, model switching, dark mode, reasoning model output, and image upload for VLM/OCR models.

oMLX Chat

Model Downloader

Search and download MLX models from HuggingFace directly in the admin dashboard. Browse model cards, check file sizes, and download with one click.

oMLX Model Downloader

Integrations

Set up OpenClaw, OpenCode, and Codex directly from the admin dashboard with a single click. No manual config editing required.

oMLX Integrations

Performance Benchmark

One-click benchmarking from the admin panel. Measures prefill (PP) and text generation (TG) tokens per second, with partial prefix cache hit testing for realistic performance numbers.

oMLX Benchmark Tool

macOS Menubar App

Native PyObjC menubar app (not Electron). Start, stop, and monitor the server without opening a terminal. Includes persistent serving stats (survives restarts), auto-restart on crash, and in-app auto-update.

oMLX Menubar Stats

API Compatibility

Drop-in replacement for OpenAI and Anthropic APIs. Supports streaming usage stats (stream_options.include_usage), Anthropic adaptive thinking, and vision inputs (base64, URL).

Endpoint                     Description
POST /v1/chat/completions    Chat completions (streaming)
POST /v1/completions         Text completions (streaming)
POST /v1/messages            Anthropic Messages API
POST /v1/embeddings          Text embeddings
POST /v1/rerank              Document reranking
GET  /v1/models              List available models
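
For example, streaming with usage stats through the OpenAI-compatible endpoint (host, port, and model id below are placeholders for your own setup):

# Sketch: streaming completion with stream_options.include_usage.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="qwen3-coder-next",
    messages=[{"role": "user", "content": "Explain KV caching in one line."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:   # final chunk carries the token counts
        print(f"\n[{chunk.usage.total_tokens} tokens]")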

Tool Calling & Structured Output

Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the tools parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:

Model Family                   Format
Llama, Qwen, DeepSeek, etc.    JSON <tool_call>
Qwen3.5 Series                 XML <function=...>
Gemma                          <start_function_call>
GLM (4.7, 5)                   <arg_key>/<arg_value> XML
MiniMax                        Namespaced <minimax:tool_call>
Mistral                        [TOOL_CALLS]
Kimi K2                        <|tool_calls_section_begin|>
Longcat                        <longcat_tool_call>

Models not listed above may still work if their chat template accepts tools and their output uses a recognized <tool_call> XML format. For tool-enabled streaming, assistant text is emitted incrementally while known tool-call control markup is suppressed from visible content; structured tool calls are emitted after parsing the completed turn.
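
A tool-calling request, mirroring the Gemma validation test above (a sketch against the OpenAI-compatible endpoint; host, port, and the tool schema are placeholders):

# Sketch: an OpenAI-style tool-calling request. The model's chat template
# must accept the tools parameter.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "read",
        "description": "Read a file from disk",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="gemma-4-31b-it",
    messages=[{"role": "user", "content": "Call read for README.md"}],
    tools=tools,
)
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)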

Models

Point --model-dir at a directory containing MLX-format model subdirectories. Two-level organization folders (e.g., mlx-community/model-name/) are also supported.

~/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
├── Qwen3.5-122B-A10B-4bit/
└── bge-m3/

Models are auto-detected by type. You can also download models directly from the admin dashboard.

Type        Models
LLM         Any model supported by mlx-lm
VLM         Qwen3.5 Series, GLM-4V, Pixtral, and other mlx-vlm models
OCR         DeepSeek-OCR, DOTS-OCR, GLM-OCR
Embedding   BERT, BGE-M3, ModernBERT
Reranker    ModernBERT, XLM-RoBERTa

CLI Configuration

# Memory limit for loaded models
omlx serve --model-dir ~/models --max-model-memory 32GB

# Process-level memory limit (default: auto = RAM - 8GB)
omlx serve --model-dir ~/models --max-process-memory 80%

# Enable SSD cache for KV blocks
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Set in-memory hot cache size
omlx serve --model-dir ~/models --hot-cache-max-size 20%

# Adjust batch sizes
omlx serve --model-dir ~/models --prefill-batch-size 8 --completion-batch-size 32

# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json

# HuggingFace mirror endpoint (for restricted regions)
omlx serve --model-dir ~/models --hf-endpoint https://hf-mirror.com

# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key
# Localhost-only requests can skip key verification via the admin panel's global settings

All settings can also be configured from the web admin panel at /admin. Settings are persisted to ~/.omlx/settings.json, and CLI flags take precedence.

Architecture
FastAPI Server (OpenAI / Anthropic API)
    │
    ├── EnginePool (multi-model, LRU eviction, TTL, manual load/unload)
    │   ├── BatchedEngine (LLMs, continuous batching)
    │   ├── VLMEngine (vision-language models)
    │   ├── EmbeddingEngine
    │   └── RerankerEngine
    │
    ├── ProcessMemoryEnforcer (total memory limit, TTL checks)
    │
    ├── Scheduler (FCFS, configurable batch sizes)
    │   └── mlx-lm BatchGenerator
    │
    └── Cache Stack
        ├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)
        ├── Hot Cache (in-memory tier, write-back)
        └── PagedSSDCacheManager (SSD cold tier, safetensors format)

Development

CLI Server

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e ".[dev]"
pytest -m "not slow"

macOS App

Requires Python 3.11+ and venvstacks (pip install venvstacks).

cd packaging

# Full build (venvstacks + app bundle + DMG)
python build.py

# Skip venvstacks (code changes only)
python build.py --skip-venv

# DMG only
python build.py --dmg-only

See packaging/README.md for details on the app bundle structure and layer configuration.

Contributing

Contributions are welcome! See Contributing Guide for details.

  • Bug fixes and improvements
  • Performance optimizations
  • Documentation improvements

License

Apache 2.0

Acknowledgments

  • MLX and mlx-lm by Apple
  • mlx-vlm - Vision-language model inference on Apple Silicon
  • vllm-mlx - oMLX started from vllm-mlx v0.1.0 and evolved significantly with multi-model serving, tiered KV caching, VLM with full paged cache support, an admin panel, and a macOS menu bar app
  • venvstacks - Portable Python environment layering for the macOS app bundle
  • mlx-embeddings - Embedding model support for Apple Silicon
  • llm-compressor - Reference AWQ implementation for MoE models, used as design reference for oQ weight equalization
