Self-hosted, multi-model AI inference server. Runs LLMs alongside specialized models (TTS, speech-to-text, embeddings, image generation) on GPU or CPU, exposing an OpenAI-compatible API. Built on Ray Serve with pluggable inference backends: vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, llama.cpp for high-efficiency GGUF models on CPU, Diffusers for image generation, and a plugin system for custom backends.
Most self-hosted inference tools focus on running a single model. Modelship is for when you need multiple models running simultaneously — an LLM, a TTS engine, a speech-to-text model, an embedding model, and an image generator — all behind a single OpenAI-compatible API, with fine-grained control over GPU memory allocation across them.
- One server, many models — run a full AI stack (chat + TTS + STT + embeddings + image gen) on a single machine instead of juggling separate services
- GPU memory control — allocate exact GPU fractions per model (e.g. 70% for the LLM, 5% for TTS) so everything fits on your hardware
- Mix and match backends — use vLLM for high-throughput GPU inference, Transformers or llama.cpp for CPU-only workloads, Diffusers for images, and plugins for custom backends — in the same deployment
- Drop-in OpenAI replacement — any OpenAI SDK client works out of the box, making it easy to integrate with existing apps and tools like Home Assistant
graph TD
Client["Client (OpenAI SDK / curl)"]
API["FastAPI Gateway<br/>OpenAI-compatible API<br/>:8000"]
Client -->|HTTP| API
API -->|round-robin| LLM_GPU
API -->|round-robin| LLM_CPU
API -->|round-robin| TTS
API -->|round-robin| STT
API -->|round-robin| EMB
API -->|round-robin| IMG
subgraph GPU0["GPU 0 — vLLM"]
LLM_GPU["LLM Deployment<br/>e.g. Llama 3.1 8B<br/>70% GPU"]
TTS["TTS Deployment<br/>e.g. Kokoro 82M<br/>5% GPU"]
end
subgraph GPU1["GPU 1 — Mixed backends"]
STT["STT Deployment (vLLM)<br/>e.g. Whisper Large<br/>50% GPU"]
EMB["Embedding Deployment<br/>e.g. Nomic Embed<br/>50% GPU"]
end
subgraph CPU["CPU — Transformers / llama.cpp"]
LLM_CPU["LLM Deployment<br/>e.g. Qwen3-0.6B<br/>CPU-only"]
STT_CPU["STT Deployment<br/>e.g. Whisper Small<br/>CPU-only"]
end
subgraph GPU2["GPU 2 — Diffusers"]
IMG["Image Generation<br/>e.g. SDXL Turbo<br/>35% GPU"]
end
Each model runs as an isolated Ray Serve deployment with its own lifecycle, health checks, and resource budget. Five inference backends are available:
| Backend | Best for | GPU required |
|---|---|---|
| vLLM | High-throughput chat, embeddings, transcription | Yes |
| llama.cpp | High-efficiency quantized GGUF models (chat, embeddings) | No |
| Transformers | Chat, embeddings, transcription, TTS on CPU or lightweight GPU | No |
| Diffusers | Image generation | Yes |
| Custom (plugins) | TTS backends (Kokoro ONNX, Bark, Orpheus), STT backends (whisper.cpp) | No |
Models can be deployed across multiple GPUs, run on CPU-only, or both — multiple deployments of the same model (e.g. one on GPU via vLLM, one on CPU via Transformers) are load-balanced with round-robin routing. Each deployment can also scale horizontally with num_replicas.
...
- Docker (or Python 3.12+ with
uvfor local development) - NVIDIA GPU (optional) — 16 GB+ VRAM recommended for a full stack (LLM + TTS + STT + embeddings) via vLLM; 8 GB is sufficient for lighter setups. Not required when using the Transformers backend on CPU
- NVIDIA Container Toolkit — required only when running GPU models in Docker
- HuggingFace token for gated models
- Multi-model, multi-GPU — run chat, embedding, STT, TTS, and image generation models simultaneously across one or more GPUs with tunable per-model GPU memory allocation
- CPU-only support — run models without a GPU using the Transformers backend (chat, embeddings, transcription, TTS). Useful for development, testing, or small models that don't need GPU acceleration
- Multiple inference backends — vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, Diffusers for image generation, and a plugin system for custom backends
- Per-model isolated deployments — each model runs in its own Ray Serve deployment with independent lifecycle, health checks, failure isolation, and configurable replica count
- OpenAI-compatible API — drop-in replacement for any OpenAI SDK client
- Streaming — SSE streaming for chat completions and TTS audio
- Tool/function calling — auto tool choice with configurable parsers
- Plugin system — opt-in TTS and STT backends installed as isolated uv workspace packages
- Multi-GPU & hybrid routing — assign models to specific GPUs or run them on CPU-only; deploy the same model on both GPU and CPU and requests are load-balanced via round-robin; full tensor parallelism support for large models spanning multiple GPUs
- Client disconnect detection — cancels in-flight inference when the client disconnects, freeing GPU resources immediately
- Prometheus metrics & Grafana dashboard — built-in observability with custom
modelship:*metrics, vLLM engine stats, and Ray cluster metrics on a single scrape endpoint; pre-built Grafana dashboard included - Ray dashboard — monitor deployments, resources, and request logs
| Endpoint | Usecase |
|---|---|
POST /v1/chat/completions |
Chat / text generation (streaming and non-streaming) |
POST /v1/embeddings |
Text embeddings |
POST /v1/audio/transcriptions |
Speech-to-text |
POST /v1/audio/translations |
Audio translation |
POST /v1/audio/speech |
Text-to-speech (SSE streaming or single-response) |
POST /v1/images/generations |
Image generation |
GET /v1/models |
List available models |
The fastest way to try Modelship: run a quantized 7B chat model on a laptop — no GPU required. Copy-paste this block and you'll have an OpenAI-compatible API on http://localhost:8000 in a few minutes (first run downloads ~4.5 GB of weights into ./models-cache).
mkdir -p models-cache && cat > models.yaml <<'EOF'
models:
- name: qwen
model: lmstudio-community/Qwen2.5-7B-Instruct-GGUF
usecase: generate
loader: llama_cpp
num_cpus: 3
llama_cpp_config:
hf_filename: "*Q4_K_M.gguf"
EOF
docker run --rm --shm-size=8g \
-v ./models.yaml:/modelship/config/models.yaml \
-v ./models-cache:/.cache \
-p 8000:8000 \
ghcr.io/alez007/modelship:latest-cpuImages are multi-arch (amd64 + arm64), so this works on Apple Silicon and ARM Linux hosts too.
Once the server is up (look for Deployed app 'modelship api' successfully), call it:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello!"}]}'Or point any OpenAI SDK at it — no code changes, just swap base_url:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
model="qwen",
messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)For high-throughput GPU inference, use the standard image and add --gpus all. You'll also need the NVIDIA Container Toolkit and an HF_TOKEN for gated models. Example models.yaml entries for vLLM, Diffusers, and multi-GPU setups live in docs/model-configuration.md; ready-to-run configs are in config/examples/.
docker run --rm --shm-size=8g --gpus all \
-e HF_TOKEN=your_token_here \
-e RAY_HEAD_GPU_NUM=1 \
-v ./models.yaml:/modelship/config/models.yaml \
-v ./models-cache:/.cache \
-p 8000:8000 \
ghcr.io/alez007/modelship:latestHitting an error? Check docs/troubleshooting.md.
Modelship's TTS and STT systems are built around a plugin architecture — each backend is an opt-in package with its own isolated dependencies. Plugins ship inside this repo (plugins/) or can be installed from PyPI.
Built-in plugins:
- Kokoro ONNX — lightweight TTS via ONNX Runtime (CPU or GPU)
- Bark — multilingual TTS by Suno (GPU recommended)
- Orpheus — expressive TTS
- whisper.cpp — CPU-only STT via
pywhispercpp
To enable plugins, pass them as extras at sync time:
uv sync --extra kokoroonnx
uv sync --extra kokoroonnx --extra whispercpp # multiple pluginsWhen using Docker, set the MSHIP_PLUGINS environment variable:
MSHIP_PLUGINS=kokoroonnx,whispercpp
For a full guide on writing your own plugin, see Plugin Development.
- Development — dev environment setup, building, and running locally
- Model Configuration — full
models.yamlreference, GPU pinning, environment variables - Architecture — system design, request lifecycle, plugin loading
- Plugin Development — writing custom TTS/STT backends
- Home Assistant Integration — Wyoming protocol setup for voice automation
- Monitoring & Logging — Prometheus metrics, Grafana dashboard, structured logging, health checks
- Troubleshooting — common first-run errors and fixes
- Roadmap — what's planned next and where to contribute
Modelship exposes Prometheus metrics (Ray cluster, Ray Serve, vLLM, and custom modelship:* metrics) through a single scrape endpoint on port 8079. Metrics are enabled by default — set MSHIP_METRICS=false to disable. A pre-built Grafana dashboard is included.
Logging supports structured JSON output (MSHIP_LOG_FORMAT=json) and request ID correlation across Ray actor boundaries. Logs can be shipped to a remote syslog server (--log-target syslog://host:514) or an OpenTelemetry collector (--otel-endpoint http://collector:4317). Set MSHIP_LOG_LEVEL to TRACE for full request/response payloads, or DEBUG for detailed diagnostics without payloads.
See Monitoring & Logging for full details.
Modelship is actively used but not yet hardened for production. Key gaps today: no rate limiting, /health is a no-op, thin test coverage, no Helm chart, no Prometheus alerting rules. See the full Production Readiness Plan for the scorecard and roadmap.
See CONTRIBUTING.md for guidelines on setting up the dev environment, code style, and submitting pull requests.