Modelship

Self-hosted, multi-model AI inference server. Runs LLMs alongside specialized models (TTS, speech-to-text, embeddings, image generation) on GPU or CPU, exposing an OpenAI-compatible API. Built on Ray Serve with pluggable inference backends: vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, llama.cpp for high-efficiency GGUF models on CPU, Diffusers for image generation, and a plugin system for custom backends.

Why Modelship?

Most self-hosted inference tools focus on running a single model. Modelship is for when you need multiple models running simultaneously — an LLM, a TTS engine, a speech-to-text model, an embedding model, and an image generator — all behind a single OpenAI-compatible API, with fine-grained control over GPU memory allocation across them.

One server, many models — run a full AI stack (chat + TTS + STT + embeddings + image gen) on a single machine instead of juggling separate services
GPU memory control — allocate exact GPU fractions per model (e.g. 70% for the LLM, 5% for TTS) so everything fits on your hardware
Mix and match backends — use vLLM for high-throughput GPU inference, Transformers or llama.cpp for CPU-only workloads, Diffusers for images, and plugins for custom backends — in the same deployment
Drop-in OpenAI replacement — any OpenAI SDK client works out of the box, making it easy to integrate with existing apps and tools like Home Assistant
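
The GPU memory control point above can be sanity-checked before deploying. The sketch below is not part of Modelship: the `Deployment` record and its fields (`gpu_index`, `gpu_fraction`) are hypothetical names used only for illustration. It shows that the fractions assigned to one GPU should sum to at most 1.0.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record; field names are illustrative, not Modelship's schema."""
    name: str
    gpu_index: int
    gpu_fraction: float  # share of that GPU's memory, between 0.0 and 1.0


def gpu_usage(deployments):
    """Sum the requested memory fraction per GPU index."""
    totals = defaultdict(float)
    for d in deployments:
        totals[d.gpu_index] += d.gpu_fraction
    return dict(totals)


def oversubscribed(deployments, limit=1.0):
    """Return the GPU indices whose summed fractions exceed the limit."""
    usage = gpu_usage(deployments)
    # Small tolerance so float rounding does not flag an exact fit.
    return sorted(i for i, total in usage.items() if total > limit + 1e-9)


plan = [
    Deployment("llm", 0, 0.70),
    Deployment("tts", 0, 0.05),
    Deployment("stt", 1, 0.50),
    Deployment("embed", 1, 0.50),
]
print(gpu_usage(plan))
print(oversubscribed(plan))
```

Adding a further 35% deployment to GPU 1 in this plan would make `oversubscribed` report index 1, which is the kind of mistake worth catching before the server starts loading weights.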

Architecture

graph TD
    Client["Client (OpenAI SDK / curl)"]
    API["FastAPI Gateway<br/>OpenAI-compatible API<br/>:8000"]

    Client -->|HTTP| API
    API -->|round-robin| LLM_GPU
    API -->|round-robin| LLM_CPU
    API -->|round-robin| TTS
    API -->|round-robin| STT
    API -->|round-robin| STT_CPU
    API -->|round-robin| EMB
    API -->|round-robin| IMG

    subgraph GPU0["GPU 0 — vLLM"]
        LLM_GPU["LLM Deployment<br/>e.g. Llama 3.1 8B<br/>70% GPU"]
        TTS["TTS Deployment<br/>e.g. Kokoro 82M<br/>5% GPU"]
    end

    subgraph GPU1["GPU 1 — Mixed backends"]
        STT["STT Deployment (vLLM)<br/>e.g. Whisper Large<br/>50% GPU"]
        EMB["Embedding Deployment<br/>e.g. Nomic Embed<br/>50% GPU"]
    end

    subgraph CPU["CPU — Transformers / llama.cpp"]
        LLM_CPU["LLM Deployment<br/>e.g. Qwen3-0.6B<br/>CPU-only"]
        STT_CPU["STT Deployment<br/>e.g. Whisper Small<br/>CPU-only"]
    end

    subgraph GPU2["GPU 2 — Diffusers"]
        IMG["Image Generation<br/>e.g. SDXL Turbo<br/>35% GPU"]
    end

Each model runs as an isolated Ray Serve deployment with its own lifecycle, health checks, and resource budget. Five inference backends are available:

Backend	Best for	GPU required
vLLM	High-throughput chat, embeddings, transcription	Yes
llama.cpp	High-efficiency quantized GGUF models (chat, embeddings)	No
Transformers	Chat, embeddings, transcription, TTS on CPU or lightweight GPU	No
Diffusers	Image generation	Yes
Custom (plugins)	TTS backends (Kokoro ONNX, Bark, Orpheus), STT backends (whisper.cpp)	No

Models can be deployed across multiple GPUs, run on CPU-only, or both — multiple deployments of the same model (e.g. one on GPU via vLLM, one on CPU via Transformers) are load-balanced with round-robin routing. Each deployment can also scale horizontally with num_replicas. ...
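
As a rough mental model of the round-robin behaviour described above, the toy router below cycles through the deployments registered under one model name. It is only a sketch: the real routing happens inside Modelship and Ray Serve, and the class and deployment names here are hypothetical.

```python
from itertools import cycle


class RoundRobinRouter:
    """Cycle through the deployments registered under one model name."""

    def __init__(self, deployments_by_model):
        # One independent cycle per model name, so models do not affect each other.
        self._cycles = {
            name: cycle(targets) for name, targets in deployments_by_model.items()
        }

    def pick(self, model):
        """Return the next deployment for the given model name."""
        if model not in self._cycles:
            raise KeyError(f"unknown model: {model}")
        return next(self._cycles[model])


# Hypothetical deployment labels: the same chat model on GPU and on CPU.
router = RoundRobinRouter(
    {"chat": ["vllm-gpu0", "transformers-cpu"], "embed": ["vllm-gpu1"]}
)
picks = [router.pick("chat") for _ in range(4)]
print(picks)
```

Requests for `chat` alternate between the two deployments, while `embed` always lands on its single deployment.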

Requirements

Docker (or Python 3.12+ with uv for local development)
NVIDIA GPU (optional): 16 GB+ VRAM recommended for a full stack (LLM + TTS + STT + embeddings) via vLLM; 8 GB is sufficient for lighter setups. Not required when running the Transformers or llama.cpp backends on CPU
NVIDIA Container Toolkit: required only when running GPU models in Docker
HuggingFace token for gated models

Features

Multi-model, multi-GPU: run chat, embedding, STT, TTS, and image generation models simultaneously across one or more GPUs with tunable per-model GPU memory allocation
CPU-only support: run models without a GPU using the Transformers backend (chat, embeddings, transcription, TTS) or the llama.cpp backend (quantized GGUF chat and embedding models). Useful for development, testing, or small models that don't need GPU acceleration
Multiple inference backends: vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, llama.cpp for quantized GGUF models on CPU, Diffusers for image generation, and a plugin system for custom backends
Per-model isolated deployments: each model runs in its own Ray Serve deployment with independent lifecycle, health checks, failure isolation, and configurable replica count
OpenAI-compatible API: drop-in replacement for any OpenAI SDK client
Streaming: SSE streaming for chat completions and TTS audio
Tool/function calling: auto tool choice with configurable parsers
Plugin system: opt-in TTS and STT backends installed as isolated uv workspace packages
Multi-GPU & hybrid routing: assign models to specific GPUs or run them on CPU only; deploy the same model on both GPU and CPU to have requests load-balanced via round-robin; tensor parallelism is supported for large models that span multiple GPUs
Client disconnect detection: cancels in-flight inference when the client disconnects, freeing GPU resources immediately
Prometheus metrics & Grafana dashboard: built-in observability with custom modelship:* metrics, vLLM engine stats, and Ray cluster metrics on a single scrape endpoint; pre-built Grafana dashboard included
Ray dashboard: monitor deployments, resources, and request logs

Supported OpenAI Endpoints

Endpoint	Usecase
`POST /v1/chat/completions`	Chat / text generation (streaming and non-streaming)
`POST /v1/embeddings`	Text embeddings
`POST /v1/audio/transcriptions`	Speech-to-text
`POST /v1/audio/translations`	Audio translation
`POST /v1/audio/speech`	Text-to-speech (SSE streaming or single-response)
`POST /v1/images/generations`	Image generation
`GET /v1/models`	List available models

Quick Start

The fastest way to try Modelship: run a quantized 7B chat model on a laptop — no GPU required. Copy-paste this block and you'll have an OpenAI-compatible API on http://localhost:8000 in a few minutes (first run downloads ~4.5 GB of weights into ./models-cache).

mkdir -p models-cache && cat > models.yaml <<'EOF'
models:
  - name: qwen
    model: lmstudio-community/Qwen2.5-7B-Instruct-GGUF
    usecase: generate
    loader: llama_cpp
    num_cpus: 3
    llama_cpp_config:
      hf_filename: "*Q4_K_M.gguf"
EOF

docker run --rm --shm-size=8g \
  -v ./models.yaml:/modelship/config/models.yaml \
  -v ./models-cache:/.cache \
  -p 8000:8000 \
  ghcr.io/alez007/modelship:latest-cpu

Images are multi-arch (amd64 + arm64), so this works on Apple Silicon and ARM Linux hosts too.

Once the server is up (look for Deployed app 'modelship api' successfully), call it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello!"}]}'

Or point any OpenAI SDK at it — no code changes, just swap base_url:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
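
Chat completions can also be streamed over SSE. With the OpenAI SDK this is a matter of passing `stream=True` and iterating over the returned chunks. The sketch below shows what the wire format looks like when consumed by hand. It assumes Modelship emits the standard OpenAI chunk shape (`choices[].delta.content`, terminated by `data: [DONE]`); the function name and the sample lines are illustrative only.

```python
import json


def iter_stream_text(lines):
    """Yield text deltas from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines in the OpenAI streaming format.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In practice `lines` would be the decoded lines of the HTTP response body from a request sent with `"stream": true` in the JSON payload.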

GPU (vLLM, Diffusers)

For high-throughput GPU inference, use the standard image and add --gpus all. You'll also need the NVIDIA Container Toolkit and an HF_TOKEN for gated models. Example models.yaml entries for vLLM, Diffusers, and multi-GPU setups live in docs/model-configuration.md; ready-to-run configs are in config/examples/.

docker run --rm --shm-size=8g --gpus all \
  -e HF_TOKEN=your_token_here \
  -e RAY_HEAD_GPU_NUM=1 \
  -v ./models.yaml:/modelship/config/models.yaml \
  -v ./models-cache:/.cache \
  -p 8000:8000 \
  ghcr.io/alez007/modelship:latest

Hitting an error? Check docs/troubleshooting.md.

Plugin Support

Modelship's TTS and STT systems are built around a plugin architecture — each backend is an opt-in package with its own isolated dependencies. Plugins ship inside this repo (plugins/) or can be installed from PyPI.

Built-in plugins:

Kokoro ONNX — lightweight TTS via ONNX Runtime (CPU or GPU)
Bark — multilingual TTS by Suno (GPU recommended)
Orpheus — expressive TTS
whisper.cpp — CPU-only STT via pywhispercpp

To enable plugins, pass them as extras at sync time:

uv sync --extra kokoroonnx
uv sync --extra kokoroonnx --extra whispercpp  # multiple plugins

When using Docker, set the MSHIP_PLUGINS environment variable:

MSHIP_PLUGINS=kokoroonnx,whispercpp
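
If containers are launched from a script, the plugin list can be assembled programmatically. The helper below is a hypothetical convenience, not something Modelship ships. It only combines flags and paths already shown in this README, and it assumes the chosen image accepts `MSHIP_PLUGINS` at startup.

```python
import shlex


def docker_run_args(image, plugins=(), gpu=False, env=None):
    """Assemble a `docker run` argv using only flags shown in this README."""
    args = ["docker", "run", "--rm", "--shm-size=8g"]
    if gpu:
        args += ["--gpus", "all"]
    merged = dict(env or {})
    if plugins:
        # Plugins are passed as one comma-separated value.
        merged["MSHIP_PLUGINS"] = ",".join(plugins)
    for key, value in merged.items():
        args += ["-e", f"{key}={value}"]
    args += [
        "-v", "./models.yaml:/modelship/config/models.yaml",
        "-v", "./models-cache:/.cache",
        "-p", "8000:8000",
        image,
    ]
    return args


cmd = docker_run_args(
    "ghcr.io/alez007/modelship:latest-cpu",
    plugins=["kokoroonnx", "whispercpp"],
)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting issues.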

For a full guide on writing your own plugin, see Plugin Development.

Documentation

Development — dev environment setup, building, and running locally
Model Configuration — full models.yaml reference, GPU pinning, environment variables
Architecture — system design, request lifecycle, plugin loading
Plugin Development — writing custom TTS/STT backends
Home Assistant Integration — Wyoming protocol setup for voice automation
Monitoring & Logging — Prometheus metrics, Grafana dashboard, structured logging, health checks
Troubleshooting — common first-run errors and fixes
Roadmap — what's planned next and where to contribute

Monitoring

Modelship exposes Prometheus metrics (Ray cluster, Ray Serve, vLLM, and custom modelship:* metrics) through a single scrape endpoint on port 8079. Metrics are enabled by default — set MSHIP_METRICS=false to disable. A pre-built Grafana dashboard is included.
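
To look at only the custom metrics from that scrape endpoint, the Prometheus text format can be filtered by name prefix. This is a minimal sketch: the `/metrics` path, the metric names in the sample, and the helper itself are assumptions for illustration, and the parser assumes samples carry no trailing timestamp.

```python
def filter_metrics(exposition_text, prefix="modelship:"):
    """Return {sample_name_with_labels: value} for samples starting with prefix."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Split on the last space: everything before it is name plus labels.
        name, _, value = line.rpartition(" ")
        if name.startswith(prefix):
            samples[name] = float(value)
    return samples


# Hand-written sample; both metric names are hypothetical.
sample = """
# HELP modelship:requests_total Hypothetical request counter
# TYPE modelship:requests_total counter
modelship:requests_total{model="qwen"} 12
ray_example_gauge 8
"""
print(filter_metrics(sample))

# Against a running server (path is an assumption):
# import urllib.request
# with urllib.request.urlopen("http://localhost:8079/metrics") as resp:
#     print(filter_metrics(resp.read().decode("utf-8")))
```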

Logging supports structured JSON output (MSHIP_LOG_FORMAT=json) and request ID correlation across Ray actor boundaries. Logs can be shipped to a remote syslog server (--log-target syslog://host:514) or an OpenTelemetry collector (--otel-endpoint http://collector:4317). Set MSHIP_LOG_LEVEL to TRACE for full request/response payloads, or DEBUG for detailed diagnostics without payloads.

See Monitoring & Logging for full details.

Production Readiness

Modelship is actively used but not yet hardened for production. Key gaps today: no rate limiting, /health is a no-op, thin test coverage, no Helm chart, no Prometheus alerting rules. See the full Production Readiness Plan for the scorecard and roadmap.

Contributing

See CONTRIBUTING.md for guidelines on setting up the dev environment, code style, and submitting pull requests.
