feat(bundle): llama.cpp CPU Qwen3-Embedding-0.6B embedding bundle #111

Merged
kh0pper merged 2 commits into main from feat/llamacpp-cpu-embed-bundle
Jun 29, 2026

kh0pper commented Jun 29, 2026

Owner

What

A CPU-only embedding bundle so hosts without a compatible GPU — or that can't reach a shared embedder like grackle-embed — can run Crow's semantic search locally.

Serves Qwen3-Embedding-0.6B (Q8_0 GGUF, 1024-dim) via llama.cpp with an OpenAI-compatible /v1/embeddings endpoint on 127.0.0.1:8007, registering the llamacpp-cpu-embed provider. Same model / vector space as the GPU vllm-cuda-embed and llamacpp-vulkan-qwen3-embed bundles, so embeddings are interchangeable.
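
For reviewers who want to poke at the endpoint, here is a minimal client sketch using only the Python standard library. The URL and the 1024-dim expectation come from the description above; the `MODEL_NAME` value is a placeholder assumption (the server may ignore it or expect something else), and the response shape assumed is the usual OpenAI-style `data[i].embedding` layout.

```python
import json
import math
import urllib.request

# Endpoint taken from the PR description.
EMBED_URL = "http://127.0.0.1:8007/v1/embeddings"
# Hypothetical label: the real server may ignore or expect a different model name.
MODEL_NAME = "qwen3-embedding-0.6b"
EXPECTED_DIM = 1024


def build_request(texts):
    """Build an OpenAI-style embeddings POST request (nothing is sent here)."""
    payload = json.dumps({"model": MODEL_NAME, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        EMBED_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body):
    """Extract vectors from an OpenAI-style response, ordered by their index."""
    items = sorted(body["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    for vec in vectors:
        if len(vec) != EXPECTED_DIM:
            raise ValueError(f"expected {EXPECTED_DIM} dims, got {len(vec)}")
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(texts, timeout=60):
    """Send the request to the local server; requires the bundle to be running."""
    with urllib.request.urlopen(build_request(texts), timeout=timeout) as resp:
        return parse_response(json.load(resp))
```

With the bundle running, `embed(["first text", "second text"])` should return two 1024-float lists that can be compared with `cosine`. The dimension check is there because interchangeability with the GPU bundles depends on every provider returning vectors of the same size.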

Why

The existing embedding bundles are GPU + Linux-only (vllm-cuda-embed → NVIDIA, llamacpp-vulkan-qwen3-embed → ROCm/gfx1151). There was no option for a Mac/Windows Docker Desktop host or any CPU-only box.

Highlights

  • Runs anywhere Docker runs — no GPU, gpu_arch: ["cpu"], port bound to 127.0.0.1.
  • First request auto-downloads the ~640MB GGUF via -hf and caches it in a Docker volume (no manual model fetch).
  • Pairs with the configurable-provider change (feat(embeddings): make the embedding provider configurable #110): after install, set dashboard_settings.embed_provider = 'llamacpp-cpu-embed' (or CROW_EMBED_PROVIDER).
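
To illustrate the provider selection mentioned in the last bullet, here is a small sketch of how a lookup over the two knobs could work. The names `dashboard_settings.embed_provider` and `CROW_EMBED_PROVIDER` come from this PR; the precedence (setting first, then environment variable) and the `None` fallback are assumptions made for the example, not a description of what #110 implements.

```python
import os


def resolve_embed_provider(dashboard_settings, env=None):
    """Pick an embedding provider name.

    Assumed order (not confirmed by the PR): dashboard setting, then the
    CROW_EMBED_PROVIDER environment variable, then None.
    """
    env = os.environ if env is None else env
    from_settings = (dashboard_settings or {}).get("embed_provider")
    if from_settings:
        return from_settings
    return env.get("CROW_EMBED_PROVIDER") or None
```

For example, `resolve_embed_provider({"embed_provider": "llamacpp-cpu-embed"}, {})` returns `"llamacpp-cpu-embed"`, and so does a call with empty settings and `{"CROW_EMBED_PROVIDER": "llamacpp-cpu-embed"}` as the environment.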

Changes

  • bundles/llamacpp-cpu-qwen3-embed/manifest.json, docker-compose.yml, README.md.
  • registry/add-ons.json — regenerated via npm run build-registry (single entry added, no churn).

Validation

  • npm run build-registry → 89 bundles, 0 invalid/draft/untracked.
  • npm run test:bundle-contract → 25/25 pass.

Live container test (image tag / -hf flags) happens on a Docker host as part of activation; flagging in case the published ghcr.io/ggml-org/llama.cpp:server tag or flag names need a tweak after first pull.

🤖 Generated with Claude Code

Adds a CPU-only embedding bundle so hosts without a compatible GPU (or that
can't reach grackle-embed) can run semantic search locally. Serves
Qwen3-Embedding-0.6B (Q8_0 GGUF, 1024-dim) via llama.cpp with an OpenAI-compatible
/v1/embeddings endpoint on 127.0.0.1:8007, registering the llamacpp-cpu-embed
provider. Same model / vector space as the GPU embed bundles. Runs on macOS/Windows
Docker Desktop; first request auto-downloads the GGUF via -hf and caches it.

Regenerated registry/add-ons.json via build-registry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The manifest declared contextLen 32768 but docker-compose serves --ctx-size 8192,
so inputs over 8K tokens would be silently rejected despite the advertised
capacity. The CPU bundle intentionally caps ctx at 8192 for RAM; embedding
inputs are capped at 8000 chars upstream, so 8192 is ample. Lower the declared
contextLen (manifest + regenerated registry entry) to match reality. Vector
space is unchanged (1024-dim, same model) — embeddings stay interchangeable
with the GPU bundles; only max input length differs.
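
The mismatch fixed in this commit is the kind of thing a small consistency check could catch. The sketch below is hypothetical and not part of the repository: it takes a declared context length and a compose-style command list, and reports when the declared value exceeds what `--ctx-size` actually serves. How the real manifest and compose file are loaded is left out, since their exact schema is not shown in this PR.

```python
def ctx_size_from_command(command):
    """Return the integer given to --ctx-size in a command list, or None if absent."""
    args = [str(arg) for arg in command]
    for i, arg in enumerate(args):
        if arg == "--ctx-size" and i + 1 < len(args):
            return int(args[i + 1])
        if arg.startswith("--ctx-size="):
            return int(arg.split("=", 1)[1])
    return None


def check_context_len(declared, command):
    """Return a problem description, or None when the declaration is consistent.

    Assumed policy: advertising more context than the server is started
    with is an error; advertising less (or the same) is fine.
    """
    served = ctx_size_from_command(command)
    if served is None:
        return "command does not set --ctx-size"
    if declared > served:
        return f"declared contextLen {declared} exceeds served --ctx-size {served}"
    return None
```

With the values from this commit, `check_context_len(32768, ["--ctx-size", "8192"])` reports the mismatch and `check_context_len(8192, ["--ctx-size", "8192"])` returns `None`.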
kh0pper merged commit 5b7615d into main Jun 29, 2026
1 check failed
kh0pper deleted the feat/llamacpp-cpu-embed-bundle branch June 29, 2026 01:39
kh0pper added a commit that referenced this pull request Jun 29, 2026
The CPU embedding bundle (#111) binds host port 8007, but the row was never
added to docs/developers/port-allocation.md, so the Port Allocation Check CI
(scripts/check-port-allocation.js) failed on the PR and on the main push.
Add the 8007 row; check now passes (43 ports, all documented).

Co-authored-by: kh0pper <kevin.hopper@maestro.press>
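
For readers unfamiliar with this kind of CI gate, here is a rough Python sketch of the idea behind a port allocation check. It is not the project's `scripts/check-port-allocation.js`. The Markdown table layout (port number in the first column) and the container port `8080` in the usage example are assumptions for illustration only.

```python
import re

# Matches table rows whose first cell is a port number, e.g. "| 8007 | some-bundle |".
ROW_PORT = re.compile(r"^\|\s*(\d{2,5})\s*\|")


def documented_ports(markdown):
    """Collect port numbers from the first column of Markdown table rows."""
    ports = set()
    for line in markdown.splitlines():
        match = ROW_PORT.match(line.strip())
        if match:
            ports.add(int(match.group(1)))
    return ports


def host_port(mapping):
    """Extract the host port from compose short syntax.

    Handles "HOST:CONTAINER" and "IP:HOST:CONTAINER"; a bare "CONTAINER"
    entry binds no fixed host port, so None is returned.
    """
    parts = mapping.split(":")
    if len(parts) < 2:
        return None
    return int(parts[-2])


def undocumented(mappings, markdown):
    """Return the sorted host ports that are bound but missing from the docs."""
    known = documented_ports(markdown)
    bound = {host_port(m) for m in mappings} - {None}
    return sorted(bound - known)
```

Called with `["127.0.0.1:8007:8080"]` and a table that lacks an 8007 row, `undocumented` returns `[8007]`; once the row is added it returns an empty list, which corresponds to the check passing.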