The first late-interaction visual document retriever whose document representation adapts to the query — D(q) — while staying a drop-in ColPali-style multi-vector index.
TL;DR — In ColPali, ColQwen, ColNomic and Nemotron ColEmbed, document pages are encoded without seeing the query. Argus inserts a region-aware Mixture-of-Experts inside the document encoder whose router conditions on a pooled query context z_q, so the same page is encoded differently for a table lookup, a chart question, or a paragraph-level evidence request — while the output stays a multi-vector index scored by MaxSim.
- 🎯 Query-conditioned documents. A per-region router fuses region content, 2D position, and the query context z_q to mix K=4 latent experts (+1 always-on shared expert), producing a query-dependent document grid D(q).
- 🏆 State of the art at 8–9B scale. Argus-9B reaches 86.0 NDCG@5 on the ViDoRe V1+V2 leaderboard, the highest reported value for an open late-interaction model, with the largest gains on the out-of-domain V2 split (+4.3 over Nemotron-colembed-8b-v2, the prior best on the V1+V2 average).
- 📦 Compact index. A fixed 1024-dim retrieval head — narrower than the 2560-dim and 4096-dim heads of recent SOTA — keeps the per-page index at 4.2 MB, up to 4.5× smaller than Nemotron-8B.
- ⚡ Deployable. The image encoder runs once per page offline; only the query branch, router, fusion, projection, and MaxSim run per query against cached visual grids. Argus-9B is 13.6× faster offline and 2.0× faster per query than Nemotron-8B.
- 🧪 Honest training budget. Trained on 9.3% of the available public supervision (593,677 pairs), with no model soup, seed averaging, or checkpoint merging.
Argus architecture. The query branch emits retrieval embeddings Q and a pooled context z_q; the document branch taps the backbone at two depths, routes pooled regions with z_q, and fuses latent + shared experts into a query-conditioned grid D(q) scored by MaxSim.
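The routing step described above can be illustrated with a toy, dependency-free sketch. It is not the released implementation: the concatenation of region content, position, and z_q, the softmax over the top-2 logits, the additive fusion with the shared expert, and every name and number below (`route_region`, `fuse`, the router rows, the scaling experts) are assumptions made for illustration only.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_region(region, position, z_q, router_weights, top_k=2):
    """Gate weights {expert_index: gate} for one region, keeping top_k experts."""
    # Assumption: the router sees the three signals as one concatenated vector.
    features = region + position + z_q
    logits = [sum(w * f for w, f in zip(row, features)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])
    return dict(zip(top, gates))

def fuse(region, gates, latent_experts, shared_expert):
    """Always-on shared expert plus the gate-weighted latent experts."""
    out = shared_expert(region)
    for index, gate in gates.items():
        expert_out = latent_experts[index](region)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Toy setup: K=4 latent experts that scale the region, and an identity shared expert.
latent_experts = [lambda r, s=s: [s * x for x in r] for s in (1.0, 2.0, 3.0, 4.0)]
shared_expert = lambda r: list(r)
# Toy router rows read only the last two features, i.e. the query context z_q.
router_weights = [
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, -1, 0],
    [0, 0, 0, 0, 0, -1],
]

region, position = [1.0, 0.0], [0.5, 0.5]
z_table, z_chart = [2.0, 1.0], [-2.0, -1.0]   # two made-up query contexts

gates_table = route_region(region, position, z_table, router_weights)
gates_chart = route_region(region, position, z_chart, router_weights)
d_table = fuse(region, gates_table, latent_experts, shared_expert)
d_chart = fuse(region, gates_chart, latent_experts, shared_expert)

print(sorted(gates_table), sorted(gates_chart))  # [0, 1] [2, 3]
print(d_table != d_chart)                        # True
```

The same region and position yield different expert mixes, and therefore different region vectors, once z_q changes. That is the sense in which the document grid is written D(q).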
| Model | Dim | V1 | V2 | Avg |
|---|---|---|---|---|
| ColQwen2.5 | 128 | 89.5 | 59.3 | 80.9 |
| ColNomic-7b | 128 | 89.7 | 60.8 | 81.5 |
| Sauerkraut-8b | — | 91.1 | 62.9 | 83.0 |
| Nemotron-colembed-4b-v2 | 2560 | 91.6 | 63.9 | 83.7 |
| Ops-Colqwen3-4B | 2560 | 91.4 | 67.8 | 84.6 |
| Nemotron-colembed-8b-v2 | 4096 | 92.7 | 64.9 | 84.7 |
| 🟦 Argus-2B | 1024 | 91.5 | 61.5 | 82.9 |
| 🟦 Argus-4B | 1024 | 92.3 | 64.1 | 84.2 |
| 🟦 Argus-9B | 1024 | 92.7 | 69.2 | 🥇 86.0 |
Argus-9B is the best system on all four out-of-domain V2 tasks (BiomedicalLectures, ESGReports, ESGReports-HighLevel, EconomicsReports).
| Model | Dim | Avg |
|---|---|---|
| Nemotron-colembed-8b-v2 | 4096 | 63.53 |
| Nemotron-colembed-4b-v2 | 2560 | 62.02 |
| Ops-Colqwen3-4B | 2560 | 61.26 |
| 🟦 Argus-2B | 1024 | 60.09 |
| 🟦 Argus-4B | 1024 | 62.09 |
| 🟦 Argus-9B | 1024 | 62.50 |
| Model | Macro Avg (10 lang) |
|---|---|
| Nemotron-colembed-8b-v2 | 0.7492 |
| 🟦 Argus-9B | 🥇 0.7552 |
Best system on 5 / 10 languages, including the long-tail Yoruba split (0.8099 vs. 0.5252 for Nemotron-8B).
| Model | Tok/page | Dim | MB/page | Doc encode (ms) | Query online (ms) |
|---|---|---|---|---|---|
| Nemotron-colembed-8b-v2 | 2304 | 4096 | 18.9 | 5090 | 278 |
| 🟦 Argus-9B | 2048 | 1024 | 4.2 | 374 | 136 |
Single H100 80GB, bf16, batch 1, on the ViDoRe V2 ESG Reports task.
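The index sizes and speedups quoted in this README follow from the table by simple arithmetic. The sketch below assumes each embedding value is stored in 2 bytes (bf16) and that MB means 10^6 bytes; those two assumptions reproduce the MB/page column but are not stated in the source, and `index_mb_per_page` is a hypothetical helper name.

```python
def index_mb_per_page(tokens_per_page, dim, bytes_per_value=2):
    """Size of one page's multi-vector index in decimal megabytes."""
    return tokens_per_page * dim * bytes_per_value / 1e6

argus_mb = index_mb_per_page(2048, 1024)      # Argus-9B row
nemotron_mb = index_mb_per_page(2304, 4096)   # Nemotron-colembed-8b-v2 row

print(round(argus_mb, 1), round(nemotron_mb, 1))   # 4.2 18.9
print(round(nemotron_mb / argus_mb, 1))            # 4.5  (index size ratio)
print(round(5090 / 374, 1))                        # 13.6 (doc encode speedup)
print(round(278 / 136, 1))                         # 2.0  (per-query speedup)
```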
| Model | Backbone | Params | Dim | Experts | 🤗 Hugging Face |
|---|---|---|---|---|---|
| Argus-2B | Qwen3.5-VL-2B | 2.32B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-2b-v0 |
| Argus-4B | Qwen3.5-VL-4B | 4.71B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-4b-v0 |
| Argus-9B | Qwen3.5-VL-9B | 8.82B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-9b-v0 |
Each model also ships a `-bf16` sibling (e.g. `…-9b-v0-bf16`) for lower-memory inference.
```bash
pip install "transformers>=5.0.0,<6.0.0" torch pillow
```
⚠️ Argus needs `transformers>=5.0`: the Qwen3.5-VL backbone (`transformers.models.qwen3_5`) only ships in the 5.x line. If you install MTEB first, upgrade transformers afterwards and clear `~/.cache/huggingface/modules/transformers_modules`.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0"
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, dtype="bfloat16"
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Encode queries and document page images
queries = ["What was the revenue in 2019?", "What does the chart show over time?"]
images = [Image.open("page_1.png"), Image.open("page_2.png")]
q = model.encode_queries(processor, queries)
d = model.encode_images(processor, images)

# MaxSim late-interaction scoring -> [num_queries x num_docs]
scores = processor.score(q, d)
print(scores)
```
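For readers new to late interaction, the MaxSim score can be written out in a few lines of plain Python. This is a reference sketch of the standard ColPali-style definition (for each query token, take the best dot product over document tokens, then sum). It assumes no normalization or batching, and it is not a description of what `processor.score` does internally; `maxsim` and `score_matrix` are hypothetical helper names.

```python
def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_vecs)
        for q_vec in query_vecs
    )

def score_matrix(queries, docs):
    """Scores with shape [num_queries][num_docs]."""
    return [[maxsim(q, d) for d in docs] for q in queries]

# Toy 2-dim embeddings: one query with two tokens, two documents with two tokens each.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query tokens exactly
doc_b = [[1.0, 0.0], [0.5, 0.5]]   # matches the second query token only partially

print(score_matrix([query], [doc_a, doc_b]))  # [[2.0, 1.5]]
```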
Argus is registered on the MTEB ViDoRe leaderboard. To reproduce the V1/V2/V3 numbers:
```bash
pip install "mteb>=2.12,<3.0.0"
# IMPORTANT: re-pin transformers AFTER mteb (mteb pulls 4.57.x, which lacks qwen3_5)
pip install "transformers>=5.0.0,<6.0.0"
rm -rf ~/.cache/huggingface/modules/transformers_modules
```
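Because the install order matters, it can help to confirm which transformers version is active before running the benchmark. The check below is a small sketch that uses only the standard library; `supports_qwen3_5` is a hypothetical helper that simply mirrors the `>=5.0.0,<6.0.0` pin above by testing the major version.

```python
from importlib.metadata import PackageNotFoundError, version

def supports_qwen3_5(version_string):
    """True when the major version is 5, matching the >=5.0.0,<6.0.0 pin."""
    major = int(version_string.split(".")[0])
    return major == 5

try:
    installed = version("transformers")
    status = "OK" if supports_qwen3_5(installed) else "re-pin transformers"
    print(installed, status)
except PackageNotFoundError:
    print("transformers is not installed")
```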
```python
import mteb

model = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0")

# ViDoRe V2: out-of-domain split
tasks = mteb.get_tasks(tasks=[
    "Vidore2BiomedicalLecturesRetrieval",
    "Vidore2ESGReportsRetrieval",
    "Vidore2ESGReportsHumanLabeledRetrieval",
    "Vidore2EconomicsReportsRetrieval",
])
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results/argus-9b")
```
The same call works for the ViDoRe V1 tasks and the public ViDoRe V3 tasks — swap the task list. Document features are encoded once per page; query conditioning is applied online against the cached grids, matching the deployment path described in the paper.
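The encode-once, condition-per-query split can be pictured with a toy index. This is only a sketch of the caching pattern: `CachedIndex`, `encode_page`, and `condition` are hypothetical stand-ins for the image encoder and the query-conditioned routing and fusion, not names from the Argus code.

```python
class CachedIndex:
    """Toy two-phase index: grids are cached offline, conditioning runs per query."""

    def __init__(self, encode_page, condition):
        self.encode_page = encode_page   # expensive, query-independent (stand-in)
        self.condition = condition       # cheap, query-dependent (stand-in)
        self.grids = {}
        self.encode_calls = 0

    def add_page(self, page_id, page):
        # Offline: run the image encoder once and keep the visual grid.
        self.grids[page_id] = self.encode_page(page)
        self.encode_calls += 1

    def documents_for(self, z_q):
        # Online: build D(q) from the cached grids without re-encoding any page.
        return {pid: self.condition(grid, z_q) for pid, grid in self.grids.items()}

# Toy stand-ins: a "grid" is one number, and conditioning just rescales it.
index = CachedIndex(
    encode_page=lambda page: [float(len(page))],
    condition=lambda grid, z_q: [g * z_q for g in grid],
)
index.add_page("p1", "abc")
index.add_page("p2", "abcde")

d_for_q1 = index.documents_for(z_q=1.0)
d_for_q2 = index.documents_for(z_q=2.0)
print(index.encode_calls)    # 2, regardless of how many queries are served
print(d_for_q1, d_for_q2)
```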
🚧 Coming soon. Training code, configs, and the router-warmup-then-joint recipe will be released here. Stay tuned.
```text
Argus-Retriever/
├── README.md # this file
├── assets/ # figures
├── inference/ # 🚧 coming soon
└── training/ # 🚧 coming soon
```
If you use Argus in your research, please cite:
```bibtex
@article{abdallah2026argus,
  title={Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval},
  author={Abdallah, Abdelrahman and Abdalla, Mahmoud and Ali, Mohammed and Jatowt, Adam},
  journal={arXiv preprint arXiv:2606.04300},
  year={2026}
}
```
Released under the Apache-2.0 license.
Argus builds on the ColPali late-interaction line and the Qwen3.5-VL backbone. We thank the ViDoRe and MTEB maintainers for the benchmark infrastructure, and the ColPali, ColQwen, ColNomic, and Nemotron ColEmbed teams for releasing the baselines that made fair comparison possible.