👁️ Argus-Retriever

Region-Aware Query-Conditioned Mixture-of-Experts for Visual Document Retrieval

The first late-interaction visual document retriever whose document representation adapts to the query — D(q) — while staying a drop-in ColPali-style multi-vector index.

Highlights • Results • Models • Quick Start • MTEB • Training • Citation

✨ Highlights

TL;DR — In ColPali, ColQwen, ColNomic and Nemotron ColEmbed, document pages are encoded without seeing the query. Argus inserts a region-aware Mixture-of-Experts inside the document encoder whose router conditions on a pooled query context z_q, so the same page is encoded differently for a table lookup, a chart question, or a paragraph-level evidence request — while the output stays a multi-vector index scored by MaxSim.

🎯 Query-conditioned documents. A per-region router fuses region content, 2D position, and the query context z_q to mix K=4 latent experts (+1 always-on shared expert), producing a query-dependent document grid D(q).
🏆 State of the art at 8–9B scale. Argus-9B reaches 86.0 NDCG@5 on the ViDoRe V1+V2 leaderboard, the highest reported value for an open late-interaction model. The largest gains are on the out-of-domain V2 split, where it scores 69.2: +4.3 over Nemotron-colembed-8b-v2 (64.9) and +1.4 over the best prior V2 score (Ops-Colqwen3-4B, 67.8).
📦 Compact index. A fixed 1024-dim retrieval head, narrower than the 2560-dim and 4096-dim heads of recent SOTA, keeps the per-page index at 4.2 MB, 4.5× smaller than the 18.9 MB of Nemotron-colembed-8b-v2.
⚡ Deployable. The image encoder runs once per page offline; only the query branch, router, fusion, projection, and MaxSim run per query against cached visual grids. Compared with Nemotron-colembed-8b-v2, Argus-9B encodes a page 13.6× faster offline (374 ms vs. 5090 ms) and answers a query 2.0× faster online (136 ms vs. 278 ms).
🧪 Honest training budget. Trained on 9.3% of the available public supervision (593,677 pairs), with no model soup, seed averaging, or checkpoint merging.

Figure: Argus architecture. The query branch emits retrieval embeddings Q and a pooled context z_q; the document branch taps the backbone at two depths, routes pooled regions with z_q, and fuses latent + shared experts into a query-conditioned grid D(q) scored by MaxSim.
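
The routing step described above can be sketched in NumPy. This is an illustrative toy under stated assumptions, not the released implementation: the single linear router, the random weights, the concatenation of the three router inputs, the linear experts, and every dimension are hypothetical. Only K=4 latent experts, top-2 selection (as listed in the Models table), and the always-on shared expert come from the description.

```python
import numpy as np

rng = np.random.default_rng(0)

R, C, K, TOP_K = 6, 16, 4, 2   # regions, channels, latent experts, experts kept per region

# Hypothetical inputs: pooled region features, normalized 2D region centers, query context z_q.
regions = rng.standard_normal((R, C))
positions = rng.uniform(0.0, 1.0, size=(R, 2))
z_q = rng.standard_normal(C)

# Hypothetical parameters: one linear router, K linear latent experts, one shared expert.
W_router = rng.standard_normal((2 * C + 2, K)) * 0.1
W_experts = rng.standard_normal((K, C, C)) * 0.1
W_shared = rng.standard_normal((C, C)) * 0.1


def route(regions, positions, z_q):
    """Per-region mixing weights over K experts, non-zero only for the top-2."""
    z = np.broadcast_to(z_q, (regions.shape[0], z_q.shape[0]))
    router_in = np.concatenate([regions, positions, z], axis=1)   # [R, 2C+2]
    logits = router_in @ W_router                                  # [R, K]
    # Keep the TOP_K largest logits per region and mask out the rest.
    kth = np.sort(logits, axis=1)[:, -TOP_K][:, None]
    masked = np.where(logits >= kth, logits, -np.inf)
    masked -= masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


def encode_regions(regions, positions, z_q):
    """Mix latent experts with routed weights, then add the always-on shared expert."""
    weights = route(regions, positions, z_q)                       # [R, K]
    expert_out = np.einsum("rc,kcd->rkd", regions, W_experts)      # [R, K, C]
    latent = np.einsum("rk,rkd->rd", weights, expert_out)          # [R, C]
    return latent + regions @ W_shared


# Same page regions, two different query contexts.
d_a = encode_regions(regions, positions, z_q)
d_b = encode_regions(regions, positions, rng.standard_normal(C))
print(d_a.shape, np.allclose(d_a, d_b))
```

The two calls share the same regions and positions and differ only in the query context, so any difference between `d_a` and `d_b` comes from the router, which is the sense in which the document grid is written D(q).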

📊 Results

ViDoRe V1 + V2 Leaderboard — NDCG@5

Model	Dim	V1	V2	Avg
ColQwen2.5	128	89.5	59.3	80.9
ColNomic-7b	128	89.7	60.8	81.5
Sauerkraut-8b	—	91.1	62.9	83.0
Nemotron-colembed-4b-v2	2560	91.6	63.9	83.7
Ops-Colqwen3-4B	2560	91.4	67.8	84.6
Nemotron-colembed-8b-v2	4096	92.7	64.9	84.7
🟦 Argus-2B	1024	91.5	61.5	82.9
🟦 Argus-4B	1024	92.3	64.1	84.2
🟦 Argus-9B	1024	92.7	69.2	🥇 86.0

Argus-9B is the best system on all four out-of-domain V2 tasks (BiomedicalLectures, ESGReports, ESGReports-HumanLabeled, EconomicsReports).
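
For readers unfamiliar with the metric, NDCG@k can be computed as in the minimal sketch below. It uses the standard definition with linear gain and assumes the ranked list contains every judged-relevant page for the query. It is not the MTEB evaluation code, and the ranking is made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical ranking for one query: the only relevant page was ranked third.
ranked_relevance = [0, 0, 1, 0, 0, 0]
print(round(ndcg_at_k(ranked_relevance, k=5), 3))  # 0.5
```

A benchmark score is this per-query value averaged over all queries of a task.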

ViDoRe V3 (public tasks) — NDCG@10

Model	Dim	Avg
Nemotron-colembed-8b-v2	4096	63.53
Nemotron-colembed-4b-v2	2560	62.02
Ops-Colqwen3-4B	2560	61.26
🟦 Argus-2B	1024	60.09
🟦 Argus-4B	1024	62.09
🟦 Argus-9B	1024	62.50

MIRACL-Vision (multilingual) — NDCG@10

Model	Macro Avg (10 lang)
Nemotron-colembed-8b-v2	0.7492
🟦 Argus-9B	🥇 0.7552

Argus-9B is the best system on 5 of 10 languages, including the long-tail Yoruba split (0.8099 vs. 0.5252 for Nemotron-colembed-8b-v2).

Efficiency

Model	Tok/page	Dim	MB/page	Doc encode (ms)	Query online (ms)
Nemotron-colembed-8b-v2	2304	4096	18.9	5090	278
🟦 Argus-9B	2048	1024	4.2	374	136

Measured on a single H100 80GB in bf16 with batch size 1, on the ViDoRe V2 ESG Reports task.
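
The size and speed ratios quoted in the Highlights follow from the columns of this table. A quick check, assuming each stored value takes 2 bytes (bf16) and 1 MB = 10^6 bytes; both are assumptions about how the MB/page column was computed:

```python
BYTES_PER_VALUE = 2  # assumption: index stored in bf16


def index_mb(tokens_per_page, dim, bytes_per_value=BYTES_PER_VALUE):
    """Per-page multi-vector index size in MB (1 MB = 10**6 bytes)."""
    return tokens_per_page * dim * bytes_per_value / 1e6


argus = index_mb(2048, 1024)      # Tok/page and Dim from the Argus-9B row
nemotron = index_mb(2304, 4096)   # Tok/page and Dim from the Nemotron row

print(f"Argus-9B index: {argus:.1f} MB/page")        # 4.2
print(f"Nemotron index: {nemotron:.1f} MB/page")     # 18.9
print(f"index ratio:    {nemotron / argus:.1f}x")    # 4.5
print(f"doc encode:     {5090 / 374:.1f}x")          # 13.6
print(f"query online:   {278 / 136:.1f}x")           # 2.0
```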

🤗 Models

Model	Backbone	Params	Dim	Experts	🤗 Hugging Face
Argus-2B	Qwen3.5-VL-2B	2.32B	1024	4 (top-2)	`DataScience-UIBK/Argus-Colqwen3.5-2b-v0`
Argus-4B	Qwen3.5-VL-4B	4.71B	1024	4 (top-2)	`DataScience-UIBK/Argus-Colqwen3.5-4b-v0`
Argus-9B	Qwen3.5-VL-9B	8.82B	1024	4 (top-2)	`DataScience-UIBK/Argus-Colqwen3.5-9b-v0`

Each model also ships a -bf16 sibling (e.g. …-9b-v0-bf16) for lower-memory inference.

🚀 Quick Start

Installation

pip install "transformers>=5.0.0,<6.0.0" torch pillow

⚠️ Argus needs transformers>=5.0 — the Qwen3.5-VL backbone (transformers.models.qwen3_5) only ships in the 5.x line. If you install MTEB first, upgrade transformers afterwards and clear ~/.cache/huggingface/modules/transformers_modules.
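
Because installing MTEB can silently downgrade transformers, it may help to check the installed version before loading the model. The helper below is a hypothetical convenience, not part of the Argus code; it compares only the major version, which is a simplification of the `>=5.0.0,<6.0.0` pin.

```python
def transformers_version_ok(version: str) -> bool:
    """True if `version` is in the 5.x line (major version check only)."""
    major = int(version.split(".")[0])
    return major == 5


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if transformers_version_ok(installed) else "needs the 5.x line"
        print(f"transformers {installed}: {status}")
```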

Inference

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0"

model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, dtype="bfloat16"
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Encode queries and document page images
queries = ["What was the revenue in 2019?", "What does the chart show over time?"]
images  = [Image.open("page_1.png"), Image.open("page_2.png")]

q = model.encode_queries(processor, queries)
d = model.encode_images(processor, images)

# MaxSim late-interaction scoring -> [num_queries x num_docs]
scores = processor.score(q, d)
print(scores)
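
For intuition, MaxSim late-interaction scoring can be written in a few lines of NumPy. This is a sketch of the general technique on toy 2-dim vectors, not the code behind `processor.score`: each query token takes its best dot-product match among the document tokens, and the per-token maxima are summed.

```python
import numpy as np


def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot-product match among document tokens."""
    sim = query_vecs @ doc_vecs.T          # [num_query_tokens, num_doc_tokens]
    return float(sim.max(axis=1).sum())


def score_matrix(queries, docs):
    """MaxSim for every (query, document) pair -> [num_queries, num_docs]."""
    return np.array([[maxsim(q, d) for d in docs] for q in queries])


# Toy embeddings: each row is one token vector.
q0 = np.array([[1.0, 0.0], [0.0, 1.0]])
d0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # matches both query tokens
d1 = np.array([[1.0, 0.0], [0.9, 0.1]])               # matches only the first

toy_scores = score_matrix([q0], [d0, d1])
print(toy_scores)                  # d0 scores 2.0, d1 scores 1.1
print(toy_scores.argmax(axis=1))   # index of the best document per query
```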

📐 Evaluation with MTEB

Argus is registered on the MTEB ViDoRe leaderboard. To reproduce the V1/V2/V3 numbers:

pip install "mteb>=2.12,<3.0.0"
# IMPORTANT: re-pin transformers AFTER mteb (mteb pulls 4.57.x, which lacks qwen3_5)
pip install "transformers>=5.0.0,<6.0.0"
rm -rf ~/.cache/huggingface/modules/transformers_modules

import mteb

model = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0")

# ViDoRe V2 — out-of-domain split
tasks = mteb.get_tasks(tasks=[
    "Vidore2BioMedicalLecturesRetrieval",
    "Vidore2ESGReportsRetrieval",
    "Vidore2ESGReportsHLRetrieval",  # HL = human-labeled ESG reports
    "Vidore2EconomicsReportsRetrieval",
])

results = mteb.MTEB(tasks=tasks).run(model, output_folder="results/argus-9b")

The same call works for the ViDoRe V1 tasks and the public ViDoRe V3 tasks — swap the task list. Document features are encoded once per page; query conditioning is applied online against the cached grids, matching the deployment path described in the paper.
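
The offline/online split can be pictured as a cache of query-independent page grids plus a cheap per-query step. The sketch below shows only that caching pattern. `PageGridCache`, `toy_encode_page`, and `toy_condition_and_score` are hypothetical stand-ins (the "grid" is a set of words and the score is word overlap), not Argus APIs.

```python
class PageGridCache:
    """Caches query-independent grids so each page is encoded only once."""

    def __init__(self, encode_page):
        self._encode_page = encode_page   # expensive, query-independent step
        self._grids = {}

    def index(self, page_id, page):
        """Encode a page offline unless its grid is already cached."""
        if page_id not in self._grids:
            self._grids[page_id] = self._encode_page(page)
        return self._grids[page_id]

    def search(self, query, condition_and_score):
        """Run the cheap per-query step against every cached grid, best first."""
        scored = [(condition_and_score(query, grid), pid)
                  for pid, grid in self._grids.items()]
        return [pid for _, pid in sorted(scored, reverse=True)]


encoder_calls = []


def toy_encode_page(page):
    encoder_calls.append(page)            # record every expensive call
    return set(page.lower().split())


def toy_condition_and_score(query, grid):
    return len(set(query.lower().split()) & grid)


cache = PageGridCache(toy_encode_page)
cache.index("p1", "revenue table for 2019")
cache.index("p2", "chart of growth over time")
cache.index("p1", "revenue table for 2019")   # already cached, encoder not re-run

print(cache.search("revenue in 2019", toy_condition_and_score))   # ['p1', 'p2']
print(cache.search("growth chart", toy_condition_and_score))      # ['p2', 'p1']
print(len(encoder_calls))                                         # 2
```

In Argus terms, the expensive call corresponds to the image encoder and the per-query callable to the router, fusion, projection, and MaxSim steps.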

🏋️ Training

🚧 Coming soon. Training code, configs, and the router-warmup-then-joint recipe will be released here. Stay tuned.

📁 Repository Structure

Argus-Retriever/
├── README.md              # this file
├── assets/                # figures
├── inference/             # 🚧 coming soon
└── training/              # 🚧 coming soon

📝 Citation

If you use Argus in your research, please cite:

@article{abdallah2026argus,
  title={Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval},
  author={Abdallah, Abdelrahman and Abdalla, Mahmoud and Ali, Mohammed and Jatowt, Adam},
  journal={arXiv preprint arXiv:2606.04300},
  year={2026}
}

📄 License

Released under the Apache-2.0 license.

🙏 Acknowledgements

Argus builds on the ColPali late-interaction line and the Qwen3.5-VL backbone. We thank the ViDoRe and MTEB maintainers for the benchmark infrastructure, and the ColPali, ColQwen, ColNomic, and Nemotron ColEmbed teams for releasing the baselines that made fair comparison possible.

Built by the Data Science Group @ University of Innsbruck
