The first late-interaction visual document retriever whose document representation adapts to the query (written D(q)) while staying a drop-in ColPali-style multi-vector index.
Highlights • Results • Models • Quick Start • MTEB • Training • Citation
TL;DR: In ColPali, ColQwen, ColNomic and Nemotron ColEmbed, document pages are encoded without seeing the query. Argus inserts a region-aware Mixture-of-Experts inside the document encoder whose router conditions on a pooled query context z_q, so the same page is encoded differently for a table lookup, a chart question, or a paragraph-level evidence request, while the output stays a multi-vector index scored by MaxSim.
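For readers new to late interaction, MaxSim can be sketched in a few lines. This is a minimal NumPy illustration, not the Argus implementation: the function name `maxsim_score` and the toy vectors are assumptions made for this example.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: for each query token take its best-matching
    document vector, then sum those maxima over all query tokens."""
    sim = query_vecs @ doc_vecs.T            # [num_query_tokens, num_doc_vectors]
    return float(sim.max(axis=1).sum())

# Toy, hand-checkable example: 2 query tokens, 3 document vectors.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
D = np.array([[0.5, 0.0], [0.9, 0.1], [0.0, 0.3]])

# Token 1 matches best at 0.9, token 2 at 0.3, so the score is about 1.2.
print(maxsim_score(Q, D))
```

In Argus the document side of this product is the query-conditioned grid D(q) rather than a fixed page encoding; the scoring rule itself is unchanged.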
- Query-conditioned documents. A per-region router fuses region content, 2D position, and the query context z_q to mix K=4 latent experts (+1 always-on shared expert), producing a query-dependent document grid D(q).
- State of the art at 8-9B scale. Argus-9B reaches 86.0 NDCG@5 on the ViDoRe V1+V2 leaderboard, the highest reported value for an open late-interaction model, with the largest gains on the out-of-domain V2 split (+4.3 over Nemotron-colembed-8b-v2, the prior best by average).
- Compact index. A fixed 1024-dim retrieval head, narrower than the 2560-dim and 4096-dim heads of recent SOTA, keeps the per-page index at 4.2 MB, up to 4.5× smaller than Nemotron-8B.
- Deployable. The image encoder runs once per page offline; only the query branch, router, fusion, projection, and MaxSim run per query against cached visual grids. Argus-9B is 13.6× faster offline and 2.0× faster per query than Nemotron-8B.
- Honest training budget. Trained on 9.3% of the available public supervision (593,677 pairs), with no model soup, seed averaging, or checkpoint merging.
Argus architecture. The query branch emits retrieval embeddings Q and a pooled context z_q; the document branch taps the backbone at two depths, routes pooled regions with z_q, and fuses latent + shared experts into a query-conditioned grid D(q) scored by MaxSim.
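The routing step can be illustrated with a toy NumPy sketch. Everything about the parameterization here is an assumption for illustration (linear experts, concatenation as the fusion, a softmax over the two kept experts, random weights, tiny shapes); only the overall shape of the idea comes from the description above: a per-region router sees region content, 2D position and z_q, keeps the top 2 of K=4 latent experts, and adds an always-on shared expert.

```python
import numpy as np

rng = np.random.default_rng(0)
R, H, K, TOP_K = 6, 8, 4, 2   # regions, toy hidden size, latent experts, experts kept

# Hypothetical parameters: one linear map per latent expert, one shared expert, one router.
experts = rng.normal(size=(K, H, H)) / np.sqrt(H)
shared = rng.normal(size=(H, H)) / np.sqrt(H)
router_w = rng.normal(size=(H + 2 + H, K)) / np.sqrt(2 * H + 2)

def query_conditioned_grid(regions, positions, z_q):
    """Return D(q): each region is mixed by a router that sees content, position and z_q."""
    z = np.broadcast_to(z_q, (regions.shape[0], z_q.shape[0]))
    router_in = np.concatenate([regions, positions, z], axis=1)    # [R, H + 2 + H]
    logits = router_in @ router_w                                   # [R, K]
    top = np.argsort(logits, axis=1)[:, -TOP_K:]                    # indices of the kept experts
    top_logits = np.take_along_axis(logits, top, axis=1)
    gates = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)                       # softmax over kept experts
    out = regions @ shared                                          # always-on shared expert
    for k in range(TOP_K):
        chosen = experts[top[:, k]]                                 # [R, H, H]
        out = out + gates[:, k:k + 1] * np.einsum("rh,rhd->rd", regions, chosen)
    return out

regions = rng.normal(size=(R, H))        # stands in for cached visual features
positions = rng.uniform(size=(R, 2))     # normalized (x, y) region centres
z_table = rng.normal(size=H)             # pooled context of one query
z_chart = rng.normal(size=H)             # pooled context of a different query

D_table = query_conditioned_grid(regions, positions, z_table)
D_chart = query_conditioned_grid(regions, positions, z_chart)
print(D_table.shape, np.allclose(D_table, D_chart))
```

The same cached `regions` yield two different grids because only the router input changes with the query, which is what lets the expensive image encoding stay offline.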
| Model | Dim | V1 | V2 | Avg |
|---|---|---|---|---|
| ColQwen2.5 | 128 | 89.5 | 59.3 | 80.9 |
| ColNomic-7b | 128 | 89.7 | 60.8 | 81.5 |
| Sauerkraut-8b | - | 91.1 | 62.9 | 83.0 |
| Nemotron-colembed-4b-v2 | 2560 | 91.6 | 63.9 | 83.7 |
| Ops-Colqwen3-4B | 2560 | 91.4 | 67.8 | 84.6 |
| Nemotron-colembed-8b-v2 | 4096 | 92.7 | 64.9 | 84.7 |
| Argus-2B | 1024 | 91.5 | 61.5 | 82.9 |
| Argus-4B | 1024 | 92.3 | 64.1 | 84.2 |
| Argus-9B | 1024 | 92.7 | 69.2 | 86.0 |
Argus-9B is the best system on all four out-of-domain V2 tasks (BiomedicalLectures, ESGReports, ESGReports-HighLevel, EconomicsReports).
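The Avg column looks consistent with a task-weighted mean over ten ViDoRe V1 tasks and the four V2 tasks. That weighting is an assumption inferred from the numbers, not something stated here. The short check below recomputes it from the rounded V1 and V2 columns, so it can differ from the table by up to about 0.1.

```python
# Assumption: Avg = (10 * V1 + 4 * V2) / 14, a mean over 10 V1 tasks and 4 V2 tasks.
N_V1, N_V2 = 10, 4

# (V1, V2, reported Avg) copied from the table above.
rows = {
    "ColQwen2.5": (89.5, 59.3, 80.9),
    "Nemotron-colembed-8b-v2": (92.7, 64.9, 84.7),
    "Argus-2B": (91.5, 61.5, 82.9),
    "Argus-4B": (92.3, 64.1, 84.2),
    "Argus-9B": (92.7, 69.2, 86.0),
}

def weighted_avg(v1: float, v2: float) -> float:
    """Task-weighted mean of the V1 and V2 split scores."""
    return (N_V1 * v1 + N_V2 * v2) / (N_V1 + N_V2)

for name, (v1, v2, reported) in rows.items():
    print(f"{name}: recomputed {weighted_avg(v1, v2):.2f}, reported {reported}")
```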
| Model | Dim | Avg |
|---|---|---|
| Nemotron-colembed-8b-v2 | 4096 | 63.53 |
| Nemotron-colembed-4b-v2 | 2560 | 62.02 |
| Ops-Colqwen3-4B | 2560 | 61.26 |
| Argus-2B | 1024 | 60.09 |
| Argus-4B | 1024 | 62.09 |
| Argus-9B | 1024 | 62.50 |
| Model | Macro Avg (10 lang) |
|---|---|
| Nemotron-colembed-8b-v2 | 0.7492 |
| Argus-9B | 0.7552 |
Argus-9B is the best system on 5 of the 10 languages, including the long-tail Yoruba split (0.8099 vs. 0.5252 for Nemotron-8B).
| Model | Tok/page | Dim | MB/page | Doc encode (ms) | Query online (ms) |
|---|---|---|---|---|---|
| Nemotron-colembed-8b-v2 | 2304 | 4096 | 18.9 | 5090 | 278 |
| Argus-9B | 2048 | 1024 | 4.2 | 374 | 136 |
Single H100 80GB, bf16, batch 1, on the ViDoRe V2 ESG Reports task.
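The MB/page column and the quoted ratios can be recomputed from the other columns. The sketch below assumes the index stores one bf16 value (2 bytes) per dimension per token and that 1 MB means 10^6 bytes; both are assumptions that happen to reproduce the table.

```python
BYTES_PER_VALUE = 2  # assumption: vectors are stored in bf16

def index_mb_per_page(tokens_per_page: int, dim: int) -> float:
    """Per-page multi-vector index size in MB (1 MB = 10**6 bytes)."""
    return tokens_per_page * dim * BYTES_PER_VALUE / 1e6

nemotron_mb = index_mb_per_page(2304, 4096)
argus_mb = index_mb_per_page(2048, 1024)

print(f"Nemotron-colembed-8b-v2: {nemotron_mb:.1f} MB/page")   # 18.9
print(f"Argus-9B: {argus_mb:.1f} MB/page")                      # 4.2
print(f"Index size ratio: {nemotron_mb / argus_mb:.1f}x")       # 4.5x
print(f"Doc encode speedup: {5090 / 374:.1f}x")                 # 13.6x
print(f"Query online speedup: {278 / 136:.1f}x")                # 2.0x
```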
| Model | Backbone | Params | Dim | Experts | Hugging Face |
|---|---|---|---|---|---|
| Argus-2B | Qwen3.5-VL-2B | 2.32B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-2b-v0 |
| Argus-4B | Qwen3.5-VL-4B | 4.71B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-4b-v0 |
| Argus-9B | Qwen3.5-VL-9B | 8.82B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-9b-v0 |
Each model also ships a `-bf16` sibling (e.g. `…-9b-v0-bf16`) for lower-memory inference.
pip install "transformers>=5.0.0,<6.0.0" torch pillow
Note: Argus needs `transformers>=5.0`, because the Qwen3.5-VL backbone (`transformers.models.qwen3_5`) only ships in the 5.x line. If you install MTEB first, upgrade transformers afterwards and clear `~/.cache/huggingface/modules/transformers_modules`.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
model_id = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0"
model = AutoModel.from_pretrained(
model_id, trust_remote_code=True, dtype="bfloat16"
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Encode queries and document page images
queries = ["What was the revenue in 2019?", "What does the chart show over time?"]
images = [Image.open("page_1.png"), Image.open("page_2.png")]
q = model.encode_queries(processor, queries)
d = model.encode_images(processor, images)
# MaxSim late-interaction scoring -> [num_queries x num_docs]
scores = processor.score(q, d)
print(scores)

Argus is registered on the MTEB ViDoRe leaderboard. To reproduce the V1/V2/V3 numbers:
pip install "mteb>=2.12,<3.0.0"
# IMPORTANT: re-pin transformers AFTER mteb (mteb pulls 4.57.x, which lacks qwen3_5)
pip install "transformers>=5.0.0,<6.0.0"
rm -rf ~/.cache/huggingface/modules/transformers_modules

import mteb
model = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0")
# ViDoRe V2: out-of-domain split
tasks = mteb.get_tasks(tasks=[
"Vidore2BiomedicalLecturesRetrieval",
"Vidore2ESGReportsRetrieval",
"Vidore2ESGReportsHumanLabeledRetrieval",
"Vidore2EconomicsReportsRetrieval",
])
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results/argus-9b")

The same call works for the ViDoRe V1 tasks and the public ViDoRe V3 tasks once you swap the task list. Document features are encoded once per page; query conditioning is applied online against the cached grids, matching the deployment path described in the paper.
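The headline numbers reproduced by this run are NDCG@5. As a reminder of what that metric measures, here is a minimal pure-Python sketch using the common linear-gain formulation. It is illustrative only: MTEB computes the metric through its own evaluation code, and `dcg` and `ndcg_at_k` are names chosen for this example.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, all_relevances, k=5):
    """ranked_relevances: relevance of each retrieved page, in ranked order.
    all_relevances: relevance labels of every judged page for the query."""
    ideal_dcg = dcg(sorted(all_relevances, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevances[:k]) / ideal_dcg

# The single relevant page is retrieved at rank 2, so NDCG@5 = 1 / log2(3).
print(round(ndcg_at_k([0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]), 4))  # 0.6309
```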
Coming soon. Training code, configs, and the router-warmup-then-joint recipe will be released here. Stay tuned.
Argus-Retriever/
├── README.md      # this file
├── assets/        # figures
├── inference/     # coming soon
└── training/      # coming soon
If you use Argus in your research, please cite:
@article{abdallah2026argus,
title={Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval},
author={Abdallah, Abdelrahman and Abdalla, Mahmoud and Ali, Mohammed and Jatowt, Adam},
journal={arXiv preprint arXiv:2606.04300},
year={2026}
}

Released under the Apache-2.0 license.
Argus builds on the ColPali late-interaction line and the Qwen3.5-VL backbone. We thank the ViDoRe and MTEB maintainers for the benchmark infrastructure, and the ColPali, ColQwen, ColNomic, and Nemotron ColEmbed teams for releasing the baselines that made fair comparison possible.