DataScienceUIBK/Argus-Retriever

👁️ Argus-Retriever

Region-Aware Query-Conditioned Mixture-of-Experts for Visual Document Retrieval

The first late-interaction visual document retriever whose document representation adapts to the query, written D(q), while remaining a drop-in ColPali-style multi-vector index.


Highlights • Results • Models • Quick Start • MTEB • Training • Citation


✨ Highlights

TL;DR: In ColPali, ColQwen, ColNomic and Nemotron ColEmbed, document pages are encoded without seeing the query. Argus inserts a region-aware Mixture-of-Experts inside the document encoder whose router conditions on a pooled query context z_q, so the same page is encoded differently for a table lookup, a chart question, or a paragraph-level evidence request. The output is still a multi-vector index scored by MaxSim.

  • 🎯 Query-conditioned documents. A per-region router fuses region content, 2D position, and the query context z_q to mix K=4 latent experts (+1 always-on shared expert), producing a query-dependent document grid D(q).
  • 🏆 State of the art at 8–9B scale. Argus-9B reaches 86.0 NDCG@5 on the ViDoRe V1+V2 leaderboard, the highest value reported for an open late-interaction model. The largest gains are on the out-of-domain V2 split, where it scores 69.2: +4.3 over Nemotron-colembed-8b-v2 and +1.4 over Ops-Colqwen3-4B, the strongest V2 baseline in the results table.
  • 📦 Compact index. A fixed 1024-dim retrieval head, narrower than the 2560-dim and 4096-dim heads of recent SOTA models, keeps the per-page index at 4.2 MB, up to 4.5× smaller than Nemotron-8B.
  • ⚡ Deployable. The image encoder runs once per page offline; only the query branch, router, fusion, projection, and MaxSim run per query against cached visual grids. Argus-9B is 13.6× faster offline and 2.0× faster per query than Nemotron-8B.
  • 🧪 Honest training budget. Trained on 9.3% of the available public supervision (593,677 pairs), with no model soup, seed averaging, or checkpoint merging.
Figure: Argus architecture. The query branch emits retrieval embeddings Q and a pooled context z_q; the document branch taps the backbone at two depths, routes pooled regions with z_q, and fuses latent + shared experts into a query-conditioned grid D(q) scored by MaxSim.
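
To make the routing idea concrete, here is a minimal pure-Python sketch of one pooled region passing through a query-conditioned mixture. Only the ingredients come from this README: region content, 2D position, the query context z_q, K=4 latent experts with top-2 selection, and one always-on shared expert. Everything else is an assumption made for illustration: the concatenated router input, the linear router and experts, the softmax gate renormalized over the top-2, and every name (`route_region`, `router_w`, and so on). It is not the released implementation.

```python
import math

def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_region(region, position, z_q, router_w, experts, shared, top_k=2):
    """Return (mixed_vector, gate_weights) for a single pooled region.

    Hypothetical design: the router sees [region ; position ; z_q], keeps the
    top_k latent experts, and adds their weighted outputs to the shared expert.
    """
    gates = softmax(matvec(router_w, region + position + z_q))
    chosen = sorted(range(len(gates)), key=lambda k: gates[k], reverse=True)[:top_k]
    norm = sum(gates[k] for k in chosen)
    weights = {k: gates[k] / norm for k in chosen}  # renormalize over top_k

    mixed = matvec(shared, region)                  # always-on shared expert
    for k, w in weights.items():
        expert_out = matvec(experts[k], region)
        mixed = [m + w * e for m, e in zip(mixed, expert_out)]
    return mixed, weights

# Toy setup: 2-dim regions, K=4 experts that scale the input by 1, 2, 3, 4.
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [[[s, 0.0], [0.0, s]] for s in (1.0, 2.0, 3.0, 4.0)]
# In this toy router, each row reads only the z_q slots (the last two inputs).
router_w = [
    [0, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 5],
    [0, 0, 0, 0, -5, 0],
    [0, 0, 0, 0, 0, -5],
]
region, position = [1.0, 2.0], [0.25, 0.75]

out_a, w_a = route_region(region, position, [1.0, 0.5], router_w, experts, identity)
out_b, w_b = route_region(region, position, [-1.0, -0.5], router_w, experts, identity)
print(sorted(w_a), sorted(w_b))  # different experts fire for different z_q
print(out_a, out_b)              # so the same region is encoded differently
```

The point of the toy example is only that the region and its position are held fixed while z_q changes, and the resulting vector changes with it.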

📊 Results

ViDoRe V1 + V2 Leaderboard (NDCG@5)

| Model | Dim | V1 | V2 | Avg |
|---|---|---|---|---|
| ColQwen2.5 | 128 | 89.5 | 59.3 | 80.9 |
| ColNomic-7b | 128 | 89.7 | 60.8 | 81.5 |
| Sauerkraut-8b | - | 91.1 | 62.9 | 83.0 |
| Nemotron-colembed-4b-v2 | 2560 | 91.6 | 63.9 | 83.7 |
| Ops-Colqwen3-4B | 2560 | 91.4 | 67.8 | 84.6 |
| Nemotron-colembed-8b-v2 | 4096 | 92.7 | 64.9 | 84.7 |
| 🟦 Argus-2B | 1024 | 91.5 | 61.5 | 82.9 |
| 🟦 Argus-4B | 1024 | 92.3 | 64.1 | 84.2 |
| 🟦 Argus-9B | 1024 | 92.7 | 69.2 | 🥇 86.0 |

Argus-9B is the best system on all four out-of-domain V2 tasks (BiomedicalLectures, ESGReports, ESGReports-HighLevel, EconomicsReports).
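
The Avg column is not the plain mean of V1 and V2. The reported values are consistent with a task-count-weighted mean, assuming 10 tasks in ViDoRe V1 and the 4 V2 tasks named above. That weighting is an inference from the numbers, not something this README states. A quick check:

```python
V1_TASKS, V2_TASKS = 10, 4  # assumed task counts; the four V2 tasks are named above

def leaderboard_avg(v1, v2):
    """Task-count-weighted mean of the V1 and V2 NDCG@5 scores."""
    return (V1_TASKS * v1 + V2_TASKS * v2) / (V1_TASKS + V2_TASKS)

rows = {  # model: (V1, V2, reported Avg), copied from the leaderboard table
    "ColQwen2.5": (89.5, 59.3, 80.9),
    "Nemotron-colembed-8b-v2": (92.7, 64.9, 84.7),
    "Argus-2B": (91.5, 61.5, 82.9),
    "Argus-4B": (92.3, 64.1, 84.2),
    "Argus-9B": (92.7, 69.2, 86.0),
}
for name, (v1, v2, reported) in rows.items():
    print(f"{name}: computed {leaderboard_avg(v1, v2):.2f}, reported {reported}")
```

The computed values land within 0.1 of the reported ones, which is the slack that rounding of the per-split scores would allow.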

ViDoRe V3 (public tasks, NDCG@10)

| Model | Dim | Avg |
|---|---|---|
| Nemotron-colembed-8b-v2 | 4096 | 63.53 |
| Nemotron-colembed-4b-v2 | 2560 | 62.02 |
| Ops-Colqwen3-4B | 2560 | 61.26 |
| 🟦 Argus-2B | 1024 | 60.09 |
| 🟦 Argus-4B | 1024 | 62.09 |
| 🟦 Argus-9B | 1024 | 62.50 |
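
The leaderboards above report NDCG at a cutoff (k=5 for V1+V2, k=10 for V3). As a reminder of what that number measures, here is a minimal sketch using linear gain and a log2 rank discount, which is a common convention. The exact variant used by the benchmark tooling may differ in details, so treat this as illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (rank 1 has discount 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """ranked_relevances: relevance label of every candidate, in ranked order."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

first = ndcg_at_k([1, 0, 0, 0, 0, 0], k=5)   # relevant page ranked first
third = ndcg_at_k([0, 0, 1, 0, 0, 0], k=5)   # relevant page ranked third
missed = ndcg_at_k([0, 0, 0, 0, 0, 1], k=5)  # relevant page outside the top 5
print(first, third, missed)
```

A relevant page at rank 1 scores 1.0, the same page at rank 3 scores 0.5, and a page that falls outside the cutoff contributes nothing.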

MIRACL-Vision (multilingual, NDCG@10)

| Model | Macro Avg (10 languages) |
|---|---|
| Nemotron-colembed-8b-v2 | 0.7492 |
| 🟦 Argus-9B | 🥇 0.7552 |

Argus-9B is the best system on 5 of the 10 languages, including the long-tail Yoruba split (0.8099 vs. 0.5252 for Nemotron-8B).

Efficiency

| Model | Tok/page | Dim | MB/page | Doc encode (ms) | Query online (ms) |
|---|---|---|---|---|---|
| Nemotron-colembed-8b-v2 | 2304 | 4096 | 18.9 | 5090 | 278 |
| 🟦 Argus-9B | 2048 | 1024 | 4.2 | 374 | 136 |

Single H100 80GB, bf16, batch 1, on the ViDoRe V2 ESG Reports task.
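
The MB/page column and the speedup figures quoted in the highlights can be reproduced from the table with simple arithmetic. The sketch below assumes each stored value takes 2 bytes (bf16) and that MB means 10^6 bytes; both are assumptions that happen to match the reported numbers.

```python
def index_mb_per_page(tokens, dim, bytes_per_value=2):
    """Size of one page's multi-vector index in decimal megabytes (bf16 = 2 bytes)."""
    return tokens * dim * bytes_per_value / 1e6

argus_mb = index_mb_per_page(2048, 1024)
nemotron_mb = index_mb_per_page(2304, 4096)
index_ratio = nemotron_mb / argus_mb
doc_speedup = 5090 / 374      # offline document encoding, ms per page
query_speedup = 278 / 136     # online query latency, ms

print(f"Argus-9B: {argus_mb:.1f} MB/page, Nemotron-8B: {nemotron_mb:.1f} MB/page")
print(f"index ratio: {index_ratio:.1f}x")
print(f"doc-encode speedup: {doc_speedup:.1f}x, query speedup: {query_speedup:.1f}x")
```

At this rate a corpus of one million pages would need roughly 4.2 TB for Argus-9B versus about 18.9 TB for the 4096-dim baseline, before any compression.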


🤗 Models

| Model | Backbone | Params | Dim | Experts | 🤗 Hugging Face |
|---|---|---|---|---|---|
| Argus-2B | Qwen3.5-VL-2B | 2.32B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-2b-v0 |
| Argus-4B | Qwen3.5-VL-4B | 4.71B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-4b-v0 |
| Argus-9B | Qwen3.5-VL-9B | 8.82B | 1024 | 4 (top-2) | DataScience-UIBK/Argus-Colqwen3.5-9b-v0 |

Each model also ships a -bf16 sibling (e.g. …-9b-v0-bf16) for lower-memory inference.


🚀 Quick Start

Installation

pip install "transformers>=5.0.0,<6.0.0" torch pillow

⚠️ Argus needs transformers>=5.0: the Qwen3.5-VL backbone (transformers.models.qwen3_5) ships only in the 5.x line. If you install MTEB first, upgrade transformers afterwards and delete the cached remote-code modules in ~/.cache/huggingface/modules/transformers_modules.
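
If you want to fail fast on a mismatched environment, a small check like the following can run before loading the model. It is a sketch using only the standard library; the function names are made up here, and it approximates the `>=5.0.0,<6.0.0` pin by looking at the major version only.

```python
from importlib.metadata import PackageNotFoundError, version

def in_supported_range(version_string):
    """True if the major version is 5, approximating the >=5.0.0,<6.0.0 pin."""
    major = int(version_string.split(".")[0])
    return major == 5

def transformers_ok():
    """Check the installed transformers package; False if it is missing."""
    try:
        return in_supported_range(version("transformers"))
    except PackageNotFoundError:
        return False

if __name__ == "__main__":
    if transformers_ok():
        print("transformers version looks compatible")
    else:
        print('run: pip install "transformers>=5.0.0,<6.0.0"')
```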

Inference

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0"

model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, dtype="bfloat16"
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Encode queries and document page images
queries = ["What was the revenue in 2019?", "What does the chart show over time?"]
images  = [Image.open("page_1.png"), Image.open("page_2.png")]

q = model.encode_queries(processor, queries)
d = model.encode_images(processor, images)

# MaxSim late-interaction scoring -> [num_queries x num_docs]
scores = processor.score(q, d)
print(scores)
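
For readers new to late interaction: MaxSim takes each query token vector, finds its highest dot product against any vector of the document, and sums those maxima over the query. The sketch below shows the formula in plain Python on nested lists. It is not the code behind `processor.score`, and `maxsim` and `score_matrix` are names chosen here for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs, doc_vecs):
    """Sum over query vectors of the best-matching document vector's similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def score_matrix(queries, docs):
    """Return a [num_queries x num_docs] list of MaxSim scores."""
    return [[maxsim(q, d) for d in docs] for q in queries]

# Two toy queries (2 token vectors each) against two toy pages.
queries = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
docs = [
    [[1.0, 0.0], [0.0, 1.0]],   # page covering both directions
    [[0.5, 0.5]],               # page with a single mixed vector
]
toy_scores = score_matrix(queries, docs)
print(toy_scores)
```

In Argus the document vectors passed to this scoring step depend on the query, so each (query, page) pair is scored against that query's version of the page.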

📏 Evaluation with MTEB

Argus is registered on the MTEB ViDoRe leaderboard. To reproduce the V1/V2/V3 numbers:

pip install "mteb>=2.12,<3.0.0"
# IMPORTANT: re-pin transformers AFTER mteb (mteb pulls 4.57.x, which lacks qwen3_5)
pip install "transformers>=5.0.0,<6.0.0"
rm -rf ~/.cache/huggingface/modules/transformers_modules

import mteb

model = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0")

# ViDoRe V2: out-of-domain split
tasks = mteb.get_tasks(tasks=[
    "Vidore2BiomedicalLecturesRetrieval",
    "Vidore2ESGReportsRetrieval",
    "Vidore2ESGReportsHumanLabeledRetrieval",
    "Vidore2EconomicsReportsRetrieval",
])

results = mteb.MTEB(tasks=tasks).run(model, output_folder="results/argus-9b")

The same call works for the ViDoRe V1 tasks and the public ViDoRe V3 tasks; only the task list changes. Document features are encoded once per page, and query conditioning is applied online against the cached grids, matching the deployment path described in the paper.
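
The offline/online split described above can be pictured with a toy index. This is a sketch under stated assumptions: `CachedPageIndex`, `toy_encode`, `toy_condition`, and `toy_score` are hypothetical stand-ins, not part of the Argus or MTEB APIs. It only shows that the expensive page encoding happens once while the query-dependent step runs per query.

```python
class CachedPageIndex:
    """Toy index: page encoding runs once, conditioning runs per query."""

    def __init__(self, encode_page):
        self._encode_page = encode_page  # stand-in for the offline image encoder
        self._grids = {}                 # page_id -> cached visual grid
        self.encode_calls = 0

    def add(self, page_id, page):
        """Encode and cache a page unless it is already indexed."""
        if page_id not in self._grids:
            self._grids[page_id] = self._encode_page(page)
            self.encode_calls += 1

    def search(self, query_vecs, z_q, condition, score):
        """Rank cached pages; `condition` builds D(q) from a cached grid and z_q."""
        ranked = [(score(query_vecs, condition(grid, z_q)), page_id)
                  for page_id, grid in self._grids.items()]
        return sorted(ranked, reverse=True)

# Toy stand-ins so the sketch runs end to end.
def toy_encode(page):                  # a "page" is already a list of vectors here
    return [list(vec) for vec in page]

def toy_condition(grid, z_q):          # scale each vector by a query-dependent factor
    return [[z_q * x for x in vec] for vec in grid]

def toy_score(query_vecs, doc_vecs):   # MaxSim on nested lists
    return sum(max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
               for q in query_vecs)

index = CachedPageIndex(toy_encode)
index.add("page_1", [[1.0, 0.0]])
index.add("page_2", [[0.0, 1.0]])
results = [index.search([[1.0, 0.0]], z_q, toy_condition, toy_score)
           for z_q in (1.0, 2.0)]      # two queries, no page is re-encoded
print(results)
print(index.encode_calls)
```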


🏋️ Training

🚧 Coming soon. Training code, configs, and the router-warmup-then-joint recipe will be released here. Stay tuned.


📁 Repository Structure

Argus-Retriever/
├── README.md              # this file
├── assets/                # figures
├── inference/             # 🚧 coming soon
└── training/              # 🚧 coming soon

📝 Citation

If you use Argus in your research, please cite:

@article{abdallah2026argus,
  title={Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval},
  author={Abdallah, Abdelrahman and Abdalla, Mahmoud and Ali, Mohammed and Jatowt, Adam},
  journal={arXiv preprint arXiv:2606.04300},
  year={2026}
}

📄 License

Released under the Apache-2.0 license.

🙏 Acknowledgements

Argus builds on the ColPali late-interaction line and the Qwen3.5-VL backbone. We thank the ViDoRe and MTEB maintainers for the benchmark infrastructure, and the ColPali, ColQwen, ColNomic, and Nemotron ColEmbed teams for releasing the baselines that made fair comparison possible.
