M4-RAG: A massive-scale multilingual multi-cultural multimodal RAG
A massive-scale benchmark for evaluating retrieval-augmented VQA across languages and modalities.
Vision–language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image–question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
Latest publications
Immune: Improving safety against jailbreaks in Multi-modal LLMs via Inference-Time Alignment
An inference-time defense framework that leverages a safe reward model to defend against jailbreak attacks.
CVPR
Vision Language Models are confused tourists
A novel cultural adversarial robustness suite designed to assess VLMs’ stability against perturbed geographical cues.
CVPR
Routing with generated data
A setting in which routers are trained on generated queries and answers produced from high-level task descriptions.
ACL