Macaron | Capital One Tech

Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Starting from 100 language-agnostic templates (covering 7 reasoning types and 22 cultural aspects), native annotators create scenario-aligned multiple-choice questions in English and in the local language, along with systematically derived True/False verification statements. Macaron contains 11,862 instances spanning 20 languages (including low-resource languages such as Amharic, Yoruba, Zulu, and Kyrgyz, as well as some Arabic dialects), 20 cultural contexts, and 10 scripts. We evaluate 21 multilingual LLMs in a zero-shot setting: reasoning-mode models perform best and show near-parity between English and local languages, whereas open-weight models degrade more in local languages and often approach chance on True/False tasks. Culture-grounded mathematical and counting templates are consistently the hardest.
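The comparison described above (English versus local-language accuracy, and True/False accuracy relative to chance) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record fields (`format`, `question_language`, `gold`, `prediction`), the helper names, and the toy records are all hypothetical and chosen only for illustration. The one fixed quantity is the 0.5 chance level for a binary True/False task.

```python
from collections import defaultdict

# Chance accuracy for a binary True/False task.
TF_CHANCE = 0.5


def accuracy_by(records, *keys):
    """Return {group: accuracy}, where group is a tuple of the requested fields."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        group = tuple(r[k] for k in keys)
        totals[group] += 1
        hits[group] += int(r["prediction"] == r["gold"])
    return {g: hits[g] / totals[g] for g in totals}


def language_gap(records):
    """English-minus-local accuracy for each task format present in both languages."""
    acc = accuracy_by(records, "format", "question_language")
    formats = {fmt for fmt, _ in acc}
    return {
        fmt: acc[(fmt, "english")] - acc[(fmt, "local")]
        for fmt in formats
        if (fmt, "english") in acc and (fmt, "local") in acc
    }


def near_chance(accuracy, chance=TF_CHANCE, tolerance=0.05):
    """True if accuracy is within `tolerance` of chance (tolerance is an arbitrary choice)."""
    return abs(accuracy - chance) <= tolerance


# Hypothetical toy records, not Macaron data.
records = [
    {"format": "tf", "question_language": "english", "gold": "True", "prediction": "True"},
    {"format": "tf", "question_language": "english", "gold": "False", "prediction": "False"},
    {"format": "tf", "question_language": "local", "gold": "True", "prediction": "True"},
    {"format": "tf", "question_language": "local", "gold": "False", "prediction": "True"},
]

gaps = language_gap(records)
local_tf = accuracy_by(records, "format", "question_language")[("tf", "local")]
print(gaps, near_chance(local_tf))
```

Because every template is instantiated in both English and the local language, grouping by an additional field such as a reasoning-type label (again a hypothetical field name) would give the per-reasoning-type breakdown without changing the helpers.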

Macaron: Controlled, human-written benchmark
