Macaron: Controlled, human-written benchmark for multilingual and multicultural reasoning via template-filling
A template-first benchmark that factorizes reasoning type and cultural aspect across question languages.
Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates (7 reasoning types, 22 cultural aspects), native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False verification statements. Macaron contains 11,862 instances spanning 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects), 20 cultural contexts, and 10 scripts. We evaluate 21 multilingual LLMs in a zero-shot setting: reasoning-mode models perform best and show near-parity between English and local languages, whereas open-weight models degrade more in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest.
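To make the template-filling idea concrete, the following is a minimal sketch of how a slot-based template might be filled into a multiple-choice question and then expanded into True/False verification statements. It is an illustration under stated assumptions, not Macaron's actual pipeline: the template text, the slot names, the `MultipleChoiceQuestion` class, and the one-statement-per-option derivation rule are all hypothetical, since the abstract says only that the statements are "systematically derived".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceQuestion:
    """A filled template with its answer options (hypothetical structure)."""
    question: str
    options: tuple[str, ...]
    answer_index: int


def fill_template(template: str, slots: dict[str, str]) -> str:
    """Substitute annotator-provided slot values into a template string."""
    return template.format(**slots)


def derive_true_false(mcq: MultipleChoiceQuestion) -> list[tuple[str, bool]]:
    """Assumed derivation rule: one statement per option, labeled True
    only for the correct option and False for every distractor."""
    statements = []
    for i, option in enumerate(mcq.options):
        text = f"Question: {mcq.question} Proposed answer: {option}"
        statements.append((text, i == mcq.answer_index))
    return statements


# Hypothetical counting-style template; the slot values are placeholders
# that a native annotator would replace with culturally grounded content.
TEMPLATE = (
    "A family prepares {count} {item} for {occasion} and shares them "
    "equally among {people} guests. How many does each guest receive?"
)
SLOTS = {"count": "12", "item": "pastries", "occasion": "a local festival", "people": "4"}

question = fill_template(TEMPLATE, SLOTS)
mcq = MultipleChoiceQuestion(question, options=("2", "3", "4", "6"), answer_index=1)
statements = derive_true_false(mcq)

for text, label in statements:
    print(label, "|", text)
```

Under this assumed rule, a four-option question yields one True and three False statements, so an evaluation built this way would need to account for label imbalance when comparing T/F accuracy against chance.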
Latest publications
Routing with generated data: Annotation-free LLM skill estimation and expert selection
A setting in which routers are trained on generated queries and answers produced from high-level task descriptions.
ACL
CommonLID: Re-evaluating state-of-the-art language identification performance on web data
A community-driven, human-annotated LID benchmark for the web domain, covering 109 languages.
ACL
Temporal tokenization strategies for event sequence modeling with Large Language Models
A study of temporal tokenization for modeling event sequences with LLMs, comparing distinct encoding strategies.
ACL