Macaron: Controlled, human-written benchmark
A template-first benchmark that factorizes reasoning type and cultural aspect across question languages. (ACL)
Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates (7 reasoning types, 22 cultural aspects), native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False verification statements. Macaron contains 11862 instances spanning 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects), 20 cultural contexts, and 10 scripts. We evaluate 21 multilingual LLMs in a zero-shot setting: reasoning-mode models perform best and show near-parity between English and local languages, whereas open-weight models degrade more in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest.
Latest publications
Routing with generated data
A setting in which routers are trained on generated queries and answers produced from high-level task descriptions. (ACL)
CommonLID: Re-evaluating language identification performance
A community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. (ACL)
Training dynamics underlying language model scaling laws: loss deceleration and zero-sum learning
Loss deceleration and zero-sum learning (ZSL) provide new insights into the training dynamics underlying language model scaling laws. (ACL)