Macaron: Controlled, human-written benchmark for multilingual and multicultural reasoning via template-filling

A template-first benchmark that factorizes reasoning type and cultural aspect across question languages.

ACL

July 2, 2026

Topics:

Benchmark, Foundation Models, Large Language Models (LLMs), Reasoning & Chain-of-thought (CoT), Account Takeover

Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates (7 reasoning types, 22 cultural aspects), native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False verification statements. Macaron contains 11,862 instances spanning 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects), 20 cultural contexts, and 10 scripts. We evaluate 21 multilingual LLMs in a zero-shot setting: reasoning-mode models perform best and show near-parity between English and local languages, whereas open-weight models degrade more in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest.
