Macaron: Controlled, human-written benchmark for multilingual and multicultural reasoning via template-filling

A template-first benchmark that factorizes reasoning type and cultural aspect across question languages.
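
One way to picture the template-first design is a record that carries a reasoning type and a cultural aspect as separate labels, with one template string per question language. The sketch below is illustrative only: the class name `QuestionTemplate`, the field names, the label values, and the example question are all hypothetical and are not taken from Macaron itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QuestionTemplate:
    """Hypothetical template record; field names are assumptions, not Macaron's schema."""
    reasoning_type: str               # e.g. "comparison" (made-up label)
    cultural_aspect: str              # e.g. "food" (made-up label)
    text_by_language: Dict[str, str]  # language code -> template with {slot} placeholders

    def fill(self, language: str, slots: Dict[str, str]) -> Dict[str, str]:
        """Fill the template for one language and keep the factor labels attached."""
        question = self.text_by_language[language].format(**slots)
        return {
            "language": language,
            "reasoning_type": self.reasoning_type,
            "cultural_aspect": self.cultural_aspect,
            "question": question,
        }


# Made-up example data: one template, two question languages.
template = QuestionTemplate(
    reasoning_type="comparison",
    cultural_aspect="food",
    text_by_language={
        "en": "In {place}, which is eaten first: {a} or {b}?",
        "fr": "En {place}, que mange-t-on en premier : {a} ou {b} ?",
    },
)

# Slot values are localized per language, so each language gets its own fillers.
slots_by_language = {
    "en": {"place": "France", "a": "cheese", "b": "dessert"},
    "fr": {"place": "France", "a": "le fromage", "b": "le dessert"},
}

items: List[Dict[str, str]] = [
    template.fill(lang, slots_by_language[lang]) for lang in template.text_by_language
]
```

Because every filled item keeps its `reasoning_type` and `cultural_aspect` labels, results could in principle be grouped along either factor or by language independently, which is the point of factorizing them.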

