Macaron: Controlled, human-written benchmark

A template-first benchmark that factorizes reasoning type and cultural aspect across question languages. (ACL)


Latest publications