Re-evaluating evaluation for multilingual summarization
Standard metrics fail in non-English summarization, prompting a need for more nuanced evaluation frameworks.
Automatic evaluation approaches (ROUGE, BERTScore, LLM-based evaluators) have been widely used to evaluate summarization. Despite the complications introduced by script differences and tokenization, these approaches have been applied indiscriminately to summarization across many languages. While previous work has argued that these approaches correlate strongly with human ratings in English, it remains unclear whether that conclusion holds for other languages. To answer this question, we construct a small-scale pilot dataset containing article-summary pairs and human ratings in English, Chinese, and Indonesian. To measure the relative strength of summaries, we collect human ratings as head-to-head comparisons and derive Elo scores from them across four dimensions. Our analysis reveals that standard metrics are unreliable measures of quality and that these problems are exacerbated in Chinese and Indonesian. We advocate for more nuanced and careful consideration in designing a robust evaluation framework for multiple languages.
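To make the rating scheme concrete, the following is a minimal sketch of how head-to-head comparisons can be turned into Elo scores. It uses the standard Elo update rule; the K-factor of 32, the initial rating of 1000, the function names, and the toy comparison data are all assumptions made for illustration and are not taken from the paper, which may use a different procedure.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_comparisons(comparisons, k=32.0, initial=1000.0):
    """Compute Elo ratings from an ordered list of (winner, loser) pairs.

    k and initial are illustrative defaults (assumptions), not values
    reported in the paper.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        # Probability the winner was expected to win before this comparison.
        e_w = expected_score(ratings[winner], ratings[loser])
        # The winner gains exactly what the loser gives up.
        delta = k * (1.0 - e_w)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical judgments for one rating dimension:
# each pair means "the first summary was preferred over the second".
comparisons = [("A", "B"), ("A", "C"), ("B", "C")]
ratings = elo_from_comparisons(comparisons)
```

In a setup like the one described above, such a routine would be run once per rating dimension and per language. Note that sequential Elo updates depend on the order of the comparisons, so in practice one might shuffle the comparisons and average over several passes.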