LANCER: language-invariant retrieval
A multi-task learning framework reduces language-specific signals for improved multilingual dense retrieval.
Multilingual models aim for language-invariant representations but still prominently encode language identity. This, along with the scarcity of high-quality parallel retrieval data, limits their retrieval performance. We introduce LANCER, a multi-task learning framework that improves language-invariant dense retrieval by reducing language-specific signals in the embedding space. Leveraging the notion of linear concept erasure, we design a loss function that penalizes cross-correlation between representations and their language labels. LANCER uses only English retrieval data and general multilingual corpora, training models to retrieve by semantic similarity in a language-invariant way without requiring a vast parallel corpus. Experimental results on various datasets show that our method consistently improves over baselines, and extensive analyses demonstrate greater language agnosticism.
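To make the idea of penalizing cross-correlation with language labels concrete, here is a minimal NumPy sketch. It is not the LANCER loss itself: the function name `cross_correlation_penalty`, the one-hot label encoding, and the squared Frobenius norm are assumptions chosen for illustration. The sketch computes the cross-covariance between a batch of embeddings and their language labels and returns its squared norm, which is zero when every language has the same mean embedding.

```python
import numpy as np


def cross_correlation_penalty(embeddings, language_ids, num_languages):
    """Squared Frobenius norm of the embedding/language-label cross-covariance.

    Hypothetical helper for illustration only; not the paper's exact loss.
    embeddings:   array of shape (batch, dim)
    language_ids: integer array of shape (batch,), values in [0, num_languages)
    """
    x = np.asarray(embeddings, dtype=np.float64)
    # One-hot encode the language labels (an assumed encoding).
    z = np.eye(num_languages)[np.asarray(language_ids)]
    # Center both views so the product below is a covariance, not a raw moment.
    x_centered = x - x.mean(axis=0, keepdims=True)
    z_centered = z - z.mean(axis=0, keepdims=True)
    # Cross-covariance matrix of shape (dim, num_languages).
    cross_cov = x_centered.T @ z_centered / x.shape[0]
    return float(np.sum(cross_cov ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ids = np.array([0, 1] * 50)
    batch = rng.normal(size=(100, 8))
    # Inject a language-specific offset into one embedding dimension.
    batch[ids == 1, 0] += 3.0
    print(cross_correlation_penalty(batch, ids, num_languages=2))
```

In a multi-task setup, a term like this would be added to the retrieval loss with a weighting coefficient, so the encoder is pushed toward embeddings whose first-order statistics do not reveal the language.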
Latest publications
Language surgery in multilingual Large Language Models
A novel method that leverages latent injection to enable cross-lingual language control and mitigate language confusion.
EMNLP
GRAID: Synthetic data generation with geometric constraints and multi-agentic reflection for harmful content detection
A novel pipeline that leverages Large Language Models (LLMs) for dataset augmentation.
EMNLP
Readability reconsidered: A cross-dataset analysis of reference-free metric
An investigation of factors shaping human perceptions of text readability and comprehensibility.
EMNLP