Improving consistency in retrieval-augmented systems with group similarity reward
An RL approach that leverages multiple rollouts across a set of paraphrased queries to assign group similarity rewards.
Retrieval-Augmented Generation (RAG) systems enhance factual accuracy by grounding language model outputs in relevant documents retrieved from external corpora. However, they often produce inconsistent outputs for semantically equivalent or paraphrased inputs, which undermines user trust and reliability, particularly in high-stakes applications. These inconsistencies stem from two primary sources: (1) variation in the retriever, which may return different document sets for similar queries, and (2) stochasticity in the generator, which can produce divergent outputs even when given identical retrieved documents. Despite its importance, output consistency in RAG systems remains underexplored. In this work, we present a systematic framework for measuring and improving RAG output consistency. We introduce a rigorous evaluation protocol that quantifies both retriever-level consistency (via document set overlap) and generator-level consistency (via output similarity across paraphrased queries), using metrics such as lexical agreement and LLM-based judgments. To improve consistency, we propose a reinforcement learning approach that leverages the Group Relative Policy Optimization (GRPO) algorithm. Specifically, we use GRPO's multiple rollouts per query to compute group similarity rewards that capture consistency across paraphrased queries. Empirical results on multiple QA datasets demonstrate that our method significantly improves output consistency without compromising factual accuracy, offering a scalable and effective solution to a critical reliability challenge in RAG systems.
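As a rough illustration of the two quantities described above, the sketch below computes a retriever-level consistency score as mean pairwise document set overlap and a group similarity reward as each rollout's mean similarity to the other rollouts in its group. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, Jaccard overlap stands in for whichever overlap metric the authors use, and a simple token-overlap score stands in for the lexical agreement and LLM-based similarity measures mentioned in the abstract.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def retriever_consistency(doc_sets):
    """Mean pairwise Jaccard overlap of retrieved document IDs.

    doc_sets holds one list of document IDs per paraphrase of a query.
    Assumption: Jaccard is an illustrative choice of overlap metric.
    """
    pairs = list(combinations(doc_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def token_similarity(x, y):
    """Assumed stand-in for lexical agreement: overlap of lowercased tokens."""
    return jaccard(x.lower().split(), y.lower().split())


def group_similarity_rewards(rollouts, similarity=token_similarity):
    """Reward each rollout by its mean similarity to the rest of its group.

    rollouts holds generated answers pooled across the paraphrases of one
    query. A rollout that agrees with the group scores high; an outlier
    scores low.
    """
    n = len(rollouts)
    if n < 2:
        return [0.0] * n
    rewards = []
    for i, out in enumerate(rollouts):
        others = [similarity(out, rollouts[j]) for j in range(n) if j != i]
        rewards.append(sum(others) / (n - 1))
    return rewards


if __name__ == "__main__":
    docs = [["d1", "d2", "d3"], ["d2", "d3", "d4"]]
    print(retriever_consistency(docs))
    print(group_similarity_rewards(["Paris", "paris", "Lyon"]))
```

In an RL setup, rewards of this kind would be passed to the policy update alongside any accuracy reward; how the two are weighted is not specified here and is left out of the sketch.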
Latest publications
ViCrit: a verifiable reinforcement learning proxy task for visual perception in VLMs
An RL proxy task that trains VLMs to localize synthetic hallucinations injected into human-written captions.
NeurIPS
BEDTime: A unified benchmark for automatically describing time series
The first benchmark dataset to assess models on each task, comprising four datasets reformatted for these tasks.
NeurIPS
AI progress should be measured by capability-per-resource
A theoretical framework demonstrating that decisions guided by gradient influence patterns can improve efficiency.
NeurIPS