RAFFLES: reasoning-based attribution of faults for LLM systems
An evaluation architecture that incorporates reasoning and iterative refinement.
Current evaluation strategies for multi-component systems are predominantly one-dimensional, assessing only end-to-end performance, and static, ignoring contextual state and changing environments. To evaluate the consistency, stability, and performance degradation of multi-turn systems, it is crucial to pinpoint where and why failures occur. To address this, we propose a novel, automated framework for fine-grained fault attribution: identifying not only which component fails but also the specific failure mode at each step. Our system evaluates multi-step pipelines, such as those in Retrieval-Augmented Generation and tool-using agents, and we report state-of-the-art performance in identifying step-level failures. We provide a scalable alternative to labor-intensive manual analysis and establish a new framework for multi-turn evaluation as long-horizon, autonomous tasks become more prevalent.
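As a rough illustration of what step-level fault attribution could produce, the sketch below walks a pipeline trace and records, for each flagged step, the component and a failure-mode label. This is not the RAFFLES implementation: the names `Step`, `Fault`, `attribute_faults`, and `toy_judge` are hypothetical, and the toy judge is a trivial stand-in for whatever reasoning-based evaluator a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    index: int
    component: str  # hypothetical labels, e.g. "retriever", "generator", "tool_call"
    output: str


@dataclass
class Fault:
    step_index: int
    component: str
    failure_mode: str


# A judge sees the earlier steps plus the current step and returns a
# failure-mode label, or None if the step looks acceptable.
Judge = Callable[[List[Step], Step], Optional[str]]


def attribute_faults(trace: List[Step], judge: Judge) -> List[Fault]:
    """Collect one Fault per step that the judge flags."""
    faults: List[Fault] = []
    for i, step in enumerate(trace):
        mode = judge(trace[:i], step)
        if mode is not None:
            faults.append(Fault(step.index, step.component, mode))
    return faults


def toy_judge(history: List[Step], step: Step) -> Optional[str]:
    # Stand-in for a reasoning-based judge: flags empty outputs only.
    return "empty_output" if not step.output.strip() else None


trace = [
    Step(0, "retriever", "doc A"),
    Step(1, "generator", ""),
    Step(2, "tool_call", "42"),
]
faults = attribute_faults(trace, toy_judge)
```

Here `faults` holds a single entry pointing at the generator step, which is the kind of "which component, which failure mode, at which step" output the abstract describes, as opposed to a single end-to-end score.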
Latest publications
DF-RAG: Enhancing RAG for question answering by balancing relevance and diversity of retrieved chunks
A pipeline that dynamically adapts the level of diversity for each query at test time without requiring prior information.
EACL
ART: Adaptive Reasoning Trees for explainable claim verification
A hierarchical method for claim verification in Large Language Models.
EACL
Deconstructing instruction-following: A new benchmark for granular analysis of Large Language Model instruction compliance abilities
A modular framework that uses a dynamically generated dataset to evaluate the instruction-following capabilities of Large Language Models.
EACL