RAFFLES: reasoning-based attribution of faults for LLM systems

An evaluation architecture that incorporates reasoning and iterative refinement.

EACL

March 24, 2026

Current evaluation strategies for multi-component systems are predominantly one-dimensional (assessed only by end-to-end performance) and static (taking no account of contextual state or changing environments). Evaluating the consistency, stability, and performance degradation of multi-turn systems requires pinpointing why and where failures occur. To address this, we propose a novel, automated framework for fine-grained fault attribution: identifying not only which component fails but also the specific failure mode at each step. Our system evaluates multi-step pipelines, such as those in Retrieval-Augmented Generation and tool-using agents, and reports state-of-the-art performance in identifying step-level failures. We provide a scalable alternative to labor-intensive manual analysis and establish a new framework for multi-turn evaluation as long-horizon, autonomous tasks become more prevalent.
