Confidence-based response abstinence: LLM trustworthiness
A method for confidence estimation in RAG systems that aligns closely with the correctness of LLM outputs.
We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial-industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on the Llama 3.1 8B Instruct model show that using activations from only 16 layers preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.
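To make the two mechanisms in the abstract concrete, the following is a minimal sketch of a Huber-regularized classification loss and a confidence-gated abstention rule. The abstract does not specify how the Huber term is combined with the classification objective, so the additive form, the weight `lam`, the transition point `delta`, and the abstention `threshold` are all illustrative assumptions, as are the function names. The confidence score is taken as given here; producing it from FFN activations is outside the scope of this sketch.

```python
import math
from typing import Optional


def huber(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic for |residual| <= delta, linear beyond."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def binary_cross_entropy(p: float, y: int, eps: float = 1e-7) -> float:
    """Cross-entropy between predicted confidence p and correctness label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def training_loss(p: float, y: int, lam: float = 0.5, delta: float = 0.5) -> float:
    """Assumed combination: classification loss plus a weighted Huber term.

    The Huber term grows only linearly for large errors, which limits the
    influence of mislabeled (noisy) examples on the total loss.
    """
    return binary_cross_entropy(p, y) + lam * huber(p - y, delta)


def respond_or_abstain(answer: str, confidence: float,
                       threshold: float = 0.8) -> Optional[str]:
    """Return the answer only if confidence clears the (hypothetical) threshold."""
    return answer if confidence >= threshold else None
```

In a high-stakes deployment, the threshold would be tuned on held-out data to trade answer coverage against the cost of an incorrect answer; a return value of `None` stands for declining to answer.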
Latest publications
GRAID: Synthetic data generation with geometric constraints and multi-agentic reflection for harmful content detection
A novel pipeline that leverages Large Language Models (LLMs) for dataset augmentation.
EMNLP
TruthTorchLM: a comprehensive package for predicting truthfulness in LLM outputs
An open-source, comprehensive Python library featuring over 30 truthfulness prediction methods.
EMNLP
Language surgery in multilingual Large Language Models
A novel method that leverages latent injection to enable cross-lingual language control and mitigate language confusion.
EMNLP