LLM reasoning and agentic safety at ICML 2026

Explore our latest research in critique-guided distillation and multi-turn agent uncertainty in Seoul.

July 1, 2026

Capital One technologists are excited to participate in the 43rd International Conference on Machine Learning (ICML) taking place at the COEX Convention & Exhibition Center in Seoul, South Korea, July 6-11, 2026. As a premier global venue for machine learning research, ICML provides an essential forum for exploring foundational advancements, algorithmic innovations, and cutting-edge deep learning systems.

Capital One is excited to share advancements in large language model (LLM) scaling efficiencies, multi-turn tool-using agent safety, and the development of robust, trustworthy AI frameworks. This work delivers the underlying engineering and algorithmic improvements crucial for deploying the next generation of safe financial technologies.

Main conference research: Robust reasoning and agentic risk

The following research, accepted to the ICML Main Conference, pushes the boundaries of how models self-correct, how trajectory-level risks can be proactively flagged, and how multi-turn agent interactions maintain reliable execution. This section features work led by Capital One researchers alongside deep collaborations with academic partners.

Critique-Guided Distillation for Robust Reasoning via Refinement
Capital One Authors: Berkcan Kapusuzoglu, Supriyo Chakraborty, Michael Lee, Sambit Sahu

Supervised fine-tuning with expert demonstrations often produces models that imitate outputs without internalizing the reasoning processes needed for robust generalization. While critique-based approaches show promise, training models to generate critiques directly, such as Critique Fine-Tuning (CFT), can lead to output-format drift and degradation of general capabilities. We propose Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from critique generation. During fine-tuning, the student is trained to refine flawed responses conditioned on teacher critiques. CGD treats critiques as a supervision signal, encouraging internalization of error-aware reasoning: critiques guide learning but are absent at inference. Across five model families, CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements and gains of up to +15.0% on AMC23 and +12.2% on MATH-500. On challenging competition problems such as AIME24 and AIME25, CGD achieves substantially higher Pass@1 and stronger performance at low Pass@k, indicating improved reasoning quality per sample. Importantly, CGD preserves general instruction-following capabilities where CFT degrades significantly (21.3% on IFEval). These results position CGD as a practical and compute-efficient intermediate training paradigm for reasoning-centric tasks without introducing inference-time overhead.
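To make the decoupling concrete, the sketch below shows one way a CGD-style training pair could be assembled. The `CritiqueExample` class, its field names, and the prompt template are illustrative assumptions, not the format used in the paper. The only properties taken from the abstract are that the critique appears in the training input, the refined answer is the target, and no critique is present at inference.

```python
from dataclasses import dataclass


@dataclass
class CritiqueExample:
    """Hypothetical container for one training record (all names are assumptions)."""
    prompt: str
    flawed_response: str
    teacher_critique: str
    refined_response: str


def build_training_pair(example: CritiqueExample) -> tuple[str, str]:
    """Return (model_input, target). The critique is consumed as input, never generated."""
    model_input = (
        f"Problem:\n{example.prompt}\n\n"
        f"Draft answer:\n{example.flawed_response}\n\n"
        f"Critique:\n{example.teacher_critique}\n\n"
        "Refined answer:\n"
    )
    return model_input, example.refined_response


def build_inference_input(prompt: str) -> str:
    """At inference only the problem is supplied, so no critique tokens are needed."""
    return f"Problem:\n{prompt}\n\nRefined answer:\n"
```

Because the student never learns to emit critiques, its output format at inference stays that of an ordinary answer, which is one plausible reading of why the abstract reports no inference-time overhead.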

EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors
Capital One Authors: Spencer Hong, Erin Babinsky, Alfy Samuel, Anoop Kumar

Access to high-quality data is often the key bottleneck in modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic text offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. This collaboration with the University of Southern California introduces EPSVec, a lightweight alternative that steers LLM generation using dataset vectors—directions in activation space that capture the distributional gap between private data and public priors. Unlike other inference-time aggregation baselines that incur high privacy costs per token, EPSVec extracts and sanitizes steering vectors just once. This decouples the privacy budget from generation, enabling unlimited synthetic sampling and high utility even in low-data regimes.
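As a rough illustration of the idea, the sketch below computes a steering direction as the gap between mean private and mean public activations, sanitizes it once, and reuses it for any number of generations. The function names, the per-example clipping, and the Gaussian noise step are assumptions chosen for illustration; the abstract does not specify the sanitization mechanism, and `noise_std` here is a placeholder rather than a value calibrated to a privacy budget.

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale a vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def mean_vector(rows):
    """Coordinate-wise mean of equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dataset_vector(private_acts, public_acts, clip_norm=1.0, noise_std=0.0, seed=0):
    """Direction from public to private activations, sanitized a single time.

    Assumption: clipping bounds each private example's influence and Gaussian
    noise is added to the private mean. Real calibration of noise_std to a
    privacy budget is out of scope for this sketch.
    """
    rng = random.Random(seed)
    clipped = [clip_to_norm(a, clip_norm) for a in private_acts]
    noisy_private_mean = [m + rng.gauss(0.0, noise_std) for m in mean_vector(clipped)]
    public_mean = mean_vector(public_acts)
    return [p - q for p, q in zip(noisy_private_mean, public_mean)]


def steer(hidden, vector, strength=1.0):
    """Shift one hidden state along the dataset vector; touches no private data."""
    return [h + strength * v for h, v in zip(hidden, vector)]
```

Only `dataset_vector` reads private activations, and it runs once. Every later call to `steer` reuses the already sanitized vector, which is the sense in which the privacy cost is decoupled from the number of generated samples.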

BEDTime: A Unified Benchmark for Automatically Describing Time Series
Capital One Authors: Nam Nguyen, Bayan Bruss

Recent works propose complex multi-modal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question-answering. However, they skip evaluations of simple and important foundational tasks, which complex models should reliably master. This collaboration with the University of Virginia proposes BEDTime, the first benchmark dataset to assess models on three new tasks: recognizing, differentiating, and generating language descriptions of time series. Using BEDTime, we evaluate 13 state-of-the-art models and find that dedicated time series foundation models severely underperform, vision-language models are quite capable, and language-only methods perform worst, indicating crucial avenues for future robust modeling work.

TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning
Capital One Authors: Ranganath Krishnan

As AI agents take on increasingly complex multi-turn interactions—navigating tools, tracking shifting user goals, and coordinating in real time—estimating reliable uncertainty is becoming crucial. Failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user-agent miscoordination), even when the local generation appears confident. This research, in collaboration with the University of Illinois Chicago, tackles the challenge of catching these AI agent failures early, before they spiral. We introduce TRACER, a principled, trajectory-level uncertainty metric built for multi-turn, tool-using conversational agents. TRACER combines content-aware surprisal with situational-awareness signals, semantic and lexical repetition, and tool-grounded coherence gaps. It then aggregates them through a tail-focused risk functional with failure-bound guarantees. Across various benchmarks including Tau-2-bench, ToolHop, and ComplexFuncBench, TRACER improves failure-detection AUROC by up to 37% and selective-execution accuracy by up to 55% over the strongest existing baselines. Crucially, it flags the majority of failures within the first 20% of a trajectory—early enough for graceful handoff, recovery, or abstention in complex conversational tool-use settings.
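The difference between averaging and tail-focused aggregation can be sketched in a few lines. The example below uses the mean of the worst fraction of per-step scores, in the style of conditional value at risk, as a stand-in for the paper's risk functional, which the abstract does not spell out. The function names, the `tail_fraction` default, and the threshold are hypothetical.

```python
import math


def tail_risk(step_scores, tail_fraction=0.2):
    """Mean of the highest-risk fraction of steps (a CVaR-style aggregate)."""
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    k = max(1, math.ceil(tail_fraction * len(step_scores)))
    worst = sorted(step_scores, reverse=True)[:k]
    return sum(worst) / k


def first_flag_step(step_scores, threshold, tail_fraction=0.2):
    """Index of the first step at which the running tail risk crosses the threshold."""
    for end in range(1, len(step_scores) + 1):
        if tail_risk(step_scores[:end], tail_fraction) >= threshold:
            return end - 1
    return None


# One sparse critical episode among otherwise confident steps.
scores = [0.1, 0.1, 0.1, 0.1, 0.9]
average = sum(scores) / len(scores)   # the spike is diluted by the calm steps
tail = tail_risk(scores)              # the spike dominates the aggregate
```

A plain average lets a single looping or incoherent step disappear among confident ones, while a tail aggregate keeps it visible, which matches the abstract's point that failures stem from sparse critical episodes.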

Workshop tracks: Alignment, evaluation, and optimization

Our workshop participation addresses critical nuances in AI safety alignment, model distillation efficiencies, and scalable evaluation metrics.

What Do Safety-Aligned LLMs Learn From Context Demonstrations? A Hypothesis-Testing Study of Mixed Many-Shot Contexts
Capital One Authors: Sihui Dai, Mann Patel
Track: Hypothesis Testing Workshop

Prior work has shown that many-shot demonstrations can jailbreak aligned language models, but it remains unclear how safety-aligned LLMs interpret demonstrations of compliance. We answer this question by mixing benign compliance demonstrations (unharmful request, non-refusal response) with harmful compliance demonstrations (harmful request, non-refusal response) and testing three competing hypotheses about how many-shot demonstrations drive harmful compliance. Across multiple frontier models, we consistently find that models do not treat all compliant demonstrations as interchangeable; for example, in instruction-tuned models, examples of benign compliance can actually increase the refusal rate on harmful examples. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

Ask, Don’t Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement
Capital One Authors: Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu, Shixiong Zhang, Sambit Sahu
Track: Second Workshop on Compositional Learning

Holistic LLM judges often produce opaque scores that are hard to debug. We propose BinEval, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. BinEval matches or outperforms strong baselines on factual consistency benchmarks (like QAGS) and supports iterative prompt optimization, avoiding the ceiling effects common in prior LLM judges.
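The aggregation step can be illustrated with a small sketch. It assumes each verdict is a (dimension, question, yes/no) triple and that scores are simple averages; the function name, the tuple layout, and the averaging rule are assumptions for illustration, since the abstract does not state how BinEval combines verdicts.

```python
from collections import defaultdict


def aggregate_binary_verdicts(verdicts):
    """Turn (dimension, question, answer_is_yes) triples into per-dimension scores.

    Returns (dimension_scores, overall), where each score is the fraction of
    questions answered "yes" and overall is the unweighted mean across dimensions.
    """
    by_dimension = defaultdict(list)
    for dimension, _question, answer_is_yes in verdicts:
        by_dimension[dimension].append(1.0 if answer_is_yes else 0.0)
    if not by_dimension:
        raise ValueError("at least one verdict is required")
    dimension_scores = {d: sum(v) / len(v) for d, v in by_dimension.items()}
    overall = sum(dimension_scores.values()) / len(dimension_scores)
    return dimension_scores, overall


verdicts = [
    ("faithfulness", "Does every claim appear in the source?", True),
    ("faithfulness", "Are all numbers copied correctly?", False),
    ("coverage", "Is the main finding mentioned?", True),
]
scores, overall = aggregate_binary_verdicts(verdicts)
```

Because every score traces back to named yes/no questions, a low `faithfulness` value can be debugged by reading the failed questions rather than by guessing at an opaque judge score.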

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse
Capital One Authors: Amritansh Mishra, Supriyo Chakraborty, Berkcan Kapusuzoglu
Track: ICML Workshop on Foundations of Deep Generative Models

Group Relative Policy Optimization (GRPO) eliminates the learned critic in PPO by using the mean reward of grouped rollouts as a baseline. We provide a rigorous derivation of GRPO from first principles of the policy gradient theorem, revealing a fundamental credit assignment failure: under output-only reward, every token in a rollout receives identical advantage, collapsing token-level credit to a single scalar. We prove this induces gradient sparsity that intensifies over training, formalizing an intrinsic rank-2 structure that characterizes when GRPO's simplicity is theoretically justified and identifying the credit assignment bottleneck as the key limitation for multi-step reasoning.
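The credit assignment issue described above is easy to see in a minimal sketch. The code below subtracts the group mean reward as the baseline and then broadcasts each rollout's advantage to all of its tokens. The function names are hypothetical, and the optional standard-deviation scaling reflects a common GRPO variant rather than anything stated in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=False, eps=1e-8):
    """Advantage of each rollout relative to the mean reward of its group."""
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        # Optional scaling by the group standard deviation (an assumption here).
        scale = pstdev(rewards) + eps
        advantages = [a / scale for a in advantages]
    return advantages


def token_level_advantages(rewards, token_counts, normalize=False):
    """With an output-only reward, every token inherits its rollout's scalar."""
    advantages = group_relative_advantages(rewards, normalize)
    return [[a] * n for a, n in zip(advantages, token_counts)]


per_token = token_level_advantages([1.0, 0.0], token_counts=[3, 2])
uniform_group = group_relative_advantages([1.0, 1.0, 1.0])
```

Here `per_token` holds three copies of 0.5 and two copies of -0.5, so no token is credited more than its neighbors. `uniform_group` is all zeros: a group whose rollouts earn identical rewards contributes no learning signal, which is one simple way gradients can become sparse.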

SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation
Capital One Authors: Chenyang Zhu, Jiayu Yao, Kushal Chawla, Youbing Yin, Nathan Wolfe, Pengshan Cai, Jingyu Wu, Sangwoo Cho, Shi-Xiong Zhang, Daben Liu, Sambit Sahu, Erin Babinsky
Track: Second Workshop on Agents in the Wild: Safety, Security, and Beyond

Current methods for diagnosing agent failures rely on loading the full agentic trajectory into an LLM's context window, an approach that suffers from attention dilution and fails when trajectories outgrow architectural limits. To address this, we introduce SAFARI, a framework that replaces full context loading with a tool-augmented diagnostic loop. By equipping LLMs with a specialized toolbox to read and search trajectory segments alongside a persistent Short-Term Memory (STM) for cross-turn reasoning, SAFARI maintains a precision of 0.58 even when the target fault resides 5x beyond the model’s native context window, a scenario where traditional evaluators fail entirely.
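A toy version of such a diagnostic loop is sketched below. The `TrajectoryToolbox` class, its `search` and `read` methods, and the memory record layout are all hypothetical, and a fixed keyword list stands in for the LLM that would choose tool calls in a real system. The point is only that the investigator inspects small segments and keeps notes, instead of loading the whole trajectory at once.

```python
class TrajectoryToolbox:
    """Hypothetical read/search tools over a trajectory stored outside the context."""

    def __init__(self, steps):
        self.steps = list(steps)

    def search(self, keyword):
        """Indices of steps whose text contains the keyword (case-insensitive)."""
        needle = keyword.lower()
        return [i for i, step in enumerate(self.steps) if needle in step.lower()]

    def read(self, start, end):
        """Return a bounded slice of steps, clamped at the trajectory start."""
        return self.steps[max(0, start):end]


def investigate(toolbox, keywords, window=1, max_calls=10):
    """Search, read the surrounding context, and persist findings in memory."""
    memory = []  # short-term memory carried across tool calls
    calls = 0
    for keyword in keywords:
        if calls >= max_calls:
            break
        hits = toolbox.search(keyword)
        calls += 1
        for index in hits:
            if calls >= max_calls:
                break
            context = toolbox.read(index - window, index + window + 1)
            calls += 1
            memory.append({"step": index, "keyword": keyword, "context": context})
    return memory


steps = [
    "user asks for refund",
    "agent calls lookup_order",
    "tool error: order not found",
    "agent issues refund anyway",
]
findings = investigate(TrajectoryToolbox(steps), keywords=["error"])
```

Each tool call returns at most a few steps, so the amount of text examined at any moment is bounded by `window` rather than by the trajectory length.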

SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision
Capital One Authors: Chia-Hsuan Lee, Zelei Cheng, Yu Wang, Renkun Ni, Sambit Sahu, Shixiong Zhang, William Campbell
Track: Third AI for Math Workshop

In on-policy distillation (OPD), teacher supervision quality depends heavily on student competence, yet existing methods supervise uniformly, creating waste across tokens, training phases, and prompts. We introduce SEAD, which uses entropy as a unified probe of competence-dependent degradation. SEAD uses joint teacher-student entropy to dynamically partition tokens (skipping ~50% of redundant/noisy gradients), applies a cosine schedule to anneal divergences as competence grows, and introduces an easy-to-hard competence-gated curriculum. Across six math benchmarks, SEAD achieves +4.8 average accuracy over vanilla OPD.
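Two of the ingredients named above, entropy as a competence probe and a cosine schedule, can be sketched as follows. The thresholds `low` and `high`, the rule for what counts as redundant or noisy, and the function names are assumptions for illustration; the abstract does not give SEAD's actual partition rule or what exactly the schedule anneals.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine_weight(step, total_steps):
    """Weight that decays smoothly from 1 at step 0 to 0 at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def keep_token(teacher_probs, student_probs, low=0.1, high=1.0):
    """Decide whether a token position should receive distillation supervision.

    Assumed rule: skip positions where both models are already confident
    (redundant) or where the teacher itself is uncertain (noisy).
    """
    teacher_h = entropy(teacher_probs)
    student_h = entropy(student_probs)
    redundant = teacher_h < low and student_h < low
    noisy = teacher_h > high
    return not (redundant or noisy)
```

Under these assumed thresholds, a position where teacher and student both put 0.99 on the same token is skipped, while one where the teacher is fairly sure and the student is split evenly is kept.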

Connect with Capital One at ICML 2026

If you’re attending the conference in Seoul, we invite you to engage with our researchers and explore how we are advancing mission-inspired science.

Visit our booth: 117
Explore our research: Dive deep into our latest advancements in AI and machine learning.
Discover career opportunities: Learn about exciting applied research career paths at Capital One for researchers and engineers passionate about AI and join our world-class team.
Learn about our student and grad internships: Put your knowledge and skills to work in our 10-week to two-year graduate programs innovating new products and creatively solving the problems that impact our customers and our business.

Disclaimers & Disclosures

DISCLOSURE STATEMENT: © 2026 Capital One. Opinions are those of the individual author and are not necessarily those of Capital One. Unless noted otherwise, Capital One is not affiliated with, nor endorsed by, any third parties mentioned and is not responsible for the content or privacy policies of any linked third-party sites. Any trademarks and other intellectual property used or displayed are property of their respective owners.
