Alignment-weighted DPO
A DPO variant that targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments.
Despite recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), large language models (LLMs) remain vulnerable to various jailbreak attacks such as those rephrasing harmful intent in indirect or deceptive ways. We hypothesize that this brittleness stems from shallow alignment mechanisms that lack deep reasoning. To validate this, we perform a causal intervention by deactivating reasoning-critical neurons and observe that alignment performance remains largely unaffected, even as reasoning ability significantly deteriorates. This suggests that current alignment techniques may succeed in rejecting harmful prompts without truly understanding why they are harmful, making them susceptible to more sophisticated and deceptive jailbreak attacks. To address this, we propose strengthening alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, a reinforcement-learning approach that targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
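The abstract does not give the exact objective, so the following is only a minimal sketch of how segment-level preference weights could enter a DPO-style loss. It assumes each response comes with per-token log-probabilities under the policy and a frozen reference model, plus a mask marking final-answer tokens. The names `weighted_dpo_loss`, `w_reason`, `w_answer`, and the dictionary layout are hypothetical and are not taken from the paper.

```python
import math


def segment_logratio(policy_logps, ref_logps, mask):
    """Sum of (policy - reference) token log-probs over tokens where mask is 1."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def weighted_logratio(sample, w_reason, w_answer):
    """Weighted sum of the reasoning-segment and answer-segment log-ratios.

    sample is a dict (hypothetical layout) with:
      'policy'    : per-token log-probs under the policy
      'ref'       : per-token log-probs under the reference model
      'is_answer' : 1 for final-answer tokens, 0 for reasoning tokens
    """
    answer_mask = sample["is_answer"]
    reason_mask = [1 - m for m in answer_mask]
    reason = segment_logratio(sample["policy"], sample["ref"], reason_mask)
    answer = segment_logratio(sample["policy"], sample["ref"], answer_mask)
    return w_reason * reason + w_answer * answer


def weighted_dpo_loss(chosen, rejected, w_reason=1.0, w_answer=1.0, beta=0.1):
    """-log(sigmoid(margin)) with segment-weighted log-ratios (assumed form).

    With w_reason == w_answer == 1 this reduces to the vanilla DPO loss
    for a single preference pair.
    """
    margin = beta * (
        weighted_logratio(chosen, w_reason, w_answer)
        - weighted_logratio(rejected, w_reason, w_answer)
    )
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy pair: the chosen response differs from the reference only on its answer token.
chosen = {"policy": [-1.0, -1.0], "ref": [-1.0, -2.0], "is_answer": [0, 1]}
rejected = {"policy": [-1.0, -1.0], "ref": [-1.0, -1.0], "is_answer": [0, 1]}

vanilla = weighted_dpo_loss(chosen, rejected)
answer_heavy = weighted_dpo_loss(chosen, rejected, w_answer=2.0)
```

In this toy pair the only policy/reference difference sits in the answer segment, so raising `w_answer` enlarges the preference margin and lowers the loss, while changing `w_reason` would have no effect. How the weights are actually chosen per segment is described in the paper, not here.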
Latest publications
MR3: Multilingual rubric-agnostic reward reasoning models
A multilingual, rubric-agnostic reward reasoning model achieving the broadest language coverage in reward modeling to date.
ICLR
Your model diversity determines reasoning strategy
A framework decomposing reasoning uncertainty and deriving conditions where depth refinement outperforms parallel sampling. (ICLR)
ICLR
GenARM: Reward guided generation with autoregressive reward model for Test-time Alignment
A test-time alignment approach that leverages the Autoregressive Reward Model.
ICLR