Your model diversity determines reasoning strategy
A framework decomposing reasoning uncertainty and deriving conditions where depth refinement outperforms parallel sampling. (ICLR)
Compute scaling for LLM reasoning requires allocating budget between exploring solution approaches (breadth) and refining promising solutions (depth). Most methods implicitly trade one for the other, yet why a given trade-off works remains unclear, and validation on a single model obscures the role of the model itself. We argue that the optimal strategy depends on the model's diversity profile, that is, the spread of probability mass across solution approaches, and that this profile must be characterized before any exploration strategy is adopted. We formalize this through a theoretical framework that decomposes reasoning uncertainty, and we derive conditions under which tree-style depth refinement outperforms parallel sampling. We validate the framework on the Qwen-3 4B and Olmo-3 7B model families, showing that lightweight signals suffice for depth-based refinement on low-diversity aligned models but yield limited utility for high-diversity base models, which we hypothesize require stronger compensation for lower exploration coverage.
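To make the idea of a diversity profile concrete, here is a minimal sketch that scores how widely a set of sampled solutions spreads across solution approaches. It is an illustration only, not the paper's estimator: the function name `diversity_entropy`, the use of Shannon entropy, and the hand-assigned approach labels are all assumptions made for this example.

```python
import math
from collections import Counter


def diversity_entropy(approach_labels):
    """Shannon entropy (in bits) of the empirical distribution over approach labels.

    `approach_labels` holds one label per sampled solution. How labels are
    obtained (manual tagging, clustering, etc.) is outside this sketch.
    """
    counts = Counter(approach_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no samples: treat as zero diversity
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical samples: one model concentrates on a single approach,
# the other spreads evenly over four approaches.
low_diversity = ["algebra"] * 7 + ["geometry"]
high_diversity = ["algebra", "geometry", "casework", "induction"] * 2

print(diversity_entropy(low_diversity))
print(diversity_entropy(high_diversity))
```

Under this toy measure, a model whose samples concentrate on one approach scores near zero, while a model that spreads mass evenly over four approaches scores two bits. The abstract's claim is that which of these regimes a model falls in should inform whether budget goes to depth refinement or to parallel sampling.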
Latest publications
Alignment-weighted DPO
A DPO variant that targets the most problematic parts of an output by assigning different preference weights. (ICLR)
MR3: Multilingual rubric-agnostic reward reasoning models
A multilingual, rubric-agnostic reward reasoning model achieving the broadest language coverage in reward modeling to date. (ICLR)
EPSVec: Efficient and Private Synthetic Data Generation
A private text generation method that steers LLM generation using dataset vectors. (ICLR)