Dynamic guardian models: realtime content moderation with user-defined policies

Specialized classifiers that evaluate text based on predefined trustworthiness objectives.

ICML

July 13, 2025

Topics:

Large Language Models (LLMs), Reasoning & Chain-of-thought (CoT), Account Takeover

Large language models often exhibit safety and reliability issues in critical user-facing scenarios. While current approaches use static models to detect specific harmful categories, we propose dynamic guardian models: specialized classifiers that evaluate text based on predefined trustworthiness objectives. These models assess compliance with user-defined rules across diverse AI-mediated communication contexts. They are trained and evaluated on synthetic datasets produced by a participatory pipeline, which incorporates diverse perspectives to define appropriate AI behavior in specific contexts. We use group relative policy optimization (GRPO) to improve the model's ability to reason through rule violations and articulate justifications. Experiments show that our dynamic guardian models match static models in harm detection while identifying violations nearly as well as frontier reasoning models in a fraction of the time. This approach supports alignment with stakeholder expectations and regulatory standards while remaining adaptable across communication contexts.
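
The abstract names group relative policy optimization but does not spell out its core step. As a minimal sketch, not the authors' implementation: GRPO samples several responses to the same prompt, scores each one, and normalizes every reward against the mean and standard deviation of its own group. The reward values below are hypothetical (1.0 when a sampled verdict matches the labeled rule violation, 0.0 otherwise), and the use of the population standard deviation and the `eps` term are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar per sampled response to the same prompt.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled verdicts for one (rule, text) pair:
# three agree with the label, one does not.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive a positive advantage and are reinforced; those below receive a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.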
