Dynamic guardian models: realtime content moderation with user-defined policies
Specialized classifiers that evaluate text based on predefined trustworthiness objectives.
Large language models often exhibit safety and reliability issues in critical user-facing scenarios. While current approaches use static models to detect specific harmful categories, we propose dynamic guardian models: specialized classifiers that evaluate text based on predefined trustworthiness objectives. These models assess compliance with user-defined rules across diverse AI-mediated communication contexts through a participatory pipeline that produces synthetic datasets for training and evaluation. Our methodology incorporates diverse perspectives to define appropriate AI behavior in specific contexts. We use group relative policy optimization to improve the model's ability to reason through rule violations and articulate justifications. Experiments show our dynamic guardian models match static models in harm detection while identifying violations nearly as well as frontier reasoning models in a fraction of the time. This approach ensures alignment with stakeholder expectations and regulatory standards while providing adaptability across various communication contexts.
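As a rough illustration of the group relative policy optimization step mentioned above: the core idea is to score each sampled response against the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that advantage normalization. The function name, the binary reward scheme, and the use of the population standard deviation are assumptions made for illustration; they are not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per response sampled for the same prompt.
    `eps` avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled judgments on one (rule, text) pair:
# 1.0 = the judgment matched the reference label, 0.0 = it did not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
```

Responses that beat their group's average receive a positive advantage and are reinforced; the one incorrect judgment receives a negative advantage. When all responses in a group earn the same reward, every advantage is zero and that group contributes no learning signal.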
Latest publications
LLM-SRBench: A new benchmark for scientific equation discovery with Large Language Models
A comprehensive benchmark designed to evaluate LLM-based scientific equation discovery methods.
ICML
Zero-shot meta-learning for tabular prediction tasks with adversarially pre-trained transformer
Introducing APT, an Adversarially Pre-trained Transformer achieving SOTA on small tabular tasks.
ICML
Position: supervised classifiers answer the wrong questions for OOD detection
A critical re-examination of popular out-of-distribution (OOD) detection procedures.
ICML