EPSVec: Efficient and private synthetic data generation via dataset vectors
A lightweight, differentially private alternative to private fine-tuning and private inference that steers LLM generation using dataset vectors.
Access to high-quality data is often the key bottleneck in modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic text offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods, such as private fine-tuning and private inference, are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a lightweight alternative that steers LLM generation using dataset vectors: directions in activation space that capture the distributional gap between private data and public priors. Unlike inference-time aggregation baselines, which incur a high privacy cost per token, EPSVec extracts and sanitizes its steering vectors just once. This decouples the privacy budget from generation, enabling unlimited synthetic sampling and high utility even in low-data regimes. Experiments show that EPSVec generates synthetic corpora that match the real-data distribution more closely than baseline methods do, while eliminating model retraining and allowing scalable generation even in low-data regimes.
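The abstract does not spell out how a dataset vector is built or sanitized, so the following is only a rough sketch of the general idea under stated assumptions. It assumes the vector is the difference between the mean private activation and the mean public activation, that each private activation is clipped to a fixed L2 norm, and that Gaussian noise scaled to the clipped mean's sensitivity is added once. The function names (`sanitized_dataset_vector`, `steer`) and parameters (`clip_norm`, `noise_multiplier`, `strength`) are hypothetical, and mapping `noise_multiplier` to a concrete (ε, δ) guarantee is left out.

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def sanitized_dataset_vector(private_acts, public_acts, clip_norm,
                             noise_multiplier, rng):
    """Assumed construction: noisy clipped private mean minus public mean.

    This is an illustrative stand-in, not the published EPSVec procedure.
    """
    clipped = [clip_to_norm(a, clip_norm) for a in private_acts]
    private_mean = mean_vector(clipped)
    public_mean = mean_vector(public_acts)
    # Replacing one private record moves the clipped mean by at most
    # 2 * clip_norm / n in L2 norm, so the noise scale is tied to that bound.
    sensitivity = 2.0 * clip_norm / len(private_acts)
    sigma = noise_multiplier * sensitivity
    return [p - q + rng.gauss(0.0, sigma)
            for p, q in zip(private_mean, public_mean)]


def steer(hidden, vector, strength):
    """Shift a hidden state along the (already sanitized) dataset vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]


if __name__ == "__main__":
    rng = random.Random(0)
    private_acts = [[3.0, 4.0], [0.5, 0.2], [1.0, -1.0]]  # toy activations
    public_acts = [[0.1, 0.0], [0.0, 0.1]]
    vec = sanitized_dataset_vector(private_acts, public_acts,
                                   clip_norm=1.0, noise_multiplier=1.0, rng=rng)
    # The private data is touched only once; the vector can then be reused.
    for hidden in ([0.0, 0.0], [1.0, 1.0]):
        print(steer(hidden, vec, strength=2.0))
```

In this sketch the private activations are read only inside `sanitized_dataset_vector`. Every later call to `steer` reuses the already-noised vector, which is the sense in which the privacy cost does not grow with the number of generated samples.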
Latest publications
Critique-guided distillation for robust reasoning via refinement
A training framework that decouples critique consumption from critique generation.
ICML
Dynamic guardian models: realtime content moderation with user-defined policies
Specialized classifiers that evaluate text based on predefined trustworthiness objectives.
ICML
LLM-SRBench: A new benchmark for scientific equation discovery with Large Language Models
A comprehensive benchmark designed to evaluate LLM-based scientific equation discovery methods.
ICML