EPSVec: Efficient and private synthetic data generation via dataset vectors

A lightweight, differentially private alternative that steers LLM generation using dataset vectors.

ICML

July 6, 2026

Topics:

Fine-Tuning Large Language Models (LLMs)

Access to high-quality data is often the key bottleneck in modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic text offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods, such as private fine-tuning and private inference, are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a lightweight alternative that steers LLM generation using dataset vectors: directions in activation space that capture the distributional gap between private data and public priors. Unlike other inference-time aggregation baselines that incur high privacy costs per token, EPSVec extracts and sanitizes steering vectors just once. This decouples the privacy budget from generation, enabling unlimited synthetic sampling and high utility even in low-data regimes. Experiments show that EPSVec generates synthetic corpora that more closely match the real-data distribution than baseline methods, while eliminating model retraining and allowing scalable generation even in low-data regimes.
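To make the idea of a sanitized dataset vector concrete, here is a minimal sketch. It is not the EPSVec algorithm, whose details the abstract does not give. It assumes one plausible construction: take the per-example difference between private activations and the mean public activation, clip each difference to a fixed L2 norm, sum, add Gaussian noise, and average. The function names `dp_dataset_vector` and `steer`, the parameters `clip_norm`, `noise_multiplier`, and `strength`, and the use of a single activation vector per example are all hypothetical choices made for illustration. The sketch also treats the number of private examples as public and omits calibrating the noise to a target privacy budget.

```python
import numpy as np


def dp_dataset_vector(private_acts, public_acts, clip_norm=1.0,
                      noise_multiplier=1.0, rng=None):
    """Illustrative noisy steering direction (an assumption, not the paper's method).

    private_acts: (n, d) array, one activation vector per private example.
    public_acts:  (m, d) array of activations from public or prior text.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Direction from the public prior toward each private example.
    public_mean = public_acts.mean(axis=0)
    diffs = private_acts - public_mean

    # Clip each example's contribution to L2 norm <= clip_norm, so adding or
    # removing one example changes the sum by at most clip_norm.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = diffs * scale

    # Gaussian noise on the clipped sum, scaled to that sensitivity.
    # Mapping noise_multiplier to an (epsilon, delta) guarantee is omitted.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])

    # Assumes n is public; the noise is added once, not per generated token.
    return (clipped.sum(axis=0) + noise) / private_acts.shape[0]


def steer(hidden, vector, strength=1.0):
    """Shift hidden states along the sanitized vector (hypothetical steering rule)."""
    return hidden + strength * vector
```

Under these assumptions, the noise is paid for once when the vector is built. Because differential privacy is preserved under post-processing, reusing that vector in `steer` for any number of generations adds no further privacy cost, which is the decoupling the abstract describes.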
