Building safe GenAI applications: An end-to-end overview of red teaming for Large Language Models
A survey covering attack methods, evaluation, metrics, and tools for identifying and mitigating GenAI application vulnerabilities.
The rapid growth of Large Language Models (LLMs) presents significant privacy, security, and ethical concerns. While much research has proposed methods for defending LLM systems against misuse by malicious actors, researchers have recently complemented these efforts with an offensive approach, red teaming, i.e., proactively attacking LLMs in order to identify their vulnerabilities. This paper provides a concise and practical overview of the LLM red teaming literature, structured to describe a multi-component system end-to-end. To motivate red teaming, we survey the initial safety needs of some high-profile LLMs, and then examine the different components of a red teaming system as well as software packages for implementing them. We cover various attack methods, strategies for attack-success evaluation, metrics for assessing experiment outcomes, and a host of other considerations. Our survey will be useful for any reader who wants to quickly grasp the major red teaming concepts for use in their own practical applications.
Latest publications
PoisonedParrot: subtle data poisoning attacks to elicit copyright-infringing content from Large Language Models
A stealthy data poisoning attack that induces an LLM to generate copyrighted content.
NAACL
ProxyLM: predicting language model performance on multilingual tasks via proxy models
A scalable framework using proxy models to efficiently predict the performance of multilingual language models on NLP tasks.
NAACL
WorldCuisines: a massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines
A massive multilingual and multicultural visual question answering benchmark for evaluating VLMs on global cuisines.
NAACL