PoisonedParrot: subtle data poisoning attacks to elicit copyright-infringing content from Large Language Models
A stealthy data poisoning attack that induces an LLM to generate copyrighted content.
As the capabilities of large language models (LLMs) continue to expand, their usage has become increasingly prevalent. However, as reflected in numerous ongoing lawsuits regarding LLM-generated content, addressing copyright infringement remains a significant challenge. In this paper, we introduce PoisonedParrot: the first stealthy data poisoning attack that induces an LLM to generate copyrighted content even when the model has not been directly trained on the specific copyrighted material. PoisonedParrot integrates small fragments of copyrighted text into the poison samples using an off-the-shelf LLM. Despite its simplicity, PoisonedParrot proves surprisingly effective across a wide range of experiments at priming the model to generate copyrighted content, with no discernible side effects. Moreover, we find that existing defenses are largely ineffective against our attack. Finally, we make the first attempt at mitigating copyright-infringement poisoning attacks by proposing a defense: ParrotTrap. We encourage the community to explore this emerging threat model further.
Latest publications
ProxyLM: predicting language model performance on multilingual tasks via proxy models
A scalable framework using proxy models to efficiently predict the performance of multilingual language models on NLP tasks.
NAACL
Building safe GenAI applications: An end-to-end overview of red teaming for Large Language Models
A survey covering attack methods, evaluation, metrics, and tools for identifying and mitigating GenAI application vulnerabilities.
NAACL
WorldCuisines: a massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines
A massive multilingual and multicultural visual question answering benchmark for evaluating VLMs on global cuisines.
NAACL