Prompt Injection Through Poetry: A New LLM Jailbreak Technique
Researchers have discovered a surprisingly effective method for bypassing safety mechanisms in Large Language Models (LLMs): crafting prompts in the form of poetry. This technique, detailed in a recent paper, demonstrates a significant vulnerability across a wide range of models.
The Discovery: Adversarial Poetry as a Jailbreak
A new research paper, “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” reveals that transforming harmful prompts into poetic form dramatically increases their success rate in eliciting prohibited responses from LLMs. The study found that poetic prompts consistently outperformed their prose counterparts, with attack-success rates exceeding 90% for some providers.
Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offense, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting essential limitations in current alignment methods and evaluation protocols.
The researchers tested this technique across 25 different LLMs, including both proprietary and open-weight models. The results consistently showed a significant increase in the success rate of harmful prompts when presented as poetry.
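As the abstract notes, outputs were graded by an ensemble of three open-weight LLM judges whose binary safety verdicts were combined by majority vote. Below is a minimal sketch of that voting logic in Python; the stand-in heuristic judges and all names here are illustrative assumptions, not the paper’s actual judges or rubric.

```python
from collections import Counter
from typing import Callable

# A judge maps (prompt, response) -> True if the response is unsafe,
# i.e. it substantively fulfills the harmful request.
Judge = Callable[[str, str], bool]

def majority_unsafe(judges: list[Judge], prompt: str, response: str) -> bool:
    """Binary jailbreak verdict: majority vote across an odd number of judges."""
    votes = Counter(judge(prompt, response) for judge in judges)
    return votes[True] > votes[False]

# Crude heuristic stand-ins for demonstration only; the paper's judges
# are three open-weight LLMs prompted to make binary safety assessments.
demo_judges: list[Judge] = [
    lambda p, r: "step 1" in r.lower(),
    lambda p, r: len(r) > 200,
    lambda p, r: "cannot help" not in r.lower(),
]

print(majority_unsafe(demo_judges, "example prompt", "I cannot help with that."))
# -> False: all three stand-in judges treat this response as a refusal
```

In the paper, the ensemble’s verdicts were additionally validated against a stratified human-labeled subset, which is what lends the reported success rates their credibility.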
Understanding the Risks: What Domains are Affected?
The vulnerability isn’t limited to a single type of harmful request. The study mapped successful poetic attacks to established risk taxonomies, including:
- CBRN: Chemical, Biological, Radiological, and Nuclear threats. The ability to generate instructions related to these dangerous areas is a serious concern.
- Manipulation: Prompts designed to influence or deceive individuals.
- Cyber-Offense: Requests for data or instructions related to hacking or malicious cyber activity.
- Loss-of-Control: Scenarios where the LLM could be prompted to generate outputs that lead to unintended or harmful consequences.
How It Works: Prose vs. Verse
The researchers employed a two-pronged approach:
- Hand-Crafted Poems: A small set of 20 poems was manually created to test the core hypothesis: that poetic structure alone could alter an LLM’s refusal behavior.
- Meta-Prompt Conversion: A larger dataset of 1,200 harmful prompts from MLCommons was automatically converted into verse using a dedicated LLM “meta-prompt.”
The key finding was that the poetic framing consistently bypassed safety mechanisms. The meta-prompt conversion method achieved an average jailbreak success rate of 43%, compared to considerably lower rates for non-poetic baselines. In some cases, the poetic versions were up to 18 times more successful at eliciting harmful responses.
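To make these numbers concrete, here is a minimal sketch of how attack-success rate (ASR) and the uplift factor would be computed from per-prompt judge verdicts. The verdict counts below are hypothetical, chosen only so the arithmetic reproduces the paper’s roughly 18x figure; they are not the paper’s data.

```python
def attack_success_rate(verdicts: list[bool]) -> float:
    """Fraction of prompts whose responses were judged unsafe."""
    return sum(verdicts) / len(verdicts)

# Hypothetical verdicts for the same 1,000 prompts in prose vs. verse form.
prose_verdicts = [True] * 24 + [False] * 976   # ASR = 2.4%
verse_verdicts = [True] * 432 + [False] * 568  # ASR = 43.2%

prose_asr = attack_success_rate(prose_verdicts)
verse_asr = attack_success_rate(verse_verdicts)
print(f"prose ASR: {prose_asr:.1%}, verse ASR: {verse_asr:.1%}, "
      f"uplift: {verse_asr / prose_asr:.0f}x")
# -> prose ASR: 2.4%, verse ASR: 43.2%, uplift: 18x
```

Because ASR is a simple ratio, the uplift factor depends heavily on how low the prose baseline is; a small absolute baseline makes very large multipliers possible.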
The Role of Stylistic Variation
The study suggests that the stylistic variation inherent in poetry – metaphor, imagery, narrative framing – is the key to circumventing current safety mechanisms. LLMs appear to be less effective at identifying and blocking harmful intent when it’s expressed through artistic language.
Data & Results: A Closer Look
| Method | Average Jailbreak Success Rate |
|---|---|
| Hand-crafted poems (20 curated prompts) | 62% |
| Meta-prompt conversions (1,200 MLCommons prompts) | ~43% |
| Non-poetic prose baselines | Substantially lower (verse ASRs up to 18x higher) |
