Prompt Injection Through Poetry: A New LLM Jailbreak Technique
Researchers have discovered a surprisingly effective method for bypassing safety mechanisms in Large Language Models (LLMs): crafting prompts in the form of poetry. This technique, detailed in a recent paper, demonstrates a significant vulnerability across a wide range of models.
The Discovery: Adversarial Poetry as a Jailbreak
A new research paper, “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” reveals that transforming harmful prompts into poetic form dramatically increases their success rate in eliciting prohibited responses from LLMs. The study found that poetic prompts consistently outperformed their prose counterparts, with attack-success rates exceeding 90% for some providers.
Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offense, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting essential limitations in current alignment methods and evaluation protocols.
The researchers tested this technique across 25 different LLMs, including both proprietary and open-weight models. The results consistently showed a significant increase in the success rate of harmful prompts when presented as poetry.
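As the abstract notes, outputs were graded by an ensemble of three open-weight LLM judges whose binary safety verdicts were combined by majority vote. Below is a minimal sketch of that voting logic in Python; the stand-in heuristic judges and all names here are illustrative assumptions, not the paper’s actual judges or rubric.

```python
from collections import Counter
from typing import Callable

# A judge maps (prompt, response) -> True if the response is unsafe,
# i.e. it substantively fulfills the harmful request.
Judge = Callable[[str, str], bool]

def majority_unsafe(judges: list[Judge], prompt: str, response: str) -> bool:
    """Binary jailbreak verdict: majority vote across an odd number of judges."""
    votes = Counter(judge(prompt, response) for judge in judges)
    return votes[True] > votes[False]

# Crude heuristic stand-ins for demonstration only; the paper's judges
# are three open-weight LLMs prompted to make binary safety assessments.
demo_judges: list[Judge] = [
    lambda p, r: "step 1" in r.lower(),
    lambda p, r: len(r) > 200,
    lambda p, r: "cannot help" not in r.lower(),
]

print(majority_unsafe(demo_judges, "example prompt", "I cannot help with that."))
# -> False: all three stand-in judges treat this response as a refusal
```

In the paper, the ensemble’s verdicts were additionally validated against a stratified human-labeled subset, which is what lends the reported success rates their credibility.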
Understanding the Risks: What Domains are Affected?
The vulnerability isn’t limited to a single type of harmful request. The study mapped successful poetic attacks to established risk taxonomies, including:
- CBRN: Chemical, Biological, Radiological, and Nuclear threats. The ability to generate instructions related to these dangerous areas is a serious concern.
- Manipulation: Prompts designed to influence or deceive individuals.
- Cyber-Offense: Requests for data or instructions related to hacking or malicious cyber activity.
- Loss-of-Control: Scenarios where the LLM could be prompted to generate outputs that lead to unintended or harmful consequences.
How It Works: Prose vs. Verse
The researchers employed a two-pronged approach:
- Hand-Crafted Poems: A small set of 20 poems was manually created to test the core hypothesis: that poetic structure alone could alter an LLM’s refusal behavior.
- Meta-Prompt Conversion: A larger dataset of 1,200 harmful prompts from MLCommons was automatically converted into verse using a dedicated LLM “meta-prompt.”
The key finding was that the poetic framing consistently bypassed safety mechanisms. The meta-prompt conversion method achieved an average jailbreak success rate of 43%, compared to considerably lower rates for non-poetic baselines. In some cases, the poetic versions were up to 18 times more successful at eliciting harmful responses.
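To make these numbers concrete, here is a minimal sketch of how attack-success rate (ASR) and the uplift factor would be computed from per-prompt judge verdicts. The verdict counts below are hypothetical, chosen only so the arithmetic reproduces the paper’s roughly 18x figure; they are not the paper’s data.

```python
def attack_success_rate(verdicts: list[bool]) -> float:
    """Fraction of prompts whose responses were judged unsafe."""
    return sum(verdicts) / len(verdicts)

# Hypothetical verdicts for the same 1,000 prompts in prose vs. verse form.
prose_verdicts = [True] * 24 + [False] * 976   # ASR = 2.4%
verse_verdicts = [True] * 432 + [False] * 568  # ASR = 43.2%

prose_asr = attack_success_rate(prose_verdicts)
verse_asr = attack_success_rate(verse_verdicts)
print(f"prose ASR: {prose_asr:.1%}, verse ASR: {verse_asr:.1%}, "
      f"uplift: {verse_asr / prose_asr:.0f}x")
# -> prose ASR: 2.4%, verse ASR: 43.2%, uplift: 18x
```

Because ASR is a simple ratio, the uplift factor depends heavily on how low the prose baseline is; a small absolute baseline makes very large multipliers possible.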
The Role of Stylistic Variation
The study suggests that the stylistic variation inherent in poetry – metaphor, imagery, narrative framing – is the key to circumventing current safety mechanisms. LLMs appear to be less effective at identifying and blocking harmful intent when it’s expressed through artistic language.
Data & Results: A Closer Look
| Method | Average Jailbreak Success Rate |
|---|---|
| Hand-crafted poems (20 curated prompts) | 62% |
| Meta-prompt conversions (1,200 MLCommons prompts) | ~43% |
| Non-poetic prose baselines | Substantially lower (verse ASRs up to 18x higher) |
