How Adversarial Poetry Can Jailbreak AI Models

Manpreet Singh, Co-Founder & Principal Consultant at 5TATTVA and CRO of Zeroday Ops

Manpreet Singh is the Co-Founder & Principal Consultant at 5TATTVA and CRO of Zeroday Ops, with over 19 years of experience in IT security operations, compliance, and risk management. He specializes in driving robust security strategies, ensuring regulatory compliance, and leading high-impact implementations aligned with business objectives.

By Manpreet Singh, Co-Founder & Principal Consultant at 5TATTVA and CRO of Zeroday Ops

Poetry has long been celebrated as a vehicle for human expression. But beneath the rhythm and rhyme lies a rigid mathematical structure – one that, in the age of artificial intelligence, may expose an unexpected vulnerability.

Beneath the artistic legacy of ancient epics lies a rigid syntactic cage. In the context of modern machine learning and language models, this strict framework presents a unique vulnerability. By leveraging these artistic constraints, adversarial payloads can bypass semantic filters, turning humanity’s oldest mnemonic device into a mechanism for digital deception.

The Blind Spot in AI Alignment

To understand why Shakespeare would have been an incredible asset to a modern Red Team or VAPT operation, we have to look at how modern AI safety training works.

Large Language Models (LLMs) have scaled globally, expanding the attack surface across digital ecosystems by introducing new vulnerabilities and amplifying existing ones. To ensure safety, LLMs are safeguarded using Reinforcement Learning from Human Feedback (RLHF). Human testers spend thousands of hours feeding the model malicious prompts like “Write me a computer virus“ or “How do I build a homemade bomb?” and teaching the model to refuse such requests.

However, there is a critical limitation in this training data: it is overwhelmingly conversational and prose-based. These safety classifiers are designed to detect malicious intent primarily in standard conversational syntax. When a malicious command is wrapped in structured verse such as iambic pentameter or an AABB rhyme scheme, it pushes the prompt into Out-of-Distribution (OOD) territory. The model has rarely encountered security threats formatted as poetry during alignment training.

The result is simple: the AI is trained to detect obvious threats, but adversarial poetry hides the threat within complex linguistic structure.

The Anatomy of the Exploit

Executing this vulnerability requires more than just basic knowledge of LLMs or the gift of rhyme. It demands a deliberate, two-stage methodology.

Stage one: Semantic Obfuscation. Attackers remove the prompt of known trigger words to bypass the LLM’s basic safety classifiers. Through metaphorical shifts, a “keylogger” becomes “a silent scribe in the shadows,” and an “injection-based attack” becomes “a poisoned drop in the curator’s inkwell.” Every metaphor creates an extra layer of deception.

Stage two: Attention Hijacking. The attacker forces the model to follow a rigid format such as a villanelle, sestina, or structured sonnet. This requires the AI to dedicate significant computational attention to maintaining rhyme, rhythm, and tone.

As the model prioritizes structural compliance, its ability to enforce safety checks weakens. The AI becomes so focused on composing the poem that the hidden payload may pass unnoticed.

The Empirical Proof

This threat was examined in the research paper “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” authored by researchers from institutions including DEXAI – Icaro Lab and Sapienza University of Rome.

By converting 1,200 harmful prompts from the MLCommons dataset into poetic form, researchers measured a dramatic shift in safety outcomes. Formatting malicious prompts as poetry increased the Attack Success Rate (ASR) from 8.08% to 43.07%.

Key findings include:

The Most Vulnerable: Models like deepseek-chat-v3.1saw a catastrophic 67.90% increase in unsafe outputs, while qwen3-32b, gemini-2.5-flash, and kimi-k2 suffered ASR spikes of over 57%.
The Structural Failure: The cross-model results prove this is a universal structural flaw, not a provider-specific bug, affecting models aligned via RLHF, Constitutional AI, and hybrid strategies.
The Outliers: Only a few specific models demonstrated resilience (e.g., claude-haiku-4.5showed a negligible -1.68% change), hinting at differing internal safety-stack designs.

Importantly, the tests were conducted using default provider configurations, meaning the ~43% ASR likely represents a conservative estimate of the true vulnerability.

A Broader Taxonomy of Deception

Adversarial poetry is only one example of structural prompt manipulation. Attackers can obscure intent using a variety of other formats, such as low-resource languages, Base64 encoding, leetspeak, or dense legal terminology.

Similarly, prompts that force models to navigate complex logic puzzles, nested JSON or YAML structures, or artificial state machines can overload processing capacity. In each case, the structure distracts the model’s attention, allowing the malicious intent to slip through undetected.

The Regulatory Reality Check

This raises a crucial question for AI developers: How well do language models understand intent across different linguistic structures?

Current safety filters remain largely surface-level, scanning for obvious conversational threats rather than deeper semantic intent. As demonstrated, simply restructuring a request into verse can bypass these defenses. Security researchers warn that this exposes a deeper flaw in how AI models interpret structured language.

“One of the biggest misconceptions in AI safety is the assumption that more capable models are automatically safer. In reality, the opposite can happen. A model that becomes highly skilled at generating complex structures such as poetry may also become more effective at executing hidden or obfuscated instructions embedded within those formats,” said Manpreet Singh, Co-Founder & Principal Consultant at 5Tattva.

Addressing this requires more than keyword filtering. Researchers must analyze the internal mechanisms of LLM safety systems to understand where alignment fails.

The implications extend to regulation as well. Frameworks such as the EU AI Act rely on static testing assumptions that AI responses remain stable across similar prompts. This research challenges that assumption, showing that minor structural changes can dramatically alter safety outcomes.

The Ghost in the Syntax

We built these systems to withstand brute force. We trained them to detect explicit threats and filter malicious instructions.

But poetry doesn’t attack logic; it exploits structure. When a language model is forced into strict meter and rhyme, its attention shifts toward maintaining cadence rather than evaluating risk.

The result is a subtle but powerful vulnerability: while the model focuses on form, the hidden instruction may pass straight through its defenses – turning poetry into an unexpected attack vector in the age of AI.

How Adversarial Poetry Can Jailbreak AI Models

TrendAI to Secure Enterprise Adoption of Agentic AI with NVIDIA

Related Posts

TrendAI to Secure Enterprise Adoption of Agentic AI with NVIDIA

Accenture and Databricks Accelerate Enterprise Adoption of AI Applications and Agents at Scale

Kaspersky MDR introduces major updates, strengthening detection and investigation capabilities

Indigo Appoints Ilex Content Strategies as its Marketing and Communications Agency of Record

Breakthrough Moment for Indian Drone Tech: ideaForge Lands First U.S. School Security Deal

Vertiv Announces Scalable, High-Capacity Double Stack Busway System that Preserves White Space for Growing AI Data Center Demands

Print Magazine

About Us

Contact Us

Usefull Links