Adversarial Versification in Portuguese as a Jailbreak Operator in LLMs
DOI:
https://doi.org/10.1590/SciELOPreprints.14563Keywords:
adversarial versification, LLM jailbreak, guardrail vulnerabilities, model alignmentAbstract
Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study “Adversarial poetry as a universal single-turn jailbreak mechanism in large language models” demonstrates that instructions routinely refused in prose become actionable when rewritten as verse, yielding “up to 18×” more safety failures in benchmarks derived from MLCommons AILuminate. Human-crafted poems reach approximately 62% ASR, and automated poetic conversions ~43%, with some models surpassing 90% single-turn jailbreak success. The effect is structural: systems trained with RLHF, Constitutional AI, and hybrid alignment pipelines exhibit consistent degradation under minimal semiotic–formal variation. Versification shifts the prompt into sparsely supervised latent regions, revealing guardrails that depend heavily on surface-level patterns rather than abstract harmful intent. This discrepancy between apparent robustness and real-world vulnerability exposes deep limitations in current alignment regimes. The absence of adversarial-poetry evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterize scansion, metrical patterns, and prosodic variation to test for vulnerabilities specific to Lusophone poetic structures, which remain entirely unexplored.
Downloads
Posted
How to Cite
Section
Copyright (c) 2026 Joao Queiroz

This work is licensed under a Creative Commons Attribution 4.0 International License.
Reviews
No Reviews Available
Plaudit
Data statement
-
The research data is contained in the manuscript


