LinguagemSimples: Automatic Simplification of Court Decisions with Large Language Models

João Pedro Sansão; Michel Leles

doi:10.1590/SciELOPreprints.16575

Authors

João Pedro Sansão, Federal University of São João del-Rei, https://orcid.org/0000-0003-0095-2629
- Conceptualization
- Data Curation
- Formal Analysis
- Methodology
- Software
- Supervision
- Validation
- Visualization
- Writing – Original Draft Preparation
- Writing – Review & Editing
Michel Leles, Federal University of São João del-Rei
- Conceptualization
- Formal Analysis
- Supervision

Keywords:

Plain Language, Legal NLP, Language Models, Court Decisions, Simplification Evaluation

Abstract

The legal language of Brazilian court decisions, marked by Latinisms, technical jargon, and nested subordinate clauses, severely hinders comprehension by the average citizen. This paper presents LinguagemSimples, a pipeline for the automatic simplification of court decisions using large language models (LLMs). Sixteen techniques were evaluated on 100 real decisions of the Brazilian Supreme Federal Court (STF) across consumer, family, and social security law: a lexical rule-based baseline; Big Pickle, Nemotron 3 Ultra, DeepSeek V4 Flash, and Qwen 2.5 7B, each under Few-Shot (FS), Zero-Shot (ZS), and Chain-of-Thought (CoT) prompting; and GPT-5.4 Mini, GPT-5.4 (full), and Gemini 3.5 Flash under Few-Shot prompting only. Metrics include readability (Adapted Flesch, Gunning-Fog), lexical similarity (ROUGE), and semantic preservation (BERTScore). In addition, an LLM-as-Judge analysis using GPT-5.4 Mini evaluated 1,500 simplified outputs across five error categories. All LLMs outperformed the rule-based baseline, which actually reduced readability (-1.6 Flesch points). DeepSeek V4 Flash and Big Pickle achieved the highest readability gains (+24.3 points each), while Qwen 2.5 7B Zero-Shot led in semantic preservation (BERTScore F1 = 0.748 with mBERT). Chain-of-Thought proved counterproductive for every model tested with it, and Few-Shot was the most effective prompting strategy. GPT-5.4 Mini offered the best latency-quality trade-off (+16.4 Flesch gain, 0.697 BERTScore F1, ~2.5 s/doc), and GPT-5.4 (full) achieved the highest ROUGE-1 (0.583) and the second-highest BERTScore (0.713). The LLM-as-Judge analysis revealed hallucination rates ranging from 7% (GPT-5.4 full) to 49% (Qwen 2.5 7B FS), with nuance loss the most frequent error category across all techniques. Consumer law proved the most favorable domain for simplification (+28.2 points), while family law was the most challenging. The corpus and code are publicly available.
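
The experimental grid described in the abstract can be checked with a short sketch. The model and strategy names come from the abstract; the variable names and the `build_techniques` helper are hypothetical and are not taken from the released code. The sketch also assumes that the 1,500 judged outputs are the 15 LLM-based techniques applied to each of the 100 decisions, with the rule-based baseline excluded. The abstract does not state this, but the arithmetic is consistent with it.

```python
from itertools import product

# Models evaluated under all three prompting strategies (names from the abstract).
FULL_GRID_MODELS = ["Big Pickle", "Nemotron 3 Ultra", "DeepSeek V4 Flash", "Qwen 2.5 7B"]
STRATEGIES = ["Few-Shot", "Zero-Shot", "CoT"]

# Models evaluated with Few-Shot prompting only.
FEW_SHOT_ONLY_MODELS = ["GPT-5.4 Mini", "GPT-5.4 (full)", "Gemini 3.5 Flash"]

BASELINE = "Lexical rules"
N_DECISIONS = 100  # STF decisions in the corpus


def build_techniques():
    """Enumerate the techniques listed in the abstract (hypothetical helper)."""
    llm = [f"{model} ({strategy})" for model, strategy in product(FULL_GRID_MODELS, STRATEGIES)]
    llm += [f"{model} (Few-Shot)" for model in FEW_SHOT_ONLY_MODELS]
    return [BASELINE] + llm


techniques = build_techniques()
llm_techniques = [t for t in techniques if t != BASELINE]

# Assumption: the judge scored every LLM output and skipped the rule-based baseline.
judged_outputs = len(llm_techniques) * N_DECISIONS

print(len(techniques))   # 16 techniques in total
print(judged_outputs)    # 1500 judged outputs
```

Four models under three strategies give 12 techniques, the three Few-Shot-only models bring the LLM total to 15, and the rule-based baseline makes 16.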