Preprint / Version 1

LinguagemSimples: Automatic Simplification of Court Decisions with Large Language Models

Authors

  • João Pedro Sansão, Federal University of São João del-Rei, https://orcid.org/0000-0003-0095-2629
    • Conceptualization
    • Data Curation
    • Formal Analysis
    • Methodology
    • Software
    • Supervision
    • Validation
    • Visualization
    • Writing – Original Draft Preparation
    • Writing – Review & Editing
  • Michel Leles, Federal University of São João del-Rei
    • Conceptualization
    • Formal Analysis
    • Supervision

DOI:

https://doi.org/10.1590/SciELOPreprints.16575

Keywords:

Plain Language, Legal NLP, Language Models, Court Decisions, Simplification Evaluation

Abstract

The legal language of Brazilian court decisions, marked by Latinisms, technical jargon, and nested subordinate clauses, severely hinders comprehension by the average citizen. This paper presents LinguagemSimples, a pipeline for the automatic simplification of court decisions using large language models (LLMs). Sixteen techniques were evaluated: lexical rules (the rule-based baseline); Big Pickle, Nemotron 3 Ultra, DeepSeek V4 Flash, and Qwen 2.5 7B, each with Few-Shot (FS), Zero-Shot (ZS), and Chain-of-Thought (CoT) prompting; and GPT-5.4 Mini, GPT-5.4 (full), and Gemini 3.5 Flash with FS prompting only. The evaluation used 100 real decisions from the STF (Supremo Tribunal Federal, Brazil's Supreme Federal Court) across consumer, family, and social security law. Metrics include readability (Adapted Flesch, Gunning-Fog), lexical similarity (ROUGE), and semantic preservation (BERTScore). Additionally, an LLM-as-Judge analysis (GPT-5.4 Mini) evaluated 1,500 simplified outputs across five error categories. All LLMs outperformed the rule-based baseline, which actually reduced readability (-1.6 Flesch points). DeepSeek V4 Flash and Big Pickle achieved the highest readability gains (+24.3 points each), while Qwen 2.5 7B ZS led in semantic preservation (BERTScore mBERT F1=0.748). CoT proved counterproductive for every model on which it was tested, and FS was the most effective prompting strategy. GPT-5.4 Mini offered the best latency-quality trade-off (+16.4 Flesch gain, 0.697 BERTScore F1, ~2.5 s/doc), and GPT-5.4 (full) achieved the highest ROUGE-1 (0.583) and second-highest BERTScore (0.713). The LLM-as-Judge analysis revealed hallucination rates ranging from 7% (GPT-5.4 full) to 49% (Qwen 2.5 7B FS), with nuance loss being the most frequent error category across all techniques. Consumer law proved the most favorable domain for simplification (+28.2 points), while family law was the most challenging. The corpus and code are publicly available.
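For readers unfamiliar with the lexical similarity metric named in the abstract, the following is a minimal sketch of ROUGE-1 as clipped unigram overlap between two texts. It is not the authors' implementation: the function names `tokenize` and `rouge1`, the simple word-level tokenizer, and the choice of which text serves as the reference are all assumptions made for illustration, and the paper's actual ROUGE configuration may differ.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer: lowercase, then keep runs of word characters.
    # Python's \w is Unicode-aware for str, so accented letters are preserved.
    return re.findall(r"\w+", text.lower())


def rouge1(reference: str, candidate: str) -> dict[str, float]:
    # Count unigrams in each text.
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))

    # Clipped overlap: each word counts at most as often as it
    # appears in both texts (multiset intersection).
    overlap = sum((ref_counts & cand_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy sentences, not taken from the corpus.
    scores = rouge1("o réu deve pagar a multa", "o réu paga a multa")
    print(scores)
```

A high ROUGE-1 indicates that the simplified text reuses much of the source vocabulary. It says nothing about meaning, which is why the abstract pairs it with BERTScore for semantic preservation.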

Posted

06/18/2026

How to Cite

LinguagemSimples: Automatic Simplification of Court Decisions with Large Language Models. (2026). In SciELO Preprints. https://doi.org/10.1590/SciELOPreprints.16575

Section

Exact and Earth Sciences

Data statement

  • The research data is available in one or more data repository(ies)