Preprint / Version 1

Jabuticaba: The largest commercial corpus for LLMs in Portuguese

DOI:

https://doi.org/10.1590/SciELOPreprints.12696

Keywords:

Datasets, Large Language Model, Artificial intelligence, Natural Language Processing

Abstract

Large Language Models (LLMs) are a step towards intelligent communication systems: they draw on large repositories of written human knowledge to better predict and understand the world. Artificial Intelligence sovereignty, however, depends on quality data, because datasets are the foundational infrastructure that sustains LLM development. This paper therefore presents the Jabuticaba dataset, the most extensive Portuguese-language corpus for LLMs, with a total size of 669 GB and over 139 billion tokens of clean, deduplicated text that is ready for use, including commercial use. Jabuticaba is comparable in size to, and in some cases larger than, state-of-the-art (SOTA) datasets in other languages. The paper details the methodological pipeline used to build the corpus, both as a comprehensive reference for researchers in academia and industry and as a contribution to future studies. Resources are freely available on Hugging Face: https://huggingface.co/datasets/soberania/jabuticaba.
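
As a rough illustration of the figures and the repository named in the abstract, the sketch below computes the average bytes per token implied by the headline numbers and shows one way the corpus might be sampled without downloading all of it. It assumes that 669 GB means 669e9 bytes and that the repository exposes a split named "train"; neither is stated in the abstract, and the helper names `bytes_per_token` and `peek` are hypothetical. Because the abstract reports "over 139 billion" tokens, the ratio of roughly 4.8 bytes per token is an upper estimate.

```python
from itertools import islice


def bytes_per_token(total_bytes: float, total_tokens: float) -> float:
    """Average number of corpus bytes per token."""
    return total_bytes / total_tokens


def peek(n: int = 3, split: str = "train"):
    """Return the first n records, streamed rather than fully downloaded.

    Assumption: a split called "train" exists; the dataset card is the
    authority on the actual split and column names.
    """
    # Imported lazily so the arithmetic helper works without the library.
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset("soberania/jabuticaba", split=split, streaming=True)
    return list(islice(stream, n))


# Headline figures from the abstract: 669 GB (taken as 669e9 bytes), 139e9 tokens.
ratio = bytes_per_token(669e9, 139e9)
print(f"{ratio:.2f} bytes per token")
```

Calling `peek()` needs network access and the `datasets` package; streaming keeps the 669 GB corpus off local disk while inspecting a few records.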

Posted

08/05/2025

How to Cite

Jabuticaba: The largest commercial corpus for LLMs in Portuguese. (2025). In SciELO Preprints. https://doi.org/10.1590/SciELOPreprints.12696

Section

Engineering

Data statement

  • The research data is contained in the manuscript