Jabuticaba: The largest commercial corpus for LLMs in Portuguese
DOI: https://doi.org/10.1590/SciELOPreprints.12696
Keywords: Datasets, Large Language Model, Artificial intelligence, Natural Language Processing
Abstract
Large Language Models (LLMs) are a step towards intelligent communication systems: they draw on large repositories of written human knowledge to better predict and understand the world. Artificial Intelligence sovereignty, however, depends on quality data, because datasets are the foundational infrastructure that sustains LLM development. This paper presents the Jabuticaba dataset, the most extensive Portuguese-language corpus for LLMs, with a total size of 669 GB and over 139 billion tokens of clean, deduplicated text ready for use, including commercial use. Jabuticaba is comparable in size to, and in some cases larger than, state-of-the-art (SOTA) datasets in other languages. The paper details the methodological pipeline used to build the dataset, both as a comprehensive reference for the academic and industrial research community and as a contribution to future studies. Resources are freely available at HuggingFace: https://huggingface.co/datasets/soberania/jabuticaba.
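As a rough sanity check on the two headline figures, the sketch below computes the average number of stored bytes per token implied by 669 GB and 139 billion tokens. It rests on assumptions the abstract does not settle: whether "GB" means 10^9 or 2^30 bytes, whether the size refers to compressed or uncompressed data, and which tokenizer produced the count. The function name `bytes_per_token` is illustrative only.

```python
# Figures quoted in the abstract.
TOTAL_SIZE_GB = 669      # total corpus size as reported
TOTAL_TOKENS = 139e9     # "over 139 billion" tokens, used here as a lower bound


def bytes_per_token(size_gb: float, tokens: float, bytes_per_gb: float = 1e9) -> float:
    """Average bytes of stored data per token."""
    return size_gb * bytes_per_gb / tokens


# Assumption: the abstract does not say which definition of GB it uses,
# so both readings are computed.
decimal = bytes_per_token(TOTAL_SIZE_GB, TOTAL_TOKENS)         # 1 GB = 10**9 bytes
binary = bytes_per_token(TOTAL_SIZE_GB, TOTAL_TOKENS, 2**30)   # 1 GB = 2**30 bytes

print(f"{decimal:.2f} bytes/token (decimal GB)")
print(f"{binary:.2f} bytes/token (binary GB)")
```

Because the token count is stated as "over" 139 billion, both values are upper bounds on the true average.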
Copyright (c) 2025 William Alberto Cruz Castaneda, Marcellus Amedeus, José Roberto Homeli da Silva, Rodrigo Scotti

This work is licensed under a Creative Commons Attribution 4.0 International License.
Data statement: The research data is contained in the manuscript.


