Jabuticaba: The largest commercial corpus for LLMs in Portuguese

Marcellus Amadeus; William Alberto Cruz Castaneda; José Roberto Homeli da Silva; Rodrigo Scotti

doi:10.1590/SciELOPreprints.12696

Authors:

Marcellus Amadeus SoberanIA https://orcid.org/0009-0002-7777-2562
William Alberto Cruz Castaneda SoberanIA https://orcid.org/0000-0002-9803-1387
José Roberto Homeli da Silva SoberanIA https://orcid.org/0000-0002-8825-2362
Rodrigo Scotti SoberanIA https://orcid.org/0000-0002-9937-0129

DOI:

https://doi.org/10.1590/SciELOPreprints.12696

Keywords:

Datasets, Large Language Model, Artificial Intelligence, Natural Language Processing

Abstract

Large Language Models (LLMs) are a step toward intelligent communication systems: they draw on large repositories of written human knowledge to better predict and understand the world. Artificial Intelligence sovereignty, however, depends on quality data, because datasets are the foundational infrastructure that sustains LLM development. This paper presents the Jabuticaba dataset, the most extensive Portuguese-language corpus for LLMs, with a total size of 669 GB and over 139 billion tokens of clean, deduplicated text ready for use, including commercial use. Jabuticaba is comparable in size to, and in some cases larger than, state-of-the-art (SOTA) datasets in other languages. The paper details the methodological pipeline used to build the dataset, both as a comprehensive reference for researchers in academia and industry and as a contribution to future studies. Resources are freely available on Hugging Face: https://huggingface.co/datasets/soberania/jabuticaba.