DOI of the published preprint: https://doi.org/10.25189/2675-4916.2025.v6.n4.id863
Brazilian Linguistic Diversity Platform: Linguistic data for a Brazilian AI
DOI:
https://doi.org/10.1590/SciELOPreprints.11957
Keywords:
LLM, Artificial intelligence, Linguistics, Linguistic data
Abstract
Generative artificial intelligence is based on large language models (LLMs), which are most often trained on data collected without consent or in breach of copyright. LLMs are trained with billions of words and millions of parameters, but we do not know exactly which texts are selected for training or which parameters are controlled. While unsupervised learning requires a large volume of data, with rising computational costs and energy impacts, supervised learning with structured and tagged data can optimize this process. Moreover, supervised learning with structured and tagged data from language documentation projects can contribute directly to the National Artificial Intelligence Plan: “Develop advanced language models in Portuguese, with national data that encompasses our cultural, social and linguistic diversity, to strengthen sovereignty in AI.” In Brazil, in addition to Portuguese and its varieties, there are more than 250 other languages (indigenous, immigrant, and sign languages), which are neglected in digital inclusion due to a lack of structured data. The consortium of laboratories and research groups in this INCT aims to prepare linguistic data for training LLMs that reflects Brazil's linguistic diversity. To this end, it will develop a joint protocol for collecting linguistic data in the field, to be replicated longitudinally across the groups and laboratories; establish procedures for transcribing, aligning, and labeling linguistic data to create a dataset that represents Brazilian linguistic diversity; and conduct studies on the linguistic processing of diversity to fine-tune LLMs, helping to reduce asymmetries and prejudice resulting from training LLMs with translations from English.
Copyright (c) 2025 Raquel Meister Ko Freitag, Marcia dos Santos Machado Vieira, Juliana Bertucci Barbosa, Miguel Oliveira Jr., Cleber Ataíde, Alana de Santana Correia, Amanda Post da Silveira, André Britto de Carvalho, Andréia Silva Araujo, Brayna Conceição dos Santos Cardoso, Claudia Andrea Rost Snichelotto, Eduardo Cardoso Martins, Eliabe dos Santos Procópio, Elisa Battisti, Elisângela Nogueira Teixeira, Fabiane Cristina Altino, Hadinei Ribeiro Batista, Hendrik Teixeira Macedo, Isabel de Oliveira e Silva Monguilhott, Iury Cleveston, Kendra Dickinson, Lilian Cristine Hübner, Luma da Silva Miranda, Mailce Borges Mota, Marcus Garcia de Sene, Marinete Rodrigues da Silva, Marta Deysiane Alves Faria Sousa, Monica Maria Guimarães Savedra, Pedro Ricardo Bin, Ronice Muller de Quadros, Sandro Marcío Drumond Alves Marengo, Silvana Silva de Farias Araújo, Túlio Sousa de Gois, Valéria Viana Sousa, Valter de Carvalho Dias

This work is licensed under a Creative Commons Attribution 4.0 International License.
Data statement
The research data is contained in the manuscript.


