This preprint has been published elsewhere.
DOI of the published preprint: https://doi.org/10.25189/2675-4916.2025.v6.n4.id863
Preprint / Version 1

Brazilian Linguistic Diversity Platform: Linguistic data for a Brazilian AI

DOI:

https://doi.org/10.1590/SciELOPreprints.11957

Keywords:

LLM, Artificial intelligence, Linguistics, Linguistic data

Abstract

Generative artificial intelligence is based on large language models (LLMs), which are most often trained on data collected without consent or in breach of copyright. LLMs are trained on billions of words and have millions to billions of parameters, but we do not know exactly which texts are selected for training or which parameters are controlled. Unsupervised learning requires a large volume of data, which drives up computational costs and energy consumption; supervised learning with structured and tagged data can optimize this process. Moreover, supervised learning with structured and tagged data from language documentation projects can contribute directly to the National Artificial Intelligence Plan: “Develop advanced language models in Portuguese, with national data that encompasses our cultural, social and linguistic diversity, to strengthen sovereignty in AI.” In Brazil, in addition to Portuguese and its varieties, there are more than 250 other languages (indigenous, immigrant and sign languages), which are neglected in digital inclusion due to a lack of structured data. The consortium of laboratories and research groups in this INCT aims to prepare linguistic data for training LLMs that reflects Brazil's linguistic diversity. To that end, it will develop a joint protocol for collecting linguistic data in the field, to be replicated longitudinally across the groups and laboratories, together with procedures for transcribing, aligning and labeling linguistic data, in order to create a data set that represents Brazilian linguistic diversity. It will also conduct studies on the linguistic processing of diversity to fine-tune LLMs, helping to reduce the asymmetries and prejudice that result from training LLMs with translations from English.

Posted

05/21/2025

How to Cite

Brazilian Linguistic Diversity Platform: Linguistic data for a Brazilian AI. (2025). In SciELO Preprints. https://doi.org/10.1590/SciELOPreprints.11957

Section

Linguistics, literature and arts

Data statement

  • The research data is contained in the manuscript