Desenvolvimento de um Classificador do Catálogo do Arquivo Histórico Ultramarino: Um Experimento com Processamento de Linguagem Natural e Inteligência Artificial Aplicado a Resumos Arquivísticos

Saulo Rogério Pacheco Rocha

doi:10.1590/SciELOPreprints.16461

Preprint / Version 2

Development of a Classifier for the Arquivo Histórico Ultramarino Catalog: An Experiment with Natural Language Processing and Artificial Intelligence Applied to Archival Summaries

Authors

Saulo Rogério Pacheco Rocha Universidade Federal de Santa Catarina https://orcid.org/0000-0003-3715-6706

Keywords:

Digital Humanities, Historical Sociolinguistics, Natural Language Processing, Arquivo Histórico Ultramarino

Abstract

This article describes the computational and methodological architecture of the "AHU-South Classifier" project, which aims to build a relational, semantically annotated corpus from 7,051 archival summaries from the Arquivo Histórico Ultramarino (AHU) concerning Southern and Southeastern Brazil (1737–1828), extracted from the Projeto Resgate Barão do Rio Branco. To overcome the limitations of lexical search in large unstructured datasets, a Python pipeline was developed that integrates metadata cleaning, reverse-engineering of archival codes (CRAV/DigitArq standard), and sociolinguistic inference based on Large Language Models (LLMs). Using the DeepSeek v3 API under strict zero-shot prompting constraints, the tool evaluates each archival summary to infer social categories, communication vectors, and the likelihood of scribe mediation. These inferences are synthesized into the Score of Potential Sociolinguistic Relevance (SRSP), a new quantitative metric designed as a heuristic indicator that guides researchers toward manuscripts more likely to contain syntactic innovations of colonial Brazilian Portuguese. The study also details the semantic vectorization of the summaries used to implement a hybrid search engine (Ensemble Retrieval), which supports both queries for specific lexical terms and broader, context-based semantic searches. In line with Open Science principles, the work makes available the complete dataset, the source code, and a public interactive search interface. The ultimate goal is to demonstrate how combining Digital Humanities, Data Science, and Natural Language Processing can substantially improve document selection for research in Diachronic Linguistics and Corpus Linguistics.
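
As a rough illustration of the hybrid search idea mentioned in the abstract, the sketch below merges one lexical ranking and one semantic ranking into a single result list. The abstract does not say how the two rankings are combined, so reciprocal rank fusion is used here purely as an assumption; it is one common option, not necessarily the author's method. The function name, the constant k=60, and the document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    the contributions are summed and documents are sorted by the total.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from two independent rankers
# (the ids are placeholders, not real AHU catalog codes).
lexical_hits = ["doc3", "doc1", "doc7"]   # e.g. exact term matching
semantic_hits = ["doc1", "doc5", "doc3"]  # e.g. embedding similarity

fused = reciprocal_rank_fusion([lexical_hits, semantic_hits])
print(fused)
```

Documents that appear in both lists (here doc1 and doc3) rise to the top, which is the behaviour a hybrid engine is meant to provide: exact term matches and contextual matches reinforce each other instead of competing.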

PDF (Portuguese)

Submitted

06/09/2026

Posted

06/22/2026 — Updated on 06/22/2026

Versions

06/22/2026 (2)
06/22/2026 (1)

How to Cite

Development of a Classifier for the Arquivo Histórico Ultramarino Catalog: An Experiment with Natural Language Processing and Artificial Intelligence Applied to Archival Summaries. (2026). In SciELO Preprints. https://doi.org/10.1590/SciELOPreprints.16461

Section

Linguistics, Literature and Arts

This work is licensed under a Creative Commons Attribution 4.0 International License.

Funding data

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Grant numbers 001

Version justification

Correction of a quantitative inaccuracy in the Resumo (the Portuguese abstract) and in the Abstract. The previous version erroneously stated that the corpus consisted of "71,000 summaries", confusing the size of the original raw database (71,000 lines of text) with the scope of the methodological selection. The text has been corrected to clarify that the final corpus analyzed consists of 7,051 document summaries.

Data statement

The research data is available on request; the conditions are justified in the manuscript
The research data is available in one or more data repositories