Development of a Classifier for the Arquivo Histórico Ultramarino Catalog: An Experiment with Natural Language Processing and Artificial Intelligence Applied to Archival Summaries
DOI:
https://doi.org/10.1590/SciELOPreprints.16461
Keywords:
Digital Humanities, Historical Sociolinguistics, Natural Language Processing, Arquivo Histórico Ultramarino
Abstract
This article describes the computational and methodological architecture of the "AHU-South Classifier" project, which aims to build a relational, semantically annotated corpus from approximately 7,051 archival summaries held by the Arquivo Histórico Ultramarino (AHU) concerning Southern and Southeastern Brazil (1737–1828), extracted from the Projeto Resgate Barão do Rio Branco. To overcome the limitations of lexical search over large unstructured datasets, a Python pipeline was developed that integrates metadata cleaning, reverse-engineering of archival codes (CRAV/DigitArq standard), and sociolinguistic inference based on Large Language Models (LLMs). Using the DeepSeek v3 API under strict zero-shot prompting constraints, the tool evaluates each archival summary to infer social categories, communication vectors, and the likelihood of scribe mediation. These inferences are combined into the Score of Potential Sociolinguistic Relevance (SRSP), a new quantitative metric intended as a heuristic indicator that guides researchers toward manuscripts more likely to contain syntactic innovations of colonial Brazilian Portuguese. The study also details the semantic vectorization of the summaries used to implement a hybrid search engine (Ensemble Retrieval), which supports both queries for specific lexical terms and broader semantic, context-based searches. In line with Open Science principles, this work makes available the complete dataset, the source code, and a public interactive search interface. The ultimate aim is to show how combining Digital Humanities, Data Science, and Natural Language Processing can substantially improve document selection for research in Diachronic Linguistics and Corpus Linguistics.
Versions
- 06/22/2026 (2)
- 06/22/2026 (1)
Copyright (c) 2026 Saulo Rogério Pacheco Rocha

This work is licensed under a Creative Commons Attribution 4.0 International License.
Funding data
- Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, grant number 001
Data statement
- The research data is available on demand, condition justified in the manuscript
- The research data is available in one or more data repository(ies)


