This is an outdated version published on 06/22/2026. Read the most recent version.

Preprint / Version 1

Development of a Classifier for the Arquivo Histórico Ultramarino Catalog: An Experiment with Natural Language Processing and Artificial Intelligence Applied to Archival Summaries

Authors

Saulo Rogério Pacheco Rocha, Universidade Federal de Santa Catarina, https://orcid.org/0000-0003-3715-6706

DOI:

https://doi.org/10.1590/SciELOPreprints.16461

Keywords:

Digital Humanities, Historical Sociolinguistics, Natural Language Processing, Arquivo Histórico Ultramarino

Abstract

This article describes the computational and methodological architecture of the "AHU-South Classifier" project, which aims to build a relational, semantically annotated corpus from approximately 71,000 archival summaries held by the Arquivo Histórico Ultramarino (AHU) concerning Southern and Southeastern Brazil (1737–1828), extracted from the Projeto Resgate Barão do Rio Branco. To overcome the limitations of lexical search in large unstructured datasets, a Python pipeline was developed that integrates metadata cleaning, reverse-engineering of archival reference codes (CRAV/DigitArq standard), and sociolinguistic inference based on Large Language Models (LLMs). Using the DeepSeek v3 API under strict zero-shot prompting constraints, the tool evaluates each archival summary to infer social categories, communication vectors, and the likelihood of scribe mediation. These inferences are combined into the Score of Potential Sociolinguistic Relevance (SRSP), a new quantitative metric intended as a heuristic indicator that guides researchers toward manuscripts more likely to contain syntactic innovations of colonial Brazilian Portuguese. The study also details the semantic vectorization of the summaries used to implement a hybrid search engine (Ensemble Retrieval), which supports both queries for specific lexical terms and broader semantic, context-based searches. In line with Open Science principles, the work makes available the complete dataset, the source code, and a public interactive search interface. The ultimate aim is to demonstrate how the combination of Digital Humanities, Data Science, and Natural Language Processing can substantially improve document selection for research in Diachronic Linguistics and Corpus Linguistics.
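To make the idea of hybrid (ensemble) retrieval concrete, the sketch below fuses a lexical ranking and a semantic ranking of summaries into a single list. It is only an illustration under stated assumptions: the abstract does not say how the two result sets are combined, so weighted reciprocal rank fusion is used here as one common choice, and the function name, the document IDs, and the constant `k=60` are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Assumption: weighted reciprocal rank fusion stands in for whatever
    combination rule the project's hybrid search actually uses.
    Each document receives weight / (k + rank) from every list it
    appears in; documents are returned by descending total score.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("one weight is required per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query: a lexical (keyword) ranking
# and a semantic (embedding similarity) ranking of summary IDs.
lexical_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([lexical_hits, semantic_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are still kept, which is the behavior a hybrid search over lexical terms and semantic context needs. Adjusting `weights` would shift the balance between the two retrievers.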

PDF (Portuguese)

Submitted

06/09/2026

Posted

06/22/2026

Versions

06/22/2026 (2)
06/22/2026 (1)

How to Cite

Rocha, S. R. P. (2026). Development of a Classifier for the Arquivo Histórico Ultramarino Catalog: An Experiment with Natural Language Processing and Artificial Intelligence Applied to Archival Summaries. In SciELO Preprints. https://doi.org/10.1590/SciELOPreprints.16461