Transkribus e o Modelo de ATR Early Portuguese Printing: Inovações na Transcrição de Documentos Históricos e suas Potencialidades para as Humanidades Digitais

Saulo Rogério Pacheco Rocha

doi:10.1590/SciELOPreprints.13650

Preprint / Version 2

Transkribus and the Early Portuguese Printing OCR Model: Innovations in the Transcription of Historical Documents and their Potential for the Digital Humanities

Authors

Saulo Rogério Pacheco Rocha Universidade Federal de Santa Catarina https://orcid.org/0000-0003-3715-6706
- Writing – Original Draft Preparation

DOI:

https://doi.org/10.1590/SciELOPreprints.13650

Keywords:

Digital Humanities, Historical Linguistics, OCR Transcription, Philology

Abstract

This paper presents "Early Portuguese Printing" (EPP), an Optical Character Recognition (OCR) model developed on the Transkribus platform, and discusses the potential, challenges, and history of such tools for research on Brazilian historical documents. Transkribus, maintained by the European cooperative Read-Coop, lets researchers train specialized AI models on the characteristics of their own corpora. The EPP model was trained specifically to transcribe Portuguese-language printed materials from the 16th to the 19th centuries, using a corpus of grammars and linguistic works from the period. With a training set of 142,606 words (745 pages), EPP achieved a Character Error Rate (CER) of 2.58%. This result is a significant advance: it shows that such tools can help build large-scale historical corpora for quantitative research in less time while remaining accurate on diacritics, typographical symbols, and Greek characters, elements that often limit the effectiveness of general-purpose OCR tools. Beyond publicizing the tool's potential, the paper also critically examines its nature. Because Transkribus belongs to a private European entity and is offered as Software as a Service (SaaS), its use raises questions about data centralization and about the sustainability of its application in large-scale Brazilian research projects, given the volume of Brazil's historical archives and their long-term future.
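For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between a reference transcription and the model output, divided by the length of the reference. The sketch below illustrates that common definition only; the abstract does not state how Transkribus computes the reported 2.58%, so the function names, the NFC normalization step, and the sample strings are all assumptions made for illustration.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length.

    Both strings are normalized to NFC first (an assumption of this sketch),
    so that a precomposed letter such as U+00E7 counts as one character.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: an OCR output that drops two diacritics.
print(f"{cer('coração', 'coracao'):.2%}")
```

Under this definition, a model that loses diacritics is penalized one error per affected character, which is why accuracy on accented letters and special symbols weighs directly on the reported rate.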

PDF (Portuguese)

Submitted

10/03/2025

Posted

12/01/2025 — Updated on 01/27/2026

Versions

01/27/2026 (2)
12/01/2025 (1)

How to Cite

Transkribus and the Early Portuguese Printing OCR Model: Innovations in the Transcription of Historical Documents and their Potential for the Digital Humanities. (2026). In SciELO Preprints. https://doi.org/10.1590/SciELOPreprints.13650 (Original work published 2025)

Section

Linguistic, literature and arts

This work is licensed under a Creative Commons Attribution 4.0 International License.

Version justification

Terminological refinement in the table headers (Section 3.1) and in the Introduction; replaced the code point of the Greek letter epsilon (corrected to U+03B5), removing the earlier confusion with the homoglyph Latin small letter open E (U+025B); rewrote and better substantiated footnotes 15 and 16; corrected some typos.
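The two code points named in this note render almost identically but are distinct characters, which matters when transcriptions are searched or compared. A minimal sketch of how the pair can be told apart with Python's standard `unicodedata` module follows; the helper names and the sample string are hypothetical and are not taken from the manuscript.

```python
import unicodedata

GREEK_EPSILON = "\u03b5"  # GREEK SMALL LETTER EPSILON
LATIN_OPEN_E = "\u025b"   # LATIN SMALL LETTER OPEN E (visual look-alike)


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def find_open_e(text: str) -> list:
    """Return the indices at which the Latin open E occurs in the text."""
    return [i for i, ch in enumerate(text) if ch == LATIN_OPEN_E]


print(describe(GREEK_EPSILON))
print(describe(LATIN_OPEN_E))

# Hypothetical Greek word typed with the wrong (Latin) character in position 1.
sample = "\u03bb\u025b\u03b3\u03c9"
print(find_open_e(sample))
```

Reporting positions rather than replacing blindly is a deliberate choice in this sketch: U+025B is a legitimate character in phonetic transcription, so each hit is better reviewed in context before being changed to U+03B5.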

Data statement

The research data is contained in the manuscript