Preprint / Version 2

Transkribus and the Early Portuguese Printing OCR Model: Innovations in the Transcription of Historical Documents and their Potential for the Digital Humanities

DOI:

https://doi.org/10.1590/SciELOPreprints.13650

Keywords:

Digital Humanities, Historical Linguistics, OCR Transcription, Philology

Abstract

This paper presents the "Early Portuguese Printing" (EPP) Optical Character Recognition (OCR) model, developed on the Transkribus platform, and discusses the potential, challenges, and history of such tools for research on Brazilian historical documents. Transkribus, maintained by the European cooperative Read-Coop, lets researchers train specialized AI models on the characteristics of their own corpora. The EPP model was trained specifically to transcribe Portuguese-language printed materials from the 16th to the 19th centuries, using a corpus of grammars and linguistic works from the period. With a training set of 142,606 words (745 pages), the EPP achieved a Character Error Rate (CER) of 2.58%. This result is a significant advance: it shows that such tools can shorten the time needed to build large-scale historical corpora for quantitative research while remaining accurate on diacritics, typographical symbols, and Greek characters, elements that often limit the effectiveness of general-purpose OCR tools. Beyond publicizing the tool's potential, the paper also examines its nature critically. Because Transkribus is a SaaS product owned by a private European entity, its use raises questions about data centralization and about the sustainability of its application in large-scale Brazilian research projects, given the volume of Brazil's historical archives and their long-term future.
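For readers unfamiliar with the metric, the sketch below illustrates how a Character Error Rate is commonly computed: the character-level edit distance between a reference transcription and the OCR output, divided by the length of the reference. This is the usual textbook definition, offered as an assumption; the abstract does not state how Transkribus computes the reported 2.58%, and the function names `levenshtein` and `cer` are illustrative only.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hyp` into `ref`."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over reference length
    (assumed definition, not taken from the paper)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: an OCR output that drops both diacritics.
reference = "a\u00e7\u00e3o"   # "ação"
ocr_output = "acao"
print(f"CER = {cer(reference, ocr_output):.2%}")
```

Under this definition, every lost diacritic counts as one substitution, which is why accuracy on accented characters weighs directly on the score.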

Posted

December 1, 2025 (updated on January 27, 2026)

How to Cite

Transkribus and the Early Portuguese Printing OCR Model: Innovations in the Transcription of Historical Documents and their Potential for the Digital Humanities. (2026). In SciELO Preprints. https://doi.org/10.1590/SciELOPreprints.13650 (Original work published 2025)

Section

Linguistic, literature and arts

Version justification

Terminological refinement in the table headers (Section 3.1) and in the Introduction; replacement of the code point for the Greek letter epsilon (corrected to U+03B5), removing the earlier confusion with the homoglyph LATIN SMALL LETTER OPEN E (U+025B); rewriting and better substantiation of footnotes 15 and 16; correction of some typos.
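The two code points named in this note render almost identically but are distinct characters. As a minimal sketch using only Python's standard `unicodedata` module, the snippet below shows how they can be told apart and how stray occurrences of the Latin homoglyph could be located in a transcription. The helper names `describe` and `find_open_e` and the sample string are hypothetical and do not come from the paper.

```python
import unicodedata

GREEK_EPSILON = "\u03b5"  # GREEK SMALL LETTER EPSILON
LATIN_OPEN_E = "\u025b"   # LATIN SMALL LETTER OPEN E (the homoglyph)


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def find_open_e(text: str) -> list[int]:
    """Indices where the Latin open E appears in `text`."""
    return [i for i, ch in enumerate(text) if ch == LATIN_OPEN_E]


print(describe(GREEK_EPSILON))
print(describe(LATIN_OPEN_E))

# Unicode normalization does not merge the two characters, so the
# distinction has to be handled explicitly when cleaning transcriptions.
assert unicodedata.normalize("NFKC", LATIN_OPEN_E) != GREEK_EPSILON

# Hypothetical sample: a Greek word typed with the wrong "epsilon".
sample = "\u03bb\u025b\u03be\u03b9\u03c2"
print(find_open_e(sample))
```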

Data statement

  • The research data is contained in the manuscript