Transkribus and the Early Portuguese Printing OCR Model: Innovations in the Transcription of Historical Documents and their Potential for the Digital Humanities
DOI:
https://doi.org/10.1590/SciELOPreprints.13650
Keywords:
Digital Humanities, Historical Linguistics, OCR Transcription, Philology
Abstract
This paper presents the "Early Portuguese Printing" (EPP) Optical Character Recognition (OCR) model, developed on the Transkribus platform, and discusses the potential, challenges, and history of such tools for research with Brazilian historical documents. Transkribus, maintained by the European cooperative Read-Coop, allows researchers to train specialized AI models on the characteristics of their own corpora. The EPP model was trained specifically to transcribe printed materials in Portuguese from the 16th to the 19th centuries, using a corpus of grammars and linguistic works from the period. With a training set of 142,606 words (745 pages), the EPP model achieved a Character Error Rate (CER) of just 2.58%. This result is a significant advance: it shows that such tools can produce large-scale quantitative historical corpora in less time while remaining accurate in the transcription of diacritics, typographical symbols, and Greek characters, elements that often limit the effectiveness of general-purpose OCR tools. Beyond publicizing the tool's potential, however, this paper also problematizes its nature. Because Transkribus belongs to a private European entity and is a software-as-a-service (SaaS) product, its use raises questions about data centralization and about the sustainability of its application in large-scale Brazilian research projects, given the volume of our historical archives and their long-term future.
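For readers unfamiliar with the metric, CER is conventionally defined as the character-level edit distance between a reference transcription and the model output, divided by the length of the reference. On that definition, a CER of 2.58% corresponds to roughly 26 erroneous characters per 1,000. The Python sketch below illustrates this standard definition only; it is not the Transkribus implementation, whose exact procedure the abstract does not describe. The function names `levenshtein` and `cer` and the choice to apply NFC normalization before comparison are assumptions made for illustration.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into the empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length.

    Assumption: both strings are NFC-normalized first, so that a precomposed
    'ç' and 'c' followed by a combining cedilla count as the same character.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Two lost diacritics in a seven-character word.
    print(cer("coração", "coracao"))
```

The example also shows why diacritics matter for this metric: every dropped cedilla or tilde counts as a full character error, so a general-purpose OCR tool that strips them is penalized on each affected character.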
Versions
- 01/27/2026 (2)
- 12/01/2025 (1)
Copyright (c) 2025 Saulo Rogério Pacheco Rocha

This work is licensed under a Creative Commons Attribution 4.0 International License.
Data statement
- The research data is contained in the manuscript


