Dataset for the competition on Post-OCR Text Correction 2019 (Post-OCR 2019)
Text Correction, OCR
Dataset page on Zenodo: https://zenodo.org/record/3515403
The proposed dataset comprises 22M OCR-ed characters (754,025 tokens) along with the corresponding ground truth, unevenly distributed across 10 European languages.
The digitized documents come from various collections held, among others, by national libraries and universities. The corresponding ground truth (GT) comes from initiatives such as HIMANIS, IMPACT, IMPRESSO, the open data of the National Library of Finland, 4HistOCR and RECEIPT.
Degraded documents sometimes yield highly noisy OCR output that cannot reasonably be fully aligned with the GT. Such unaligned sequences were excluded from the statistics presented here (e.g. number of characters and error rates).
Error rates vary with the nature and the state of degradation of the documents. Historical books, for example, due to their complex layouts and original fonts, have been reported to be especially challenging for OCR engines, with up to 50% of characters wrongly recognized in some documents.
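As a minimal illustration (not part of the dataset or competition tooling), the character error rates mentioned above are conventionally computed as the Levenshtein edit distance between an OCR output and its aligned GT, normalized by the GT length. A sketch in plain Python:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    # (insertions, deletions, substitutions all cost 1).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ocr: str, gt: str) -> float:
    # Character error rate: edit distance normalized by GT length.
    return levenshtein(ocr, gt) / len(gt) if gt else 0.0
```

For example, `cer("Tbe qvick", "The quick")` gives 2/9 ≈ 0.22, i.e. two of nine GT characters were misrecognized. A CER of 0.5 corresponds to the 50% figure reported for the most degraded historical books.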