Dataset for the competition on Post-OCR Text Correction 2019 (Post-OCR 2019) (Post-OCR_2019)

2019-10-20 (v. 1)

Contact author

Christophe Rigaud

L3i - University of La Rochelle

christophe.rigaud@univ-lr.fr

+33 5 46 45 82 62

+33 5 46 45 82 42

You can cite this dataset as: Christophe Rigaud, Dataset for the competition on Post-OCR Text Correction 2019 (Post-OCR 2019) (Post-OCR_2019), v. 1, ID: Post-OCR_2019_1, URL: https://tc11.cvc.uab.es/datasets/Post-OCR_2019_1

Dataset Information

Dataset URL

https://sites.google.com/view/icdar2019-postcorrectionocr/dataset

Keywords

Text Correction, OCR

Description

Dataset page on Zenodo: https://zenodo.org/record/3515403

The dataset comprises 22M OCR-ed characters (754 025 tokens) along with the corresponding ground truth (GT), unevenly distributed across 10 European languages.

The digitized documents come from various collections held by national libraries and universities, among others. The corresponding GT comes from initiatives such as HIMANIS, IMPACT, IMPRESSO, Open data of National Library of Finland, 4HistOCR and RECEIPT.

Degraded documents sometimes result in highly noisy OCR output and thus cannot reasonably be fully aligned with their GT. The unaligned sequences have not been included in the presented statistics (e.g. number of characters and error rates).

Error rates vary with the nature and state of degradation of the documents. Historical books, for example, have been reported to be especially challenging for OCR engines because of their complex layouts and original fonts, with up to 50% of characters misrecognized in some documents.
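
To make the notion of a character error rate over aligned text more concrete, the following is a minimal sketch, not the competition's official evaluation code. It assumes that the OCR output and the GT are given as two strings of equal length after alignment, and that spans of the GT that could not be aligned are filled with a marker character. The marker "#" and the function name char_error_rate are hypothetical choices made for this illustration; check the dataset documentation for the actual file format.

```python
# Hypothetical marker for GT positions that could not be aligned with the OCR output.
IGNORE = "#"


def char_error_rate(ocr_aligned: str, gt_aligned: str) -> float:
    """Return the share of mismatching characters over the aligned positions.

    Positions where the GT carries the IGNORE marker are skipped, so
    unaligned sequences do not contribute to the statistic.
    """
    if len(ocr_aligned) != len(gt_aligned):
        raise ValueError("aligned strings must have the same length")

    errors = 0
    counted = 0
    for ocr_char, gt_char in zip(ocr_aligned, gt_aligned):
        if gt_char == IGNORE:
            continue  # unaligned span: excluded from the statistics
        counted += 1
        if ocr_char != gt_char:
            errors += 1

    # An entirely unaligned pair yields no countable positions.
    return errors / counted if counted else 0.0


if __name__ == "__main__":
    ocr = "Tbe qu1ck brown fox"
    gt = "The quick brown ###"
    # 2 mismatches over 16 counted positions
    print(f"{char_error_rate(ocr, gt):.3f}")
```

In this toy example the last three characters are treated as unaligned and ignored, so the rate is computed over the remaining 16 positions only.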
