Dataset for the competition on Post-OCR Text Correction 2019 (Post-OCR 2019) (Post-OCR_2019)

Ground Truth

Transcription for the competition on Post-OCR Text Correction 2019

2019-10-20 (v. 1)

Contact author

Christophe Rigaud

L3i - University of La Rochelle

christophe.rigaud@univ-lr.fr

+33 5 46 45 82 62

+33 5 46 45 82 42

Keywords

transcription

Description

The dataset comprises 22M OCR-ed characters (754 025 tokens) together with the corresponding ground truth (GT), unevenly distributed across 10 European languages.

The digitized documents come from various collections held by, among others, national libraries and universities. The corresponding GT comes from initiatives such as HIMANIS, IMPACT, IMPRESSO, Open data of National Library of Finland, 4HistOCR and RECEIPT.

Degraded documents sometimes result in highly noisy OCR output and thus cannot reasonably be fully aligned with their GT. The unaligned sequences have not been included in the presented statistics (e.g. number of characters and error rates).

Error rates vary with the nature and state of degradation of the documents. Historical books, for example, have been reported to be especially challenging for OCR engines because of their complex layouts and original fonts, with up to 50% of characters wrongly recognized in some documents.
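
For readers who want to reproduce this kind of figure on their own aligned text pairs, a character error rate can be sketched as the edit distance between the OCR output and the GT, divided by the GT length. This is a minimal illustration under that assumption, not the competition's official metric; the function names and the sample strings below are hypothetical and do not come from the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, gt_text: str) -> float:
    """Edit distance normalized by the ground-truth length (assumed definition)."""
    if not gt_text:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, gt_text) / len(gt_text)


if __name__ == "__main__":
    # Hypothetical pair: two characters misread as the digit 1.
    print(character_error_rate("he1lo wor1d", "hello world"))
```

Sequences that could not be aligned with their GT would have to be filtered out before calling such a function, in line with how the statistics above were computed.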
