Dataset for the competition on Post-OCR Text Correction 2017 (Post-OCR 2017)

Ground Truth

Transcription for the competition on Post-OCR Text Correction 2017

2019-05-28 (v. 1)

Contact author

Guillaume Chiron

L3i - University of La Rochelle

guillaume.chiron@univ-lr.fr

+33 5 46 45 82 62

+33 5 46 45 82 42

Keywords

transcription

Description

The dataset was built within the AMÉLIOCR project on OCR post-correction. It comprises 12M OCR-ed characters along with the corresponding Ground Truth (GT), split equally between English and French documents (see Table I). The documents come from several digital collections held by, among others, the National Library of France (BnF) and the British Library (BL). The GT comes both from BnF’s internal projects and from external initiatives such as Gutenberg, Europeana Newspapers, IMPACT and Wikisource.

Degraded documents sometimes yield highly noisy OCR output that cannot reasonably be fully aligned with its GT. These unaligned sequences are excluded from the reported statistics (e.g. number of characters and error rates). Error rates vary with the nature and state of degradation of the documents. Historical newspapers, for example, have been reported to be especially challenging for OCR engines because of their complex layouts and original fonts, with up to 10% of characters wrongly detected in some documents.
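
As an illustration of the kind of error rate mentioned above, the sketch below computes a character error rate as the Levenshtein (edit) distance between an OCR string and its GT, divided by the GT length. This is only one common definition and is an assumption here; the description does not state how the dataset's own statistics were computed. The function names and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr: str, gt: str) -> float:
    """Edit distance normalised by GT length (assumed definition)."""
    if not gt:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr, gt) / len(gt)


if __name__ == "__main__":
    # Hypothetical sample: one wrong character out of seven.
    print(f"{char_error_rate('tbe cat', 'the cat'):.2%}")  # 14.29%
```

With this definition, a document in which 10% of characters are wrongly detected would score roughly 0.10, assuming the errors are isolated substitutions.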
