Dataset for the competition on Post-OCR Text Correction 2017 (Post-OCR 2017)
Transcription for the competition on Post-OCR Text Correction 2017
The proposed dataset was built within the AMÉLIOCR project on OCR post-correction. It comprises 12M OCR-ed characters along with the corresponding Ground Truth (GT), split equally between English and French documents (see Table I). The documents come from various digital collections held by, among others, the National Library of France (BnF) and the British Library (BL). The corresponding GT comes both from BnF's internal projects and from external initiatives such as Gutenberg, Europeana Newspapers, IMPACT and Wikisource.
Degraded documents sometimes yield highly noisy OCR output that cannot reasonably be fully aligned with its GT. Such unaligned sequences are excluded from the reported statistics (e.g. number of characters and error rates). Error rates vary with the nature of the documents and their state of degradation. Historical newspapers, for example, have been reported to be especially challenging for OCR engines because of their complex layouts and original fonts, with up to 10% of characters wrongly detected in some documents.
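As an illustration of how an error rate over aligned sequences might be computed, the sketch below measures a character error rate as the edit distance between OCR output and GT, divided by the number of GT characters, and skips unaligned sequences. This definition, the function names and the (ocr, gt) pair representation with None marking an unaligned sequence are all assumptions made for the example; they do not describe the dataset's actual file format or the procedure used to produce the reported statistics.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(pairs: Iterable[Tuple[str, Optional[str]]]) -> float:
    """Error rate over aligned (ocr, gt) pairs.

    Assumption for this sketch: a pair whose gt is None stands for a
    sequence that could not be aligned, and is left out of the statistics.
    """
    errors = 0
    gt_chars = 0
    for ocr, gt in pairs:
        if gt is None:
            continue  # unaligned sequence: excluded
        errors += levenshtein(ocr, gt)
        gt_chars += len(gt)
    return errors / gt_chars if gt_chars else 0.0


if __name__ == "__main__":
    sample = [
        ("Tbe cat", "The cat"),  # one substituted character
        ("x#~ q", None),         # too noisy to align: ignored
    ]
    print(f"CER: {character_error_rate(sample):.3f}")
```

Pooling errors and characters across all pairs, rather than averaging per-document rates, keeps long documents from being under-weighted; either choice is defensible depending on what is being compared.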