Dataset for the competition on Post-OCR Text Correction 2019 (Post-OCR 2019) (Post-OCR_2019)

Research Tasks

Correction of OCR errors

2019-10-20 (v. 1)

Contact author

Christophe Rigaud

L3i - University of La Rochelle

christophe.rigaud@univ-lr.fr

+33 5 46 45 82 62

+33 5 46 45 82 42

Description

Given the OCR errors in their context, participants are asked to provide, for each error, either (a) a single correction or (b) a ranked list of correction candidates. Providing multiple candidates enables the evaluation of semi-automated techniques. We therefore take into account and evaluate two families of systems (a possible candidate representation is sketched after the list below):

  1. "Fully-automated" systems, meant for the comparative evaluation of fully automatic OCR correction tools, where we only take into account one correction candidate;
  2. "Semi-automated" systems, meant for the comparative evaluation of human-assisted correction tools, where a person typically picks the right correction within a list of system-generated candidate corrections (in that case, the higher-ranked the right correction, the better the system).

Protocol

As mentioned earlier, the correction task involves a list of candidate words for each error and will be evaluated under two different scenarios:

  • "fully automated" scenario, taking into consideration only the highest-weighted word in each list;
  • "semi-automated" scenario, exploited all the proposed corrections along with their weight.

The chosen metric computes, for every token, a weighted sum of the Levenshtein distances between the correction candidates and the corresponding token in the Ground Truth. The goal is therefore to minimize that sum over all tokens.
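
The Python sketch below gives a minimal rendition of this scoring idea, assuming that the weights of the candidates proposed for a token sum to 1; the exact weighting and normalisation applied by the organisers are not specified here.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def token_score(candidates, gt_token):
    """Weighted sum of distances between the candidates proposed for one
    erroneous token and its Ground Truth form; 0 means a perfect correction."""
    return sum(weight * levenshtein(cand, gt_token) for cand, weight in candidates)

def corpus_score(submission, ground_truth):
    """Sum of per-token scores over all detected errors; lower is better."""
    return sum(token_score(submission[err_id], gt_token)
               for err_id, gt_token in ground_truth.items())

# Illustrative usage with made-up identifiers and tokens.
submission = {"err_0": [("tournament", 0.5), ("torment", 0.5)]}
ground_truth = {"err_0": "tournament"}
print(corpus_score(submission, ground_truth))  # 0.5 * 0 + 0.5 * 3 = 1.5

In the fully-automated scenario, each candidate list reduces to its single highest-weighted entry, so the same computation applies to both families of systems.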
