Dataset for the competition on Post-OCR Text Correction 2019 (Post-OCR 2019)
Correction of OCR errors
Given the OCR errors in their context, the participants are asked to provide, for each error, either (a) a single correction or (b) a ranked list of correction candidates. Providing multiple candidates enables the evaluation of semi-automated techniques. We will thus take into account and evaluate two families of systems:
- "Fully-automated" systems, meant for the comparative evaluation of fully automatic OCR correction tools, where we only take into account one correction candidate;
- "Semi-automated" systems, meant for the comparative evaluation of human-assisted correction tools, where a person typically picks the right correction within a list of system-generated candidate corrections (in that case, the higher-ranked the right correction, the better the system).
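To make the distinction concrete, here is a minimal Python sketch of how a submission with ranked, weighted candidates might be represented, and how the fully-automated view is derived from it. The data structure and error identifiers are illustrative assumptions, not the official competition file format.

```python
# Hypothetical submission: each detected OCR error (identified here by a
# made-up "offset:length" key) maps to a ranked list of
# (candidate_correction, confidence_weight) pairs.
submission = {
    "128:5": [("paris", 0.7), ("parts", 0.2), ("pairs", 0.1)],
    "410:3": [("the", 1.0)],
}

# A fully-automated run keeps only the top-ranked candidate per error;
# a semi-automated run would be scored on the whole weighted list.
fully_automated = {pos: cands[0][0] for pos, cands in submission.items()}
```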
As mentioned earlier, the correction task involves a list of candidate words for each error and is evaluated under two different scenarios:
- "fully automated" scenario, taking into consideration only the highest-weighted word in each list;
- "semi-automated" scenario, exploiting all the proposed corrections along with their weights.
The chosen metric computes, for every token, a weighted sum of the Levenshtein distances between the correction candidates and the corresponding token in the ground truth. The goal is therefore to minimize that distance summed over all tokens.
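The per-token cost described above can be sketched as follows. This is an illustrative reconstruction of the idea, not the official evaluation script; the function names are my own, and the weights are assumed to be normalized to sum to 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]


def token_cost(candidates, gt_token):
    """Weighted sum of edit distances to the ground-truth token.

    `candidates` is a list of (correction, weight) pairs. In the
    fully-automated scenario the list holds a single candidate with
    weight 1.0; in the semi-automated scenario all candidates count.
    """
    return sum(w * levenshtein(c, gt_token) for c, w in candidates)
```

A system's total score would then be the sum of `token_cost` over all erroneous tokens, with lower being better; a perfect correction contributes 0.
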