ICFHR 2016 Competition on Recognition of On-line Handwritten Mathematical Expressions (ICFHR-CROHME-2016)
Dataset Information
Keywords
Online Handwriting Recognition, Mathematical Expression Recognition, Isolated Symbol Recognition
Description
The dataset provides more than 12,000 expressions handwritten by hundreds of writers from different countries, merging the data sets from the four previous CROHME competitions and adding new resources. Writers were asked to copy printed expressions from a corpus of expressions. The corpus was designed to cover the diversity required by the different tasks, and was drawn from an existing math corpus and from expressions embedded in Wikipedia pages. Different devices were used (different digital pen technologies, a white-board input device, tablets with touch-sensitive screens), so the data comes at different scales and resolutions. The dataset provides only the on-line signal.
The training and validation sets reuse the CROHME 2014 datasets, while the CROHME 2016 test dataset contains 1147 new on-line handwritten expressions. Furthermore, this new package provides updated versions of the tools and new math resources: more than 500K LaTeX expressions from Wikipedia for training language models.
Together, these resources make it possible to address four different tasks: formula recognition, structure recognition, isolated symbol recognition, and matrix recognition.
Technical Details
The ink corresponding to each expression is stored in an InkML file. An InkML file mainly contains three kinds of information:
- The ink: a set of traces made of points;
- The symbol level ground truth: the segmentation and label information of each symbol in the expression;
- The expression level ground truth: the MathML structure of the expression.
The two levels of ground truth information (at the symbol as well as at the expression level) are entered manually. Furthermore, some general information is added in the file:
- The channels (here, X and Y);
- The writer information (identification, handedness (left/right), age, gender, etc.), if available;
- The LaTeX ground truth (without any reference to the ink and hence, easy to render);
- The unique identification code of the ink (UI), etc.
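As an illustration of the items listed above, the following minimal sketch reads the channels, the general annotations and the traces from an InkML document using Python's standard library. The embedded sample is hand-written for this example and is not taken from the dataset; the annotation type names (`UI`, `truth`, `writer`) and the function name `read_ink` are assumptions, so check them against the actual files before relying on them.

```python
import xml.etree.ElementTree as ET

# InkML namespace as defined by the W3C InkML recommendation.
NS = {"ink": "http://www.w3.org/2003/InkML"}

# Hypothetical, hand-written sample (not a file from the dataset).
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <traceFormat>
    <channel name="X" type="decimal"/>
    <channel name="Y" type="decimal"/>
  </traceFormat>
  <annotation type="UI">example_001</annotation>
  <annotation type="truth">$x+1$</annotation>
  <annotation type="writer">w42</annotation>
  <trace id="0">10 20, 12 24, 15 30</trace>
  <trace id="1">30 25, 40 25</trace>
</ink>"""


def read_ink(xml_text):
    """Return (channels, general info, traces) from an InkML string."""
    root = ET.fromstring(xml_text)

    # Channel names, e.g. ["X", "Y"].
    channels = [c.get("name")
                for c in root.findall("ink:traceFormat/ink:channel", NS)]

    # Top-level annotations only: general information about the ink.
    info = {a.get("type"): (a.text or "").strip()
            for a in root.findall("ink:annotation", NS)}

    # Each trace is a comma-separated list of points; each point is a
    # whitespace-separated list of channel values.
    traces = {}
    for trace in root.findall("ink:trace", NS):
        points = []
        for chunk in (trace.text or "").split(","):
            values = chunk.split()
            if values:
                points.append(tuple(float(v) for v in values))
        traces[trace.get("id")] = points

    return channels, info, traces


channels, info, traces = read_ink(SAMPLE)
print(channels, info["UI"], len(traces))
```

Traces are keyed by their `id` attribute so that a segmentation step can look them up later.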
The InkML format cross-references the digital ink of the expression, its segmentation into symbols and its MathML representation. Thus, the strokes that make up a symbol can be linked to the corresponding node of the MathML representation.
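The linkage described above can be sketched as follows. The example assumes a layout in which each symbol is a nested `traceGroup` holding a label annotation, `traceView` references to its strokes, and an `annotationXML` element whose `href` points at the `xml:id` of a MathML node. This layout, the sample content and the function name `read_symbols` are assumptions made for illustration; the sample is hand-written and not taken from the dataset.

```python
import xml.etree.ElementTree as ET

NS = {
    "ink": "http://www.w3.org/2003/InkML",
    "m": "http://www.w3.org/1998/Math/MathML",
}
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, hand-written sample for the expression "x + 1".
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <annotationXML type="truth" encoding="Content-MathML">
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mrow>
        <mi xml:id="x_1">x</mi>
        <mo xml:id="op_1">+</mo>
        <mn xml:id="n_1">1</mn>
      </mrow>
    </math>
  </annotationXML>
  <trace id="0">1 1, 3 3</trace>
  <trace id="1">3 1, 1 3</trace>
  <trace id="2">5 2, 7 2</trace>
  <trace id="3">6 1, 6 3</trace>
  <trace id="4">9 1, 9 3</trace>
  <traceGroup xml:id="5">
    <annotation type="truth">Segmentation</annotation>
    <traceGroup xml:id="6">
      <annotation type="truth">x</annotation>
      <traceView traceDataRef="0"/>
      <traceView traceDataRef="1"/>
      <annotationXML href="x_1"/>
    </traceGroup>
    <traceGroup xml:id="7">
      <annotation type="truth">+</annotation>
      <traceView traceDataRef="2"/>
      <traceView traceDataRef="3"/>
      <annotationXML href="op_1"/>
    </traceGroup>
    <traceGroup xml:id="8">
      <annotation type="truth">1</annotation>
      <traceView traceDataRef="4"/>
      <annotationXML href="n_1"/>
    </traceGroup>
  </traceGroup>
</ink>"""


def read_symbols(xml_text):
    """Return one dict per symbol: label, stroke ids, linked MathML node."""
    root = ET.fromstring(xml_text)

    # Index every MathML element that carries an xml:id.
    math_prefix = "{" + NS["m"] + "}"
    math_nodes = {el.get(XML_ID): el for el in root.iter()
                  if el.get(XML_ID) and el.tag.startswith(math_prefix)}

    symbols = []
    for outer in root.findall("ink:traceGroup", NS):
        for group in outer.findall("ink:traceGroup", NS):
            label = group.find("ink:annotation", NS)
            link = group.find("ink:annotationXML", NS)
            href = link.get("href") if link is not None else None
            node = math_nodes.get(href)
            symbols.append({
                "label": label.text.strip()
                         if label is not None and label.text else None,
                "trace_ids": [v.get("traceDataRef")
                              for v in group.findall("ink:traceView", NS)],
                "mathml_id": href,
                "mathml_tag": node.tag[len(math_prefix):]
                              if node is not None else None,
            })
    return symbols


for symbol in read_symbols(SAMPLE):
    print(symbol)
```

Each resulting record ties a symbol label to the strokes that draw it and to the MathML element that represents it, which is the link the format is designed to preserve.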
The recognized expressions are the outputs of the competitors' recognition systems. They use the same InkML format, but without the ink information (only segmentation, labels and MathML structure).
The total size of the dataset is ~230 MB (unzipped).
More details are available on the CROHME website.
File | Type | Size | Downloads | Description |
---|---|---|---|---|
TC11_package2016.zip | data | 191 MB | 571 | All data, tools, corpus, articles |