ICFHR 2014 CROHME: Fourth International Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME-2014)

2015-02-16 (v. 2)

Contact author

Harold Mouchère

University of Nantes / IRCCyN


+33 2-40-68-30-82


You can cite this dataset as: Harold Mouchère, ICFHR 2014 CROHME: Fourth International Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME-2014), version 2, ID: CROHME-2014_2, URL: https://tc11.cvc.uab.es/datasets/CROHME-2014_2

Dataset Information


Online Handwriting Recognition, Mathematical Expression Recognition


The dataset provides more than 11,000 expressions handwritten by hundreds of writers from different countries, merging the data sets from 4 CROHME competitions. Writers were asked to copy printed expressions from a corpus of expressions. The corpus was designed to cover the diversity required by the different tasks and was drawn from an existing math corpus and from expressions embedded in Wikipedia pages. Different devices were used (different digital pen technologies, a whiteboard input device, a tablet with a touch-sensitive screen), so the data come in different scales and resolutions. The dataset provides only the on-line signal.

In the previous competition, CROHME 2013, the test part is completely original and the training part uses 5 existing data sets:

  1. MathBrush (University of Waterloo),
  2. HAMEX (University of Nantes),
  3. MfrDB (Czech Technical University),
  4. ExpressMatch (University of Sao Paulo),
  5. the KAIST data set.

In CROHME 2014 a new test set has been created with 987 new expressions, and 2 new tasks have been added: isolated symbol recognition and matrix recognition. Train and test files, as well as the evaluation scripts for these new tasks, are provided. For the isolated symbol datasets, elements are extracted from full expressions in the existing datasets; these datasets also include segmentation errors. For the matrix recognition task, 380 new expressions have been labelled and split into training and test sets.

Furthermore, 6 participants of the 2012 competition provided their recognized expressions for the 2012 test part. This data allows research on decision fusion or evaluation metrics...

Technical Details

The ink corresponding to each expression is stored in an InkML file. An InkML file mainly contains three kinds of information:

  1. The ink: a set of traces made of points;
  2. The symbol level ground truth: the segmentation and label information of each symbol in the expression;
  3. The expression level ground truth: the MathML structure of the expression.

The two levels of ground truth information (at the symbol as well as at the expression level) are entered manually. Furthermore, some general information is added in the file:

  • The channels (here, X and Y);
  • The writer information (identification, handedness (left/right), age, gender, etc.), if available;
  • The LaTeX ground truth (without any reference to the ink and hence, easy to render);
  • The unique identification code of the ink (UI), etc.

The InkML format makes references between the digital ink of the expression, its segmentation into symbols and its MathML representation. Thus, the stroke segmentation of a symbol can be linked to its MathML representation.
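The structure described above can be read with a standard XML parser. The following Python sketch is illustrative only: the embedded sample is a hypothetical miniature file, and the element layout it assumes (`trace` elements holding comma-separated "x y" points, nested `traceGroup` elements carrying a label `annotation`, `traceView` stroke references and an `annotationXML` link to the MathML node) should be checked against the files in the package before relying on it. The function name `parse_inkml` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C InkML format.
NS = {"ink": "http://www.w3.org/2003/InkML"}

# Hypothetical miniature file for "a+b"; the layout is an assumption.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <annotation type="UI">demo_001</annotation>
  <annotation type="truth">$a+b$</annotation>
  <trace id="0">10 20, 11 22, 12 24</trace>
  <trace id="1">30 20, 34 20</trace>
  <trace id="2">32 18, 32 22</trace>
  <trace id="3">50 10, 50 24, 54 20</trace>
  <traceGroup>
    <annotation type="truth">Segmentation</annotation>
    <traceGroup>
      <annotation type="truth">a</annotation>
      <traceView traceDataRef="0"/>
      <annotationXML href="a_1"/>
    </traceGroup>
    <traceGroup>
      <annotation type="truth">+</annotation>
      <traceView traceDataRef="1"/>
      <traceView traceDataRef="2"/>
      <annotationXML href="+_1"/>
    </traceGroup>
    <traceGroup>
      <annotation type="truth">b</annotation>
      <traceView traceDataRef="3"/>
      <annotationXML href="b_1"/>
    </traceGroup>
  </traceGroup>
</ink>"""


def parse_inkml(text):
    """Return (traces, symbols, meta) from an InkML string."""
    root = ET.fromstring(text)

    # 1. The ink: trace id -> list of (x, y) points.
    traces = {}
    for trace in root.findall("ink:trace", NS):
        points = [
            tuple(float(v) for v in point.split()[:2])
            for point in trace.text.strip().split(",")
            if point.strip()
        ]
        traces[trace.get("id")] = points

    # 2. Symbol level ground truth: (label, stroke ids, MathML reference).
    symbols = []
    for outer in root.findall("ink:traceGroup", NS):
        for group in outer.findall("ink:traceGroup", NS):
            label = group.find("ink:annotation", NS).text
            strokes = [v.get("traceDataRef")
                       for v in group.findall("ink:traceView", NS)]
            link = group.find("ink:annotationXML", NS)
            href = link.get("href") if link is not None else None
            symbols.append((label, strokes, href))

    # General information: top-level annotations keyed by type (UI, LaTeX truth).
    meta = {a.get("type"): a.text for a in root.findall("ink:annotation", NS)}
    return traces, symbols, meta


traces, symbols, meta = parse_inkml(SAMPLE)
for label, strokes, href in symbols:
    print(label, strokes, href)
```

Each symbol entry pairs a label with the ids of its strokes, so the points of a symbol can be recovered by looking those ids up in `traces`; the `href` value is assumed to match an identifier in the MathML ground truth.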

The recognized expressions are the outputs of the competitors' recognition systems. They use the same InkML format, but without the ink information (only the segmentation, labels and MathML structure).

The total size of the dataset is ~200 MB (70 MB zipped).

More details are available on the CROHME website.

TC11_package2014.zip (68 MB): Train and test datasets from the last 4 competitions, some participant result files, evaluation tools and papers.