A Synthetic Dataset for Clustering Handwritten Math Expression TUAT (Dset_Mix)

2020-07-08 (v. 1)

Vu Tran Minh Khuong

Tokyo University of Agriculture and Technology


+81 070 4445 9674

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
You can cite this dataset as: Vu Tran Minh Khuong, A Synthetic Dataset for Clustering Handwritten Math Expression TUAT (Dset_Mix), 1, ID: Dset_Mix_1, URL: http://tc11.cvc.uab.es/datasets/Dset_Mix_1

Dataset Information


Keywords: Handwritten Math Expression, Clustering, Online, Offline


This dataset, named Dset_Mix, is designed for running experiments on, and evaluating the performance of, methods for clustering handwritten mathematical expressions (HMEs). It simulates the handwritten mathematical answers (HMAs) collected in a mathematics examination (only the final answer, not the whole solution). Every pattern is provided in both online and offline form.

Dset_Mix contains 2,000 HMEs in total: 263 patterns are collected from CROHME 2016 and 1,737 patterns are synthesized from handwritten symbols that also come from CROHME 2016. The synthetic data generation method is presented in the following paper.

Khanh Minh Phan, Vu Tran Minh Khuong, Huy Quang Ung, and Masaki Nakagawa, "Generating Synthetic Handwritten Mathematical Expressions from a LaTeX Sequence or a MathML Script," Proc. 15th International Conference on Document Analysis and Recognition, pp. 922-927, Sydney, Australia, 2019.

The objective of clustering HMEs is to group identical answers together.

Technical Details

The dataset provides the patterns in two formats:

  • Online: “.inkml” files in the “Data_inkml” folder. Each “.inkml” file contains the LaTeX notation and the symbol-level ground truth of its HME. The “.inkml” format is described at https://www.isical.ac.in/~crohme/data2.html.
  • Offline: “.png” files in the “Data_img” folder. These images are rendered from the “.inkml” files.
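
As an illustration of reading the online format, the following is a minimal Python sketch that extracts the LaTeX ground truth and the pen strokes from one “.inkml” file. It assumes the CROHME conventions: elements in the http://www.w3.org/2003/InkML namespace, the LaTeX string in an annotation element with type="truth", and each trace written as comma-separated points whose first two values are x and y. These are assumptions about the file layout, not something stated in this description, and the function name parse_inkml is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: CROHME-style files declare the standard InkML namespace.
INKML_NS = {"ink": "http://www.w3.org/2003/InkML"}


def parse_inkml(content):
    """Return (latex, strokes) for one InkML document given as str or bytes.

    latex is the text of the first <annotation type="truth"> element
    (None if absent); strokes is a list of [(x, y), ...] lists, one per trace.
    """
    root = ET.fromstring(content)

    latex = None
    for annotation in root.findall("ink:annotation", INKML_NS):
        if annotation.get("type") == "truth":
            latex = (annotation.text or "").strip()
            break

    strokes = []
    for trace in root.findall("ink:trace", INKML_NS):
        points = []
        for point in (trace.text or "").strip().split(","):
            coords = point.split()
            # Keep only x and y; extra channels (e.g. time) are ignored.
            if len(coords) >= 2:
                points.append((float(coords[0]), float(coords[1])))
        strokes.append(points)

    return latex, strokes
```

To read a file from disk, pass its bytes, for example parse_inkml(open(path, "rb").read()), where path points to a file under “Data_inkml”.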

In both the “Data_inkml” and “Data_img” folders, the data is divided into 10 subgroups. Each subgroup corresponds to one question and contains 200 answers. For each question, the number of correct answer categories ranges from 1 to 3 and the number of incorrect answer categories ranges from 2 to 8. Each answer category is stored in its own folder. Folders of correct categories carry a “_correct” suffix in their names, and folders of incorrect categories carry an “_incorrect” suffix. For example, “Set1” has 2 correct answer categories and 8 incorrect ones. The correct answers are stored in the folders “0_correct” and “1_correct”. Hence, the folder structure of "Set1" is as follows.

Set1
   |___  0_correct
   |___  1_correct
   |___  2_incorrect
   |___  3_incorrect
   |___  4_incorrect
   |___  5_incorrect
   |___  6_incorrect
   |___  7_incorrect
   |___  8_incorrect
   |___  9_incorrect
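
The folder names carry the ground-truth cluster labels, so they can be read directly from the directory tree. The following Python sketch shows one way to do this for a single question folder. The helper names parse_category and index_question are hypothetical, and the exact location of “Set1” inside the extracted archive (for example “Data_inkml/Set1”) is an assumption.

```python
from pathlib import Path


def parse_category(folder_name):
    """Split a name such as '0_correct' into (category_id, is_correct)."""
    index, _, status = folder_name.partition("_")
    if status not in ("correct", "incorrect"):
        raise ValueError(f"unexpected category folder name: {folder_name!r}")
    return int(index), status == "correct"


def index_question(question_dir):
    """List (file_name, category_id, is_correct) for every file of one question.

    question_dir is a folder such as 'Set1' whose sub-folders are the
    answer categories described above.
    """
    records = []
    for category_dir in sorted(Path(question_dir).iterdir()):
        if not category_dir.is_dir():
            continue
        category_id, is_correct = parse_category(category_dir.name)
        for sample in sorted(category_dir.iterdir()):
            if sample.is_file():
                records.append((sample.name, category_id, is_correct))
    return records
```

The category_id values can serve as reference cluster labels when evaluating a clustering method, and is_correct tells whether the whole cluster would be marked as a correct answer.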

Dset_Mix.rar (8 MB): The ".inkml" files are stored in the "Data_inkml" folder. The ".png" files are stored in the "Data_img" folder.


