# A Synthetic Dataset for Clustering Handwritten Math Expression TUAT (Dset_Mix)

## Dataset Information

### Keywords

Handwritten Math Expression, Clustering, Online, Offline

### Description

This dataset, named Dset_Mix, is designed for evaluating methods that cluster handwritten mathematical expressions (HMEs). It simulates the handwritten mathematical answers (HMAs) submitted in a mathematics examination (only the final answer, not the whole solution). Each pattern is provided in both online and offline form.

Dset_Mix contains 2,000 HMEs in total: 263 patterns collected from CROHME 2016 and 1,737 patterns synthesized from handwritten symbols that also come from CROHME 2016. The synthetic data generation method is presented in the following paper:

Khanh Minh Phan, Vu Tran Minh Khuong, Huy Quang Ung, and Masaki Nakagawa, "Generating Synthetic Handwritten Mathematical Expressions from a LaTeX Sequence or a MathML Script," Proc. 15th International Conference on Document Analysis and Recognition, pp. 922-927, Sydney, Australia, 2019.

The objective of clustering the HMEs is to group identical answers together.

### Technical Details

The dataset provides two formats:

- Online: “.inkml” files in the “Data_inkml” folder. Each “.inkml” file provides the LaTeX notation and the symbol-level ground truth of its HME. The “.inkml” format is described at https://www.isical.ac.in/~crohme/data2.html.
- Offline: “.png” files in the “Data_img” folder. These images are rendered from the “.inkml” files.
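As a rough illustration of reading the online format, the sketch below extracts the LaTeX string and the pen strokes from one “.inkml” file. It assumes the files follow the usual CROHME layout: the standard InkML namespace, a top-level `<annotation type="truth">` element holding the LaTeX notation, and `<trace>` elements holding comma-separated points. The function name `parse_inkml` is hypothetical, and the assumptions should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard W3C InkML namespace, as CROHME does.
INKML_NS = {"ink": "http://www.w3.org/2003/InkML"}


def parse_inkml(source):
    """Return (latex, strokes) for an InkML file path or binary file object.

    latex   : text of the top-level "truth" annotation, or None if absent
    strokes : list of strokes, each a list of (x, y) float tuples
    """
    root = ET.parse(source).getroot()

    # Only direct children of <ink> are searched, so symbol-level
    # annotations nested inside <traceGroup> elements are ignored here.
    latex = None
    for annotation in root.findall("ink:annotation", INKML_NS):
        if annotation.get("type") == "truth":
            latex = (annotation.text or "").strip()
            break

    strokes = []
    for trace in root.findall("ink:trace", INKML_NS):
        points = []
        for point in (trace.text or "").strip().split(","):
            coords = point.split()
            # Keep x and y; any extra channels (e.g. time) are dropped.
            if len(coords) >= 2:
                points.append((float(coords[0]), float(coords[1])))
        strokes.append(points)

    return latex, strokes
```

Calling `parse_inkml` on a file path from “Data_inkml” would return the ground-truth LaTeX and a list of strokes that can be fed to an online clustering or recognition method.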

Both the “Data_inkml” and “Data_img” folders are organized into 10 subgroups. Each subgroup corresponds to one question and contains 200 answers. For each question, there are 1 to 3 correct answer categories and 2 to 8 incorrect answer categories. Each answer category is stored in its own folder: folders of correct categories have a “_correct” suffix, and folders of incorrect categories have an “_incorrect” suffix. For example, “Set1” has 2 correct answer categories and 8 incorrect ones; the correct answers are stored in the folders “0_correct” and “1_correct”. Hence, the folder structure of “Set1” is as follows.

- Set1
  - 0_correct
  - 1_correct
  - 2_incorrect
  - 3_incorrect
  - 4_incorrect
  - 5_incorrect
  - 6_incorrect
  - 7_incorrect
  - 8_incorrect
  - 9_incorrect
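The folder naming convention above makes it possible to recover ground-truth cluster labels directly from the paths. The following is a minimal sketch, assuming each subgroup folder (such as “Set1”) sits directly under “Data_inkml” or “Data_img” and that category folders are named `<id>_correct` or `<id>_incorrect`. The function name `load_subgroup` and its `extension` parameter are hypothetical.

```python
from pathlib import Path


def load_subgroup(subgroup_dir, extension=".inkml"):
    """Map each sample file in one subgroup to (category_id, is_correct).

    Assumes category folders are named "<id>_correct" or "<id>_incorrect";
    anything else in the subgroup folder is skipped.
    """
    labels = {}
    for category_dir in sorted(Path(subgroup_dir).iterdir()):
        if not category_dir.is_dir():
            continue
        # "0_correct" -> ("0", "_", "correct")
        category_id, _, suffix = category_dir.name.partition("_")
        if suffix not in ("correct", "incorrect"):
            continue
        for sample in sorted(category_dir.glob(f"*{extension}")):
            labels[sample] = (category_id, suffix == "correct")
    return labels
```

For instance, `load_subgroup("Data_inkml/Set1")` would return one entry per “.inkml” file under that path layout, and passing `extension=".png"` with a “Data_img” path would do the same for the offline images. The category identifiers can then serve as reference labels when scoring a clustering result for that question.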

| File | Type | Size | Downloads | Description |
|---|---|---|---|---|
| Dset_Mix.rar | data | 8 MB | 19 | The ".inkml" files are stored in the "Data_inkml" folder. The ".png" files are stored in the "Data_img" folder. |