Datasets per Publication

The MSRA Text Detection 500 Database (MSRA-TD500) was collected and released publicly as a benchmark for evaluating text detection algorithms. Its purpose is to track recent progress in text detection in natural images, especially advances in detecting text of arbitrary orientations.

MSRA-TD500 contains 500 natural images, taken from indoor (office and mall) and outdoor (street) scenes with a pocket camera. The indoor images mainly show signs, doorplates and caution plates, while the outdoor images are mostly guide boards and billboards against complex backgrounds. Image resolutions vary from 1296x864 to 1920x1280.

The dataset is challenging because of both the diversity of the text and the complexity of the backgrounds. The text may appear in different languages (Chinese, English or a mixture of both), fonts, sizes, colors and orientations. The backgrounds may contain vegetation (e.g. trees and bushes) and repeated patterns (e.g. windows and bricks) that are not easily distinguished from text.

The dataset is divided into two parts: a training set of 300 images randomly selected from the original dataset and a test set of the remaining 200 images. All images in the dataset are fully annotated. The basic annotation unit is the text line (see Figure 1) rather than the word used in the ICDAR datasets, because it is hard to partition Chinese text lines into individual words based on their spacing; even for English text lines, word partitioning is non-trivial without high-level information.
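Since annotations for arbitrarily oriented text lines are typically stored as rotated rectangles (a center/size/angle parameterization is the common convention for this dataset, though the exact file layout is not described above and is an assumption here), a minimal sketch of recovering the four corner points of such a box:

```python
import math

def rotated_rect_corners(cx, cy, w, h, angle):
    """Return the four corner points of a rotated rectangle.

    (cx, cy) is the box center, (w, h) its width/height, and `angle`
    the rotation in radians. The field layout of the actual ground-truth
    files is an assumption; this only shows the geometry.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners

# Axis-aligned case: corners are the center plus/minus the half extents.
print(rotated_rect_corners(100, 50, 40, 20, 0.0))
```

Evaluation protocols for oriented text detection usually intersect such polygons rather than axis-aligned boxes, which is why the corner representation is convenient.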

MSRA Text Detection 500 Database, 13-01-2014 (v. 1), by Cong Yao
Ground Truth: GT for MSRA Text Detection 500 Database, 13-01-2014 (v. 1), by Cong Yao
Task: Text Detection in Natural Images, 13-01-2014 (v. 1), by Cong Yao


This collection contains table structure ground truth data (rows, columns, cells, etc.) for document images containing tables in the UNLV and UW3 datasets.

The ground truth is stored in XML format, recording row and column boundaries, bounding boxes of cells, and additional attributes such as row-spanning and column-spanning cells. Each XML ground-truth file has the same basename as the corresponding image in the respective dataset.

These XML files can be used to generate color-encoded ground-truth images in PNG format, which can be used directly by the pixel-accurate benchmarking framework described in [1]. Generating a 16-bit color-encoded ground-truth image requires the ground-truth XML file and the word bounding-box OCR results file. We provide these OCR result files for all images in the dataset; each file shares the basename of the corresponding image file in the original dataset.

We used the T-Truth tool, also provided below, to prepare the ground truth. The tool is easy to use and is described in [1]. We trained a user to operate T-Truth and asked him to prepare the ground truth for the target images from the above datasets. The ground truth for each image is stored in an XML file. The ground truth was then manually validated by another expert using the preview edit mode of T-Truth, and improper entries were corrected. Several such iterations were performed to ensure the accuracy of the ground truth.
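The XML-to-color-encoded-image step described above can be sketched as follows. The element and attribute names (`Cell`, `x0`, `startRow`, ...) are hypothetical stand-ins, since the actual T-Truth schema is not reproduced here, and the "image" is a small integer grid rather than a real 16-bit PNG:

```python
import xml.etree.ElementTree as ET

# Hypothetical ground-truth fragment: real files use the T-Truth schema,
# whose exact element/attribute names may differ from these.
GT_XML = """
<Table>
  <Cell startRow="0" endRow="0" startCol="0" endCol="1" x0="0" y0="0" x1="19" y1="4"/>
  <Cell startRow="1" endRow="1" startCol="0" endCol="0" x0="0" y0="5" x1="9" y1="9"/>
  <Cell startRow="1" endRow="1" startCol="1" endCol="1" x0="10" y0="5" x1="19" y1="9"/>
</Table>
"""

def render_label_image(xml_text, width, height):
    """Paint each cell's bounding box with a unique integer label.

    A real implementation would write these labels into a 16-bit PNG so
    the pixel-accurate benchmark of [1] can compare them directly.
    """
    img = [[0] * width for _ in range(height)]
    for label, cell in enumerate(ET.fromstring(xml_text).iter("Cell"), start=1):
        x0, y0 = int(cell.get("x0")), int(cell.get("y0"))
        x1, y1 = int(cell.get("x1")), int(cell.get("y1"))
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                img[y][x] = label
    return img

img = render_label_image(GT_XML, 20, 10)
print(img[0][0], img[7][3], img[7][15])  # labels of the three cells
```

Encoding each cell as a distinct pixel value is what makes the benchmarking pixel-accurate: segmentation errors show up as label mismatches rather than box-overlap thresholds.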

Tables in UNLV dataset:

The original dataset contains 2889 pages of scanned document images from a variety of sources (magazines, newspapers, business letters, annual reports, etc.). The scanned images are provided at 200 and 300 DPI in bitonal, grey and fax formats. Ground truth provided alongside the original dataset contains manually marked zones, with zone types given in text format.

Closer examination of the dataset reveals that no table zones are marked in the fax images, so this subset is not considered here. The grey images are all also present in bitonal format, so we concentrated on bitonal documents at 300 DPI for the preparation of ground truth. We selected the images for which table zones are marked in the ground truth; there are around 427 such images, and we provide table structure ground truth for these document images.

Tables in UW3 Dataset:

The original dataset consists of 1600 skew-corrected English document images with manually edited ground truth of entity bounding boxes. These boxes enclose the page frame, text and non-text zones, text lines, and words. The type of each zone (text, math, table, half-tone, ...) is also marked. Around 120 document images contain at least one marked table zone; we provide table structure ground truth for these document images.

[1] Asif Shahab, Faisal Shafait, Thomas Kieninger and Andreas Dengel, "An Open Approach towards the Benchmarking of Table Structure Recognition Systems," Proceedings of DAS '10, pp. 113-120, June 9-11, 2010, Boston, MA, USA.

Table Ground Truth for the UW3 and UNLV datasets, 24-06-2014 (v. 1), by Asif Shahab
Ground Truth: Table structure and OCR GT dataset for UW3 and UNLV datasets, 24-06-2014 (v. 1), by Asif Shahab
Signature Verification and Writer Identification Competitions for On- and Offline Skilled Forgeries, 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigDutch (Dutch Signatures: Off-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigJapanese (Japanese Signatures: Off-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigJapanese (Japanese Signatures: On-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task Wi (Writer Identification & Retrieval), 24-02-2015 (v. 1), by Muhammad Imran Malik

ICDAR 2013 - 12th International Conference on Document Analysis and Recognition, Washington, DC, USA

This is the dataset of the ICDAR 2013 - Gender Identification from Handwriting competition. If you use this database, please consider citing it as in [1].

This dataset is a subset of the QUWI dataset [2]. In total, 475 writers each produced four handwritten documents: the first page contains an Arabic handwritten text that varies from one writer to another; the second, an Arabic text that is the same for all writers; the third, an English text that varies from writer to writer; and the fourth, an English text that is the same for all writers.

Images were acquired with an EPSON GT-S80 scanner at 300 DPI and provided in uncompressed JPG format. The training set consists of the first 282 writers, for whom the genders are provided. Participants were asked to predict the gender of the remaining 193 writers.

In addition to images, features extracted from the data were also provided in order to make the competition accessible by people without image processing skills. Those features are described in [3].

[1] Hassaïne, Abdelâali, et al. "ICDAR 2013 Competition on Gender Prediction from Handwriting." Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013.

[2] Hassaïne, Abdelaali, et al. "The ICDAR2011 Arabic writer identification contest." Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011.

[3] Al Maadeed, Somaya, and Abdelaali Hassaine. "Automatic prediction of age, gender, and nationality in offline handwriting." EURASIP Journal on Image and Video Processing 2014.1 (2014): 1-10.


ICDAR 2013 - Gender Identification Competition Dataset, 25-01-2015 (v. 1), by Abdelaali Hassaine
Ground Truth: Genders of all writers, 25-01-2015 (v. 1), by Abdelaali Hassaine
Task: Gender identification using all documents, 25-01-2015 (v. 1), by Abdelaali Hassaine


SOURCES: Statistics Canada, The World Bank, Statistics Norway, Statistics Finland, US Department of Justice, US Energy Information Administration, US Census Bureau.

DATA COLLECTION: About 1000 tables were collected from international statistical websites by DocLab graduate students in 2009-2010. A perceptually random subset of 200 of these tables was converted (mostly from HTML) via Excel into CSV files and stored at DocLab. The original HTML files were not retained, and their URLs were kept somewhat haphazardly. Many of these tables can still be found by a web search, but others have been modified, corrected, or updated on the original sites. The dataset now consists only of the 200 CSV representations of 200 web tables.

GROUND TRUTH: The four critical cells (top-left and bottom-right of the stub header, and top-left and bottom-right of the data region) were entered with an interactive tool (VeriClick) in 2011. Several GT errors were subsequently found during segmentation experiments and corrected. Researchers may still disagree on 2-3 tables over whether a particular row or column (with unusual contents or units) belongs to the data region. The GT for all 200 tables is a single CSV file, which makes it easy to extend and to verify the results of a header segmentation experiment.
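Given the four critical cells, the remaining header regions follow by construction: everything above the data region in its columns is the column header, and everything to its left in its rows is the row header. A minimal sketch using (row, col) index pairs; the concrete CSV encoding of the ground truth is not specified above, so these argument names are illustrative:

```python
def segment_table(stub_tl, stub_br, data_tl, data_br):
    """Derive table regions from the four critical cells.

    Each argument is a (row, col) pair. The column header spans the data
    columns above the data region; the row header spans the data rows to
    its left. This mirrors the critical-cell model described in the text,
    though the actual GT file layout is an assumption here.
    """
    return {
        "stub_header": (stub_tl, stub_br),
        "column_header": ((stub_tl[0], data_tl[1]), (data_tl[0] - 1, data_br[1])),
        "row_header": ((data_tl[0], stub_tl[1]), (data_br[0], data_tl[1] - 1)),
        "data": (data_tl, data_br),
    }

regions = segment_table(stub_tl=(0, 0), stub_br=(1, 0),
                        data_tl=(2, 1), data_br=(9, 5))
print(regions["column_header"])  # ((0, 1), (1, 5))
print(regions["row_header"])    # ((2, 0), (9, 0))
```

This is why four cells suffice as ground truth for header segmentation: the full region decomposition is recoverable from them.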


TANGO-DocLab web tables from international statistical sites, 16-03-2016 (v. 1), by George Nagy
Ground Truth: Critical cells for table header segmentation, 16-03-2016 (v. 1), by George Nagy
Task: Table segmentation, 16-03-2016 (v. 1), by George Nagy

ICDAR 2015 - 13th IAPR International Conference on Document Analysis and Recognition, Nancy, France

A Dataset for Arabic Text Detection, Tracking and Recognition in News Videos - AcTiV, 16-03-2016 (v. 1), by Oussama Zayene
Ground Truth: Global XML file, 16-03-2016 (v. 1), by Oussama Zayene
Ground Truth: Detection ground-truth files, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Detection in Arabic News Video Frames, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Tracking in Arabic News Video, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Recognition in Arabic News Video Frames, 16-03-2016 (v. 1), by Oussama Zayene

Modern smartphones have had a revolutionary impact on the way people digitize paper documents. The wide ownership of smartphones and their ease of use for digitizing paper documents have resulted in massive amounts of imagery of digitized paper documents. The goal of digitizing paper documents is not only to archive them for sharing but also, most of the time, to process them with automated document image processing systems, which extract the content of the document images in order to recognize it, index it, verify it, compare it with a database, and so on.

However, smartphone cameras are optimized for capturing natural scenes, and simply taking a photo of a paper document does not ensure that its content will be exploitable by automated document image processing systems. Problems can arise from the lighting conditions, the image resolution, camera noise, perspective distortion, physical distortions of the paper (folds, etc.), out-of-focus blur and/or motion blur during capture. To ensure that the content of a captured document image is exploitable by automated systems, it is important to assess the quality of the captured image automatically and in real time; it is often impossible to re-capture the document later, because the original is no longer available. Quality assessment is also required when captured document images are to be transmitted for further processing.


The quality assessment step is an important part of both the acquisition and the digitization processes. Assessing document quality could aid users during capture or help improve image enhancement methods after a document has been captured. However, the field of document image quality assessment currently lacks public databases.


In order to provide a baseline benchmark for quality assessment methods for mobile-captured documents, we present a database for quality assessment that contains both singly- and multiply-distorted document images.
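A common building block in such quality assessment pipelines is a focus measure; one widely used heuristic (a generic technique, not the method behind SmartDoc-QA) is the variance of a Laplacian response, where low variance indicates blur. A minimal pure-Python sketch on a tiny grayscale grid:

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over a grayscale image.

    `img` is a list of rows of pixel intensities. Sharp images have
    strong edges and hence a high-variance Laplacian; blurred ones score
    low. This is a generic focus measure, not the SmartDoc-QA metric.
    """
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1]
                   + img[y][x + 1] - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

sharp = [[0, 0, 255, 255]] * 4      # hard vertical edge
blurred = [[0, 85, 170, 255]] * 4   # same edge, smoothed out
print(laplacian_variance(sharp) > laplacian_variance(blurred))  # True
```

Thresholding such a score on-device is one way to warn the user to re-capture before the original document is out of reach.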



University of La Rochelle - SmartDoc-QA, 11-01-2017 (v. 1), by Nibal Nayef
Ground Truth: Text Transcriptions of SmartDoc-QA Dataset, 22-01-2017 (v. 1), by Nibal Nayef
Task: OCR for SmartDoc-QA, 22-01-2017 (v. 1), by Nibal Nayef

ICFHR 2014 - 14th International Conference on Frontiers in Handwriting Recognition, Crete Island, Greece

The dataset provides more than 11,000 expressions handwritten by hundreds of writers from different countries, merging the datasets from four CROHME competitions. Writers were asked to copy printed expressions from a corpus designed to cover the diversity required by the different tasks, drawn from an existing math corpus and from expressions embedded in Wikipedia pages. Different devices were used (different digital pen technologies, a whiteboard input device, a tablet with a touch-sensitive screen), so scales and resolutions differ. The dataset provides only the on-line signal.
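CROHME distributes the on-line signal as InkML, where each `<trace>` element holds a comma-separated list of pen points. A minimal sketch of recovering the strokes, using a stripped-down fragment (real CROHME files also carry a namespace, annotations and symbol-level trace groups):

```python
import xml.etree.ElementTree as ET

# Stripped-down InkML fragment; real CROHME files include a namespace,
# ground-truth annotations and traceGroup elements as well.
INKML = """
<ink>
  <trace>10 20, 11 22, 12 25</trace>
  <trace>30 5, 31 6</trace>
</ink>
"""

def parse_strokes(inkml_text):
    """Return each stroke as a list of (x, y) floats."""
    strokes = []
    for trace in ET.fromstring(inkml_text).iter("trace"):
        points = [tuple(map(float, p.split()))
                  for p in trace.text.strip().split(",")]
        strokes.append(points)
    return strokes

strokes = parse_strokes(INKML)
print(len(strokes), strokes[0][0])  # 2 (10.0, 20.0)
```

Because only the on-line signal is provided, recognizers must work from these point sequences (or render them to images themselves) rather than from scanned bitmaps.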

In the most recent of these competitions, CROHME 2013, the test part was completely original, while the training part drew on five existing datasets: MathBrush (University of Waterloo), HAMEX (University of Nantes), MfrDB (Czech Technical University), ExpressMatch (University of Sao Paulo), and the KAIST dataset.

For CROHME 2014, a new test set of 987 new expressions was created and two new tasks were added: isolated symbol recognition and matrix recognition. Training and test files, as well as the evaluation scripts for these new tasks, are provided. For the isolated symbol datasets, elements were extracted from the full expressions of the existing datasets, which also introduces segmentation errors. For the matrix recognition task, 380 new expressions were labelled and split into training and test sets.

Furthermore, six participants in the 2012 competition provided their recognized expressions for the 2012 test part. These data enable research on decision fusion and on evaluation metrics.

ICFHR 2014 CROHME: Fourth International Competition on Recognition of Online Handwritten Mathematical Expressions, 16-02-2015 (v. 2), by Harold Mouchère
Ground Truth: CROHME, 16-02-2015 (v. 1), by Harold Mouchère
Task: Mathematical Expression Recognition, 16-02-2015 (v. 1), by Harold Mouchère
Task: Isolated Mathematical Symbol Recognition, 16-02-2015 (v. 1), by Harold Mouchère
Task: Matrix Recognition, 16-02-2015 (v. 1), by Harold Mouchère

ICFHR 2016 - 15th International Conference on Frontiers in Handwriting Recognition, China

ICFHR 2016 Competition on Recognition of On-line Handwritten Mathematical Expressions, 18-07-2017 (v. 1), by Harold Mouchère
Ground Truth: CROHME, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Formulas, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Symbols, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Structure, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Matrices, 18-07-2017 (v. 1), by Harold Mouchère