ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records (ICDAR2019HDRC)

2019-08-29 (v. 1)

Contact author

Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki

LTU, Sweden

rajkumar.saini@ltu.se, marcus.liwicki@ltu.se

+46 (0)920 491006

You can cite this dataset as: Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki, ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records (ICDAR2019HDRC), version 1, ID: ICDAR2019HDRC_1, URL: https://tc11.cvc.uab.es/datasets/ICDAR2019HDRC_1

Dataset Information

Keywords

ICDAR2019HDRC, Historical Chinese documents, Document Image Analysis, Textline Recognition, Textline Detection, Text segmentation

Description

The dataset consists of 1172 document images, written mainly in traditional Chinese (Han) script. The images have been taken from different books. The dataset targets historical documents and is intended to support the development of robust systems for historical document analysis. To this end, a competition named the Historical Document Reading Challenge on Large Chinese Structured Family Records (ICDAR 2019 HDRC Chinese for short) will be held on this dataset. The objective of the competition is to boost research on historical document analysis. Its focus is to recognize and analyze the layout, and then to detect and recognize the text lines and characters of the documents in this dataset.

Please refer to the following article:

@inproceedings{simistira2019icdar2019hdrc,
  archivePrefix = {arXiv},
  arxivId = {1903.03341},
  eprint = {1903.03341},
  author = {Saini, Rajkumar and Dobson, Derek and Morrey, Jon and Liwicki, Marcus and Simistira Liwicki, Foteini},
  title = {{ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records}},
  booktitle = {{to appear in 15th International Conference on Document Analysis and Recognition (ICDAR)}},
  year = {2019},
  month = {mar},
}

Technical Details

The dataset consists of (1) document images in JPG format, (2) XML ground truth files, and (3) PNG ground truth files.

The XML files contain the ground truth for each text line: bounding box coordinates, the character string to be recognized, writing direction, script type, and other information.
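Because the XML schema is not described on this page, a reasonable first step may be to list which element tags a ground-truth file actually contains before writing a parser. The following is a minimal sketch using only the Python standard library. The function name `summarize_xml_tags` and the sample document, including its tag and attribute names, are hypothetical and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml_tags(xml_text):
    """Count how often each element tag occurs in one XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # Drop a namespace prefix of the form "{uri}tag", if present.
        tag = element.tag.split("}", 1)[-1]
        counts[tag] += 1
    return counts


# Hypothetical sample; the real files may use different tags and attributes.
sample = """<page>
  <textline direction="vertical"><text>abc</text></textline>
  <textline direction="vertical"><text>def</text></textline>
</page>"""

print(summarize_xml_tags(sample))
```

Running the same function on one real ground-truth file should reveal the actual tag names, which can then be used to extract the bounding boxes and text strings.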

PNG files are the ground truth for text line detection and text segmentation.

The number of text lines and the size of the characters vary across images. Graphical artifacts are present in many document images.
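Since each document comes with three kinds of files (JPG image, XML ground truth, PNG ground truth), it can help to check that the extracted archive is complete before training or evaluation. The sketch below assumes that the files belonging to one document share the same base name and differ only in extension; this naming convention is an assumption and is not confirmed on this page. The function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# The three file types listed under Technical Details.
WANTED = {".jpg", ".xml", ".png"}


def group_by_stem(root_dir):
    """Map each file stem to the set of dataset extensions found for it."""
    groups = defaultdict(set)
    for path in Path(root_dir).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in WANTED:
            groups[path.stem].add(suffix)
    return dict(groups)


def incomplete_stems(groups):
    """Return the stems that lack at least one of the three file types."""
    return sorted(stem for stem, found in groups.items() if found != WANTED)
```

For example, `incomplete_stems(group_by_stem("ICDAR2019HDRCdataset"))` would list the documents with a missing image or ground-truth file, where the directory name stands for wherever the archive was extracted. If the real archive uses a different naming scheme, the grouping key needs to be adapted.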


File | Type | Size | Downloads | Description
ICDAR2019HDRCdataset.zip | data | 1146 MB | 72 | Chinese Document Images