Table Ground Truth for the UW3 and UNLV datasets (DFKI-TGT-2010)

2014-06-24 (v. 1)

Contact author

Prof. Dr. Faisal Shafait

SEECS, NUST

faisal.shafait@seecs.nust.edu.pk

You can cite this dataset as: Prof. Dr. Faisal Shafait, Table Ground Truth for the UW3 and UNLV datasets (DFKI-TGT-2010), 1, ID: DFKI-TGT-2010_1, URL: https://tc11.cvc.uab.es/datasets/DFKI-TGT-2010_1

Dataset Information

Keywords

Table structure recognition, Benchmarking table recognition algorithms, Table ground truth, Table recognition dataset, Evaluation framework for table structure recognition systems

Description


This collection contains table structure ground truth data (rows, columns, cells, etc.) for the document images in the UNLV and UW3 datasets that contain tables.

The ground truth is provided in XML format. Each file stores row and column boundaries, the bounding boxes of cells, and additional attributes such as whether a cell spans several rows or columns. Each XML ground truth file has the same basename as the corresponding image in the respective dataset.
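Because ground truth and images are matched by basename, pairing them can be sketched in a few lines of Python. This is only an illustration: the directory layout, the function name `pair_ground_truth`, and the set of image extensions are assumptions, not part of the dataset.

```python
from pathlib import Path

# Assumed image extensions; adjust to the format of your copy of the images.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}


def pair_ground_truth(gt_dir, image_dir):
    """Map each XML ground-truth file to the image sharing its basename.

    Returns (pairs, missing): a dict {xml_path: image_path} and a list of
    XML file names for which no image with the same basename was found.
    """
    images = {}
    for path in Path(image_dir).iterdir():
        if path.suffix.lower() in IMAGE_SUFFIXES:
            images[path.stem] = path

    pairs = {}
    missing = []
    for xml_path in sorted(Path(gt_dir).glob("*.xml")):
        image = images.get(xml_path.stem)
        if image is None:
            missing.append(xml_path.name)
        else:
            pairs[xml_path] = image
    return pairs, missing
```

The `missing` list is a quick way to check whether a local copy of the original images covers every ground truth file.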

These XML files can then be used to generate color-encoded ground truth images in PNG format, which can be used directly by the pixel-accurate benchmarking framework described in [1]. Generating the 16-bit color-encoded ground truth images requires both the ground truth XML file and a file with the word bounding boxes from OCR. We provide these OCR result files for all images in the dataset; each file has the same basename as the image file in the original dataset.
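To illustrate why both inputs are needed, the sketch below assigns each word box to the cell it overlaps most and paints the word pixels with that cell's label. It does not reproduce the actual 16-bit color encoding of the framework in [1]; the half-open box convention `(x0, y0, x1, y1)`, the label values, and all function names are assumptions made for this example.

```python
def overlap_area(a, b):
    """Area of the intersection of two boxes given as (x0, y0, x1, y1)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def assign_words_to_cells(cells, words):
    """For each word box, return the index of the cell it overlaps most.

    Words that overlap no cell get None.
    """
    labels = []
    for word in words:
        best, best_area = None, 0
        for index, cell in enumerate(cells):
            area = overlap_area(cell, word)
            if area > best_area:
                best, best_area = index, area
        labels.append(best)
    return labels


def paint_label_map(width, height, words, labels):
    """Build a 2D list: word pixels hold (cell index + 1), background is 0."""
    label_map = [[0] * width for _ in range(height)]
    for (x0, y0, x1, y1), label in zip(words, labels):
        if label is None:
            continue  # word lies outside every table cell
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                label_map[y][x] = label + 1
    return label_map
```

Only pixels inside word boxes receive a label, so empty space inside a cell stays background in this sketch.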

We used the T-Truth tool, also provided below, to prepare the ground truth. The tool is easy to use and is described in [1]. We trained a user to operate the T-Truth tool and asked him to prepare the ground truth for the target images from the above datasets. The ground truth for each image is stored in an XML file. The ground truth was then manually validated by another expert using the preview edit mode of the T-Truth tool, and improper ground truth was corrected. This validation cycle was repeated several times to ensure the accuracy of the ground truth.

Tables in UNLV dataset:

The original dataset contains 2889 pages of scanned document images from a variety of sources (magazines, newspapers, business letters, annual reports, etc.). The scanned images are provided at 200 and 300 DPI in bitonal, grey, and fax formats. The original dataset also comes with ground truth data containing manually marked zones; the zone types are provided in text format.

Closer examination of the dataset reveals that there are no marked table zones in the fax images, so this subset is not considered here. The grey images are all also present in bitonal format, so we concentrated on the bitonal documents at 300 DPI when preparing the ground truth. We selected those images for which table zones have been marked in the original ground truth. There are around 427 such images. We provide table structure ground truth for these document images.

Tables in UW3 Dataset:

The original dataset consists of 1600 skew-corrected English document images with manually edited ground-truth of entity bounding boxes. These bounding boxes enclose page frame, text and non-text zones, textlines, and words. The type of each zone (text, math, table, half-tone, ...) is also marked. There are around 120 document images containing at least one marked table zone. We provide table structure ground truth for these document images.

[1] Asif Shahab, Faisal Shafait, Thomas Kieninger, and Andreas Dengel, "An Open Approach towards the Benchmarking of Table Structure Recognition Systems", Proceedings of DAS’10, pp. 113-120, June 9-11, 2010, Boston, MA, USA.

 

2019-09-17 Update

New version of the T-Truth utility (file named "t-truth_v20190917.tar.xz") with the following changes:

  1. Dependencies on an older version of Java have been replaced. The tool is now compatible with the latest version of Java and has been tested on both Ubuntu and Windows 10.
  2. The UX has been updated to make table tagging easier and more fun. We fixed some bugs and removed some annoying features such as "Evaluate Cells".
  3. A new feature has been added to mark the table flow: we now distinguish horizontal tables (where data headers are aligned to the left) from vertical tables (where data headers are on top).

Technical Details

The UNLV table dataset contains 424 GT files in XML format.

The UW3 table dataset contains 117 GT files in XML format.

The total size of the dataset is ~8 MB.

File                       Type      Size    Downloads  Description
unlv_table_gt.tar.gz       data      3 MB    223        Table Structure Ground Truth for the UNLV dataset.
uw3_table_gt.tar.gz        data      1 MB    159        Table Structure Ground Truth for the UW-3 dataset.
t-truth.tar.gz             software  4 MB    132        T-Truth Software and Samples.
t-truth_v20190917.tar.xz   software  887 KB  51         New version of the T-Truth annotation utility (2019-09-17).

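After downloading, the stated file counts (424 for UNLV, 117 for UW3) can be checked against the archives without extracting them. The following is a minimal sketch using Python's standard `tarfile` module; the function name `count_xml_members` is hypothetical, and the check assumes that the ground truth files are the only archive members ending in `.xml`.

```python
import tarfile


def count_xml_members(archive_path):
    """Count regular files ending in .xml inside a tar archive."""
    # "r:*" lets tarfile detect the compression (gzip, xz, ...) by itself.
    with tarfile.open(archive_path, "r:*") as archive:
        return sum(
            1
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".xml")
        )
```

For example, `count_xml_members("unlv_table_gt.tar.gz")` could be compared with the figure given under Technical Details.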
Comments
GHANMI 04-22-2016 17:35
Here, I found only XML GT of the UNLV and UW3 datasets. How can I obtain the corresponding images?

Sincerely!
Vidushi Gupta 03-03-2017 04:17
I need to convert these XML files into the corresponding images. How can I do it?
Sara Zhalehpour 04-05-2018 23:22
Hi, How can we get the actual images?
bear 04-20-2018 04:24
Hi, I do need the original images, can you offer me one?
zouyj 05-04-2018 09:47
Hi, I only found the XML files, but no original PNG images. How can I get them? THANKS!!!
Siddharth 05-24-2018 16:05
Where can we find those 2889 PNG images for UNLV dataset?
Did anyone of the above users find the same? Please send me a link!! Thanks in advance
James 08-28-2018 20:00
Hi all,

I found the images for this dataset hosted on Google Code, as used for Tesseract testing:
https://github.com/tesseract-ocr/tesseract/wiki/UNLV-Testing-of-Tesseract#downloading-the-images

The data is scattered across the .zp files there, but the filenames line up with the ground truths on this page.

There is also an article on this here, which provides the data in a more consolidated form.
https://blog.goodaudience.com/table-detection-using-deep-learning-7182918d778