ICPR 2020 Competition on HArvesting Raw Tables (ICPR-2020-CHART-UB_PMC) (ICPR2020-CHART-Info)

2021-01-09 (v. 1)

Contact author

Kenny Davila

University at Buffalo

kennydav@buffalo.edu

+504 97868084

You can cite this dataset as: Kenny Davila, ICPR 2020 Competition on HArvesting Raw Tables (ICPR-2020-CHART-UB_PMC) (ICPR2020-CHART-Info), version 1, ID: ICPR2020-CHART-Info_1, URL: https://tc11.cvc.uab.es/datasets/ICPR2020-CHART-Info_1

Dataset Information

Keywords

Chart, Real, Plots, Information Graphics, PubMedCentral

Description

This is a dataset of manually annotated chart images extracted from the PubMed Central Open Access section. We selected only images that were originally published under a Creative Commons license, and we acknowledge the source of each image by including the original PMC id in each file name.
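Since each file name carries the PMC id of its source article, the id can be recovered with a short script. The following is a minimal sketch, assuming the id appears in the file name as "PMC" followed by digits; the exact naming pattern is not specified here, and the example file name is hypothetical.

```python
import re
from typing import Optional

# Assumption: the PMC id appears in the file name as "PMC" followed by digits.
PMC_ID = re.compile(r"PMC\d+")


def extract_pmc_id(file_name: str) -> Optional[str]:
    """Return the first PMC id found in a file name, or None if absent."""
    match = PMC_ID.search(file_name)
    return match.group(0) if match else None


# Hypothetical file name, for illustration only.
example = extract_pmc_id("PMC1234567___Fig2.jpg")
```

Returning None for names without an id keeps the helper safe to run over every file in the archive, including any that do not follow the assumed pattern.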

There are 15 types of charts: Area, Heatmap, Horizontal Bar, Horizontal Interval, Line, Manhattan, Map, Pie, Scatter, Scatter-Line, Surface, Venn, Vertical Bar, Vertical Box, Vertical Interval. Some of these were further labeled for advanced recognition tasks (Horizontal Bar, Vertical Bar, Line, Scatter, Horizontal Box and Vertical Box).

There are several tasks associated with this dataset including:

1) Chart Classification

2) Text Detection and Recognition

3) Text Role Classification

4) Axis Analysis

5) Legend Analysis

6) Plot Element Detection and Recognition

7) End-to-End Data Extraction

Both training and testing datasets are included in this release. For more information, please visit https://chartinfo.github.io/ or contact: kxd7282@rit.edu

Technical Details

Annotations are provided in two formats: JSON and XML. The JSON format is consistent with the synthetic training and testing datasets used for the competition. The XML format is the extended annotation used by our own chart annotation tools, which can be used to further refine the annotations and are available at: https://github.com/kdavila/ChartInfo_annotation_tools

 

Training dataset: 15,636 images with their corresponding JSON and XML annotations.

Testing dataset: 7,287 images with their corresponding JSON and XML annotations.
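Because every image is expected to have both a JSON and an XML annotation, a quick consistency check after unpacking can be useful. The sketch below groups file paths by base name and reports entries that lack one of the three files. It assumes that an image and its annotations share the same base name and that images use one of the listed extensions; neither the folder layout nor the image format is stated here, so adjust both as needed.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: images use one of these extensions.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def group_by_stem(paths):
    """Group file paths by base name, keyed by kind: image, json or xml."""
    groups = defaultdict(dict)
    for p in paths:
        path = PurePosixPath(p)
        ext = path.suffix.lower()
        if ext in IMAGE_EXTS:
            kind = "image"
        elif ext in (".json", ".xml"):
            kind = ext[1:]
        else:
            continue  # ignore unrelated files
        groups[path.stem][kind] = p
    return dict(groups)


def incomplete(groups):
    """Return base names missing an image, a JSON file or an XML file."""
    expected = {"image", "json", "xml"}
    return sorted(stem for stem, g in groups.items() if set(g) != expected)
```

The list of paths can come from `zipfile.ZipFile(...).namelist()` on a downloaded archive, so the check runs without extracting anything.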

For the testing set, the ground truth available for each image depends on the split that contains it. There are 5 splits, each used to evaluate different tasks, as follows:

 
 - Split 1: Chart Image Classification (Task 1) [5,103 images]
 - Split 2: Text Detection and Recognition (Task 2) [732 images]
 - Split 3: Text Role Classification, Axis Understanding, Legend Understanding (Tasks 3, 4 and 5) [726 images]
 - Split 4: Chart Data Extraction (Tasks 6a and 6b) [726 images]
 - Split 5: End-to-end Data Extraction (Task 7) [726 images]
 
 * A total of 7,287 unique images are included.
   - Splits 1 to 4 are disjoint 
   - Splits 2 to 5 are disjoint
   - Split 5 is a subset of split 1 
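The unique-image total follows from the split sizes: splits 1 to 4 do not overlap, and split 5 reuses images from split 1. A short sketch of that arithmetic, using only the figures listed above (the dictionary layout is just an illustrative choice):

```python
# Split sizes and tasks as listed above.
SPLITS = {
    1: {"tasks": ["1"], "images": 5103},
    2: {"tasks": ["2"], "images": 732},
    3: {"tasks": ["3", "4", "5"], "images": 726},
    4: {"tasks": ["6a", "6b"], "images": 726},
    5: {"tasks": ["7"], "images": 726},
}

# Splits 1 to 4 are disjoint, so their sizes add up to the unique-image count.
unique_images = sum(SPLITS[i]["images"] for i in (1, 2, 3, 4))

# Split 5 is a subset of split 1, so it adds no new images and must fit inside it.
split5_fits = SPLITS[5]["images"] <= SPLITS[1]["images"]

print(unique_images)  # 7287
```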

File | Type | Size | Downloads | Description
release_ICPR2020_CHARTINFO_UB_PMC_TRAIN_v1.21.zip | data | 708 MB | 171 | ICPR 2020 - CHART-Infographics - UB PMC Training Dataset
release_ICPR2020_CHARTINFO_UB_PMC_TEST_v1.0.zip | data | 665 MB | 122 | ICPR 2020 - CHART Infographics - UB PMC Testing Dataset