PubTabNet (PubTabNet)

2021-05-28 (v. 1)

Contact author

Antonio Jose Jimeno Yepes

University of Melbourne

antonio.jimeno@gmail.com

+61405096629

You can cite this dataset as: Antonio Jose Jimeno Yepes, PubTabNet (PubTabNet), 1, ID: PubTabNet_1, URL: https://tc11.cvc.uab.es/datasets/PubTabNet_1

Dataset Information

Dataset URL

https://developer.ibm.com/technologies/artificial-intelligence/data/pubtabnet/

Keywords

Table recognition; Image analytics; Scientific literature

Description

PubTabNet is a large dataset for image-based table recognition, containing 516k+ images of tabular data annotated with the corresponding HTML representation of the tables.

PubTabNet can be used to train and evaluate image-based table recognition models. A model must recognize both the structure and the content of each table and reconstruct its HTML representation from the table image alone. The HTML representation encodes both the table structure and the content of each cell. The position (bounding box) of each table cell is also provided to support more diverse model designs. The tables come from the PubMed Central Open Access Subset (commercial use collection). Both the images and the HTML are extracted automatically by matching the PDF and XML versions of the articles in that subset.
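As a rough illustration of how a structure-plus-content annotation can be turned back into HTML, the sketch below interleaves per-cell content into a sequence of structure tokens. The record layout used here (a `structure` list of HTML tokens and a `cells` list whose entries carry `tokens` and a `bbox`) is an assumption made for this example and is not specified on this page; the function name `rebuild_html` and the sample values are likewise hypothetical.

```python
def rebuild_html(structure_tokens, cells):
    """Insert each cell's content just before its closing </td> token.

    Assumed layout (hypothetical): `structure_tokens` is a list of HTML
    structure tokens, and `cells` lists one dict per table cell, in reading
    order, with a `tokens` list holding that cell's content.
    """
    out = []
    cell_iter = iter(cells)
    for tok in structure_tokens:
        if tok == "</td>":
            # Emit the content of the next cell before closing it.
            out.append("".join(next(cell_iter)["tokens"]))
        out.append(tok)
    return "".join(out)


# Hypothetical annotation for a one-row, two-cell table.
sample = {
    "structure": ["<table>", "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</table>"],
    "cells": [
        {"tokens": ["A", "g", "e"], "bbox": [1, 2, 30, 12]},
        {"tokens": ["4", "2"], "bbox": [35, 2, 60, 12]},
    ],
}

html = rebuild_html(sample["structure"], sample["cells"])
# Bounding boxes stay attached to the cells, e.g. for localization-based models.
boxes = [cell["bbox"] for cell in sample["cells"]]
print(html)
```

Keeping structure and cell content separate, as assumed here, would let a model predict the table layout and the cell text in different stages while the bounding boxes remain available per cell.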

Inputs and Ground truth

 
