Synthetic Brazilian Documents Database (SBR-Doc Database)

2021-08-22 (v. 1)

Contact author

Celso A M Lopes Junior

Universidade de Pernambuco

camlj@ecomp.poli.br

+55 81 992469364

You can cite this dataset as: Celso A M Lopes Junior, Synthetic Brazilian Documents Database (SBR-Doc Database), v. 1, ID: SBR-Doc Database_1, URL: https://tc11.cvc.uab.es/datasets/SBR-Doc Database_1

Dataset Information

Keywords

Documents Database, ICDAR 2021 Competition, Image segmentation, Signature segmentation, documents analysis

Description

Licence agreement: users of the dataset are asked to download, fill in, and send this Licence Agreement Form back to the authors.

If you use this dataset, please consider citing the following paper:

LOPES JUNIOR, C. A. M., NEVES JUNIOR, R. B., BEZERRA, B. L. D., TOSELLI, A. H., IMPEDOVO, D.: Competition on components segmentation task of document photos. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1–15. Springer Nature, (2021). https://doi.org/10.1007/978-3-030-86337-1_45.

The dataset is composed of ID document images of the following Brazilian document types (acronyms in Portuguese):

National Driver's License (CNH),
Natural Persons Register (CPF), and
General Registration (RG).

The document images were captured with different cell phone cameras at different resolutions. The documents in the images have varied textures, colors, and lighting, and appear on different real-world backgrounds with non-uniform patterns.

As the personal information contained in the documents cannot be disclosed, the original data has been replaced by synthetic data. We generated false information for all data present in the document (such as name, date of birth, affiliation, numerical data, among others). Handwritten signatures, in turn, were taken from the MCYT and GPDS datasets and synthetically incorporated into the images of the identity documents.

Technical Details

The database consists of 20,000 input images: 15,000 for training and 5,000 for testing. It also includes 45,000 GT images, 15,000 for each task (1, 2, and 3).

15,000 ID document images for training (input);
5,000 ID document images for testing (input);
15,000 GT images for Task 1;
15,000 GT images for Task 2;
15,000 GT images for Task 3.
Total files: 65,000 image files.

Total size: ~13.3 GB.
The input images are all in RGB color mode and ".jpg" format.
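The stated total can be checked by adding up the per-group counts. The short sketch below only restates the numbers given in this description; the dictionary keys are illustrative labels, not names used in the dataset.

```python
# File counts as stated in the dataset description.
# The keys are illustrative labels, not actual folder names.
counts = {
    "train_input": 15_000,
    "test_input": 5_000,
    "gt_task1": 15_000,
    "gt_task2": 15_000,
    "gt_task3": 15_000,
}

# Input images: training plus testing.
input_total = counts["train_input"] + counts["test_input"]

# Ground-truth (GT) images: one group per task.
gt_total = sum(n for key, n in counts.items() if key.startswith("gt_"))

total_files = input_total + gt_total
print(input_total, gt_total, total_files)  # 20000 45000 65000
```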

The database is split for training and testing. Input images are the same for all tasks. However, GT images (.png) are separate for each task after unzipping the database files. See the directory tree in the following figure:

Fig. Directory tree
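Since input images are shared across tasks and GT images are stored per task, a loader typically has to match each ".jpg" input with the ".png" GT file of a given task. The following is a minimal sketch under two assumptions that this description does not confirm: the folder names passed in are hypothetical placeholders (the directory tree figure is not reproduced here), and a GT file is assumed to share its base name with the corresponding input image.

```python
from pathlib import Path


def pair_inputs_with_gt(input_dir, gt_dir):
    """Pair each input .jpg with the .png of the same base name in gt_dir.

    Returns (pairs, missing): pairs is a list of (input_path, gt_path)
    tuples, and missing lists the inputs with no matching GT file.
    Assumption: GT files reuse the input file's base name.
    """
    input_dir, gt_dir = Path(input_dir), Path(gt_dir)
    pairs, missing = [], []
    for image_path in sorted(input_dir.glob("*.jpg")):
        gt_path = gt_dir / (image_path.stem + ".png")
        if gt_path.is_file():
            pairs.append((image_path, gt_path))
        else:
            missing.append(image_path)
    return pairs, missing


# Hypothetical usage; replace the folder names with the ones found
# after unzipping the database files:
# pairs, missing = pair_inputs_with_gt("train/input", "train/gt_task1")
```

Reporting the unmatched inputs separately makes it easy to notice an incomplete extraction or a naming scheme that differs from the one assumed here.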

File                   Type   Size    Downloads  Description
train.7z.001           data   650 MB  83         train database - input images
train.7z.002           data   650 MB  43         train database - input images
train.7z.004           data   650 MB  35         train database - input images
train.7z.003           data   650 MB  34         train database - input images
train.7z.005           data   650 MB  32         train database - input images
train.7z.006           data   650 MB  33         train database - input images
train.7z.007           data   650 MB  34         train database - input images
train.7z.008           data   650 MB  30         train database - input images
train.7z.009           data   650 MB  29         train database - input images
train.7z.010           data   650 MB  29         train database - input images
train.7z.011           data   650 MB  30         train database - input images
train.7z.012           data   650 MB  32         train database - input images
train.7z.013           data   650 MB  30         train database - input images
train.7z.014           data   650 MB  29         train database - input images
train.7z.015           data   650 MB  31         train database - input images
train.7z.016           data   298 MB  56         train database - input images
LICENSE AGREEMENT.pdf  other  89 KB   41         Licence agreement
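The training input images are distributed as a 16-volume split archive, so every volume has to be present before extraction. The sketch below checks a download folder for the volume names listed above and sums the listed sizes; the function names and the folder argument are illustrative, not part of the dataset.

```python
from pathlib import Path

NUM_PARTS = 16  # train.7z.001 ... train.7z.016, as listed above


def expected_parts(prefix="train.7z", n=NUM_PARTS):
    """Return the volume file names in order: prefix.001 ... prefix.NNN."""
    return [f"{prefix}.{i:03d}" for i in range(1, n + 1)]


def missing_parts(download_dir):
    """Return the names of volumes not found in download_dir."""
    folder = Path(download_dir)
    return [name for name in expected_parts() if not (folder / name).is_file()]


# Listed sizes in MB: fifteen 650 MB volumes plus a final 298 MB volume.
listed_total_mb = 15 * 650 + 298
print(listed_total_mb)  # 10048
```

Once no volume is missing, pointing the 7-Zip command-line tool at the first volume (for example, 7z x train.7z.001) extracts the whole set. The listed volumes add up to roughly 10 GB of the ~13.3 GB total stated for the database.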