Synthetic Brazilian Documents Database (SBR-Doc Database)

2021-08-22 (v. 1)

Contact author

Celso A M Lopes Junior

Universidade de Pernambuco

+55 81 992469364

You can cite this dataset as: Celso A M Lopes Junior, Synthetic Brazilian Documents Database (SBR-Doc Database), 1, ID: SBR-Doc Database_1, URL: Database_1

Dataset Information


Documents Database, ICDAR 2021 Competition, Image segmentation, Signature segmentation, documents analysis


Licence agreement: authors who wish to use the dataset are asked to download this Licence Agreement Form, fill it in, and send it back to the dataset authors.


If you use this dataset, please consider citing the following paper:

LOPES JUNIOR, C. A. M., NEVES JUNIOR, R. B., BEZERRA, B. L. D., TOSELLI, A. H., IMPEDOVO, D.: Competition on components segmentation task of document photos. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1–15. Springer Nature (2021).


The dataset is composed of ID document images of the following Brazilian document types (acronyms in Portuguese):

National Driver's License (CNH),
Natural Persons Register (CPF), and
General Registration (RG).

The document images were captured with different cell phone cameras at different resolutions. The documents in the images vary in texture, color, and lighting, and appear on different real-world backgrounds with non-uniform patterns.

Because the personal information contained in the documents cannot be disclosed, the original data was replaced with synthetic data. We generated false information for all data present in the documents (such as name, date of birth, affiliation, and numerical data, among others). The handwritten signatures were taken from the MCYT and GPDS datasets and synthetically incorporated into the images of the identity documents.

Technical Details

The database consists of 20,000 input images: 15,000 for training and 5,000 for testing. More than 45,000 ground-truth (GT) images are also part of the database, with 15,000 GT images for each task (1, 2, and 3).

15,000 ID document images for training (input);
5,000 ID document images for testing (input);
15,000 GT images for Task 1;
15,000 GT images for Task 2;
15,000 GT images for Task 3.
Total files: 65,000 image files.

Total size: ~13.3 GB.
The input images are all in RGB color mode and ".jpg" format.
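
The counts listed above can be checked after extraction with a short script. This is a minimal sketch, assuming all archives have been extracted under a single root folder of the user's choosing; it relies only on the file extensions stated here (".jpg" for inputs, ".png" for GT images). The function names are illustrative. If the database holds more GT images than the 45,000 listed, the ".png" count will be reported as a mismatch.

```python
from collections import Counter
from pathlib import Path

# Counts taken from the list above: 15,000 + 5,000 inputs, 3 x 15,000 GT images.
EXPECTED = {".jpg": 20_000, ".png": 45_000}


def count_by_extension(root):
    """Count files under `root` (recursively), keyed by lower-cased extension."""
    return Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())


def report_mismatches(counts, expected=EXPECTED):
    """Return {extension: (found, expected)} for every extension whose count differs."""
    return {
        ext: (counts.get(ext, 0), want)
        for ext, want in expected.items()
        if counts.get(ext, 0) != want
    }
```

Calling `report_mismatches(count_by_extension("path/to/extracted"))` returns an empty dictionary when both counts agree with the list.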

The database is split into training and testing sets. The input images are the same for all tasks, but the GT images (.png) are stored separately for each task once the database files are unzipped. See the directory tree in the following figure:

Fig. Directory tree
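
Because the input images are shared by all tasks and only the GT images differ, a loader has to pair each input with the GT image of the chosen task. The following is a minimal sketch, assuming that a GT image carries the same file name stem as its input image. That naming convention and the folder names in the usage note are assumptions, not confirmed by this description, so they should be adjusted to the actual directory tree.

```python
from pathlib import Path


def pair_inputs_with_gt(input_dir, gt_dir):
    """Pair each .jpg input with the .png GT image sharing its file name stem.

    Returns (pairs, unmatched): a sorted list of (input_path, gt_path) tuples
    and a sorted list of inputs for which no GT image was found.
    Matching by stem is an assumption about the naming scheme.
    """
    gt_by_stem = {p.stem: p for p in Path(gt_dir).glob("*.png")}
    pairs, unmatched = [], []
    for image in sorted(Path(input_dir).glob("*.jpg")):
        gt = gt_by_stem.get(image.stem)
        if gt is None:
            unmatched.append(image)
        else:
            pairs.append((image, gt))
    return pairs, unmatched
```

For example, `pairs, unmatched = pair_inputs_with_gt("train/input", "train/gt_task1")` uses hypothetical folder names; a non-empty `unmatched` list suggests that the stem assumption does not hold for that folder.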









File                   Type   Size    Description
train.7z.001           data   650 MB  train database - input images
train.7z.002           data   650 MB  train database - input images
train.7z.003           data   650 MB  train database - input images
train.7z.004           data   650 MB  train database - input images
train.7z.005           data   650 MB  train database - input images
train.7z.006           data   650 MB  train database - input images
train.7z.007           data   650 MB  train database - input images
train.7z.008           data   650 MB  train database - input images
train.7z.009           data   650 MB  train database - input images
train.7z.010           data   650 MB  train database - input images
train.7z.011           data   650 MB  train database - input images
train.7z.012           data   650 MB  train database - input images
train.7z.013           data   650 MB  train database - input images
train.7z.014           data   650 MB  train database - input images
train.7z.015           data   650 MB  train database - input images
train.7z.016           data   298 MB  train database - input images
LICENSE AGREEMENT.pdf  other  89 KB   Licence agreement
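
The training input images are distributed as a 16-part 7-Zip archive, so every part has to be downloaded into the same folder before extraction starts from the first part. The following is a minimal sketch, assuming the `7z` command-line tool is installed and on the PATH; the helper names and the output folder are illustrative.

```python
import subprocess
from pathlib import Path


def missing_parts(folder, base="train.7z", n_parts=16):
    """Return the names of archive parts (base.001 ... base.NNN) absent from `folder`."""
    expected = [f"{base}.{i:03d}" for i in range(1, n_parts + 1)]
    return [name for name in expected if not (Path(folder) / name).is_file()]


def extract_command(folder, out_dir, base="train.7z"):
    """Build the 7z call; pointing 7z at the .001 part makes it read the other parts."""
    return ["7z", "x", str(Path(folder) / f"{base}.001"), f"-o{out_dir}"]


def extract(folder, out_dir):
    """Extract the archive, refusing to start if any part is missing."""
    missing = missing_parts(folder)
    if missing:
        raise FileNotFoundError(f"missing archive parts: {missing}")
    subprocess.run(extract_command(folder, out_dir), check=True)
```

Checking for missing parts first avoids a partial extraction, since a split archive cannot be read to the end without all of its volumes.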