Synthetic Brazilian Documents Database (SBR-Doc Database)
Documents Database, ICDAR 2021 Competition, Image segmentation, Signature segmentation, Document analysis
Licence agreement: users of the dataset are asked to download this Licence Agreement Form, fill it in, and send it back to the authors.
If you use this dataset, please consider citing the following paper:
LOPES JUNIOR, C. A. M., NEVES JUNIOR, R. B., BEZERRA, B. L. D., TOSELLI, A. H., IMPEDOVO, D.: Competition on components segmentation task of document photos. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1–15. Springer Nature, (2021). https://doi.org/10.1007/978-3-030-86337-1_45.
The dataset is composed of ID document images of the following Brazilian document types (acronyms in Portuguese):
National Driver's License (CNH),
Natural Persons Register (CPF), and
General Registration (RG).
The document images were captured with different cell phone cameras at different resolutions. The documents in the images vary in texture, color, and lighting, and appear on different real-world backgrounds with non-uniform patterns.
Because the personal information contained in the documents cannot be disclosed, the original data has been replaced by synthetic data. We generated false information for all data present in each document (such as name, date of birth, affiliation, and numerical data, among others). The handwritten signatures were taken from the MCYT and GPDS datasets and synthetically incorporated into the images of the identity documents.
The database consists of 20,000 input images, 15,000 for training and 5,000 for testing. A further 45,000 ground-truth (GT) images are also part of the database, with 15,000 GT images for each task (1, 2, and 3).
15,000 ID document images for training (input);
5,000 ID document images for testing (input);
15,000 GT images for Task 1;
15,000 GT images for Task 2;
15,000 GT images for Task 3.
Total files: 65,000 image files
Total size: ~13.3GB.
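As a quick sanity check on the figures above, the following short Python sketch adds up the stated counts. The variable names are illustrative only; the numbers are the ones given in this description.

```python
# Counts as stated in the dataset description.
train_inputs = 15_000
test_inputs = 5_000
gt_per_task = 15_000
num_tasks = 3

input_images = train_inputs + test_inputs   # training + testing inputs
gt_images = gt_per_task * num_tasks         # one GT set per task
total_files = input_images + gt_images      # every image file in the database

print(input_images, gt_images, total_files)  # prints: 20000 45000 65000
```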
The input images are all in RGB color mode and ".jpg" format.
The database is split into training and testing sets. Input images are the same for all tasks. However, GT images (.png) are stored separately for each task once the database files are unzipped. See the directory tree in the following figure:
Fig. Directory tree
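Because the input images are shared across tasks while the GT images are kept per task, a loader typically needs to pair each ".jpg" input with its ".png" GT image. The Python sketch below shows one possible way to do this. It assumes that a GT file has the same base name as its input image, optionally followed by a suffix; this naming rule, the function name pair_inputs_with_gt, and the gt_suffix parameter are assumptions made for illustration and are not confirmed by this description, so check them against the unzipped files.

```python
from pathlib import Path


def pair_inputs_with_gt(input_dir, gt_dir, gt_suffix=""):
    """Pair each .jpg input with the .png GT image of the same base name.

    ASSUMPTION: the GT file is named <input stem><gt_suffix>.png.
    Returns (pairs, missing): matched (input, gt) paths, and inputs
    for which no GT image was found.
    """
    gt_by_stem = {p.stem: p for p in Path(gt_dir).glob("*.png")}
    pairs, missing = [], []
    for img in sorted(Path(input_dir).glob("*.jpg")):
        gt = gt_by_stem.get(img.stem + gt_suffix)
        if gt is None:
            missing.append(img)
        else:
            pairs.append((img, gt))
    return pairs, missing
```

The function would be called once per task, passing the shared input folder and that task's GT folder. A non-empty missing list suggests either an incomplete extraction or a naming rule different from the one assumed here.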
|train.7z.001||data||(650 MB)||54||train database - input images|
|train.7z.002||data||(650 MB)||29||train database - input images|
|train.7z.003||data||(650 MB)||25||train database - input images|
|train.7z.004||data||(650 MB)||25||train database - input images|
|train.7z.005||data||(650 MB)||23||train database - input images|
|train.7z.006||data||(650 MB)||23||train database - input images|
|train.7z.007||data||(650 MB)||25||train database - input images|
|train.7z.008||data||(650 MB)||21||train database - input images|
|train.7z.009||data||(650 MB)||20||train database - input images|
|train.7z.010||data||(650 MB)||20||train database - input images|
|train.7z.011||data||(650 MB)||21||train database - input images|
|train.7z.012||data||(650 MB)||22||train database - input images|
|train.7z.013||data||(650 MB)||20||train database - input images|
|train.7z.014||data||(650 MB)||20||train database - input images|
|train.7z.015||data||(650 MB)||22||train database - input images|
|train.7z.016||data||(298 MB)||40||train database - input images|
|LICENSE AGREEMENT.pdf||other||(89 KB)||32||Licence agreement|
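The training inputs listed above are distributed as a 16-part split 7z archive (15 parts of 650 MB plus one of 298 MB, about 10,048 MB in total), and extraction only works when every part is present in the same folder. The Python sketch below checks for missing parts and builds a 7-Zip command line. The helper names missing_parts and extract_command are hypothetical, and the sketch assumes the 7z command-line tool is installed; pointing 7z at the ".001" file is enough for it to read the remaining parts.

```python
import re
from pathlib import Path


def missing_parts(folder, base="train.7z", expected=16):
    """Return the part numbers (1..expected) of a split archive not found in folder."""
    pattern = re.compile(re.escape(base) + r"\.(\d{3})")
    found = set()
    for p in Path(folder).iterdir():
        m = pattern.fullmatch(p.name)
        if m:
            found.add(int(m.group(1)))
    return [n for n in range(1, expected + 1) if n not in found]


def extract_command(folder, base="train.7z"):
    """Build the 7-Zip command that extracts the archive into folder.

    ASSUMPTION: the `7z` executable is available on the PATH.
    """
    first_part = Path(folder) / f"{base}.001"
    return ["7z", "x", str(first_part), f"-o{folder}"]
```

A typical use is to call missing_parts on the download folder first and, only if it returns an empty list, pass the result of extract_command to subprocess.run.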