Synchromedia Multispectral Ancient Document Images Dataset (SMADI)
Dataset Information
Keywords
Ancient Documents, Multispectral images
Description
Spectral analysis of writing materials is of great importance for the study of ancient documents. Multispectral (MS) imaging is an innovative, non-destructive technique for analysing materials. For this purpose, we collected a multispectral image database of ancient handwritten letters.
Multispectral images of an old manuscript.
The SMADI database consists of 240 multispectral images of 30 real historical handwritten letters. All manuscripts were written in iron gall ink and date from the 17th to the 20th century. The original documents were borrowed from Quebec's national library, Bibliothèque et Archives nationales du Québec (BAnQ), and were imaged with a CROMA CX MSI camera, producing 8 images for each document. All 240 images were calibrated for illumination and registered.
Technical Details
This dataset is organised as follows:
- The database is divided into two main folders: MSI and GT.
- The MSI folder contains 30 sub-folders. Each sub-folder has an ID composed of the letter "z" and a number (e.g. "z97"). Each sub-folder contains the 8 MS images of one ancient document; see the table below for details. All MS images are already calibrated for illumination and registered.
- The GT folder contains the corresponding 30 ground-truth images saved in binary form (1 for background and 0 for text). The ID of each image starts with the letter "z", followed by a number, and ends with the letters "GT" (e.g. "z97GT").
- The writing dates for each document are given in the "age.xlsx" file.
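The folder layout above can be walked with a short script. The following is a minimal Python sketch, not part of the dataset itself. It assumes the archive has been extracted so that `MSI` and `GT` sit directly under a root folder, and that each ground-truth file has some extension after the "GT" suffix (the extension is not stated in the description). The function names `list_documents` and `text_mask` are hypothetical.

```python
from pathlib import Path


def list_documents(root):
    """Map each document ID (e.g. "z97") to (MSI sub-folder, GT image path or None)."""
    root = Path(root)
    pairs = {}
    for folder in sorted(p for p in (root / "MSI").iterdir() if p.is_dir()):
        doc_id = folder.name
        # Assumption: the GT file is named "<doc_id>GT.<ext>"; the extension is unknown.
        matches = sorted((root / "GT").glob(f"{doc_id}GT.*"))
        pairs[doc_id] = (folder, matches[0] if matches else None)
    return pairs


def text_mask(gt_rows):
    """Turn GT pixel values (1 = background, 0 = text) into True-for-text booleans."""
    return [[value == 0 for value in row] for row in gt_rows]
```

Because the ground truth stores text as 0, many binarization metrics that expect foreground as 1 need the inversion done by `text_mask` before comparison.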
The eight spectral bands are named as follows:
Image name | Wavelength (nm) | Light filter |
---|---|---|
F1s.png | 340 | UV |
F2s.png | 500 | Visible 1 (Blue) |
F3s.png | 600 | Visible 2 (Green) |
F4s.png | 700 | Visible 3 (Red) |
F5s.png | 800 | IR 1 |
F6s.png | 900 | IR 2 |
F7s.png | 1000 | IR 3 |
F8s.png | 1100 | IR 4 |
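For scripted access, the band table can be kept as a small lookup structure. The sketch below is illustrative only: the values are transcribed from the table above, while the names `BANDS`, `files_for` and `nearest_band` are hypothetical helpers, not part of the dataset.

```python
# Band table transcribed from the dataset description:
# file name -> (wavelength in nm, light filter label)
BANDS = {
    "F1s.png": (340, "UV"),
    "F2s.png": (500, "Visible 1 (Blue)"),
    "F3s.png": (600, "Visible 2 (Green)"),
    "F4s.png": (700, "Visible 3 (Red)"),
    "F5s.png": (800, "IR 1"),
    "F6s.png": (900, "IR 2"),
    "F7s.png": (1000, "IR 3"),
    "F8s.png": (1100, "IR 4"),
}


def files_for(filter_prefix):
    """Return band file names whose filter label starts with the given prefix."""
    return [name for name, (_, label) in BANDS.items() if label.startswith(filter_prefix)]


def nearest_band(wavelength_nm):
    """Return the band file whose listed wavelength is closest to the request."""
    return min(BANDS, key=lambda name: abs(BANDS[name][0] - wavelength_nm))
```

For example, `files_for("IR")` selects the four infrared images of a document, which can then be opened from that document's sub-folder with any image library.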
This database was collected by Rachid Hedjam and Mohamed Cheriet (2013/2014).
Email: mohamed.cheriet@etsmtl.ca
Synchromedia Laboratory, ETS, École de technologie supérieure, University of Quebec.
The SMADI database is publicly accessible and freely available for non-commercial research purposes. Any other use requires written permission. If you publish scientific work based on SMADI, you are requested to cite the following papers:
[1] Hedjam, R., & Cheriet, M. (2013). Historical document image restoration using multispectral imaging system. Pattern Recognition, 46(8), 2297-2312.
[2] Hedjam, Rachid, and Mohamed Cheriet. "Ground-truth estimation in multispectral representation space: Application to degraded document image binarization." Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013.
File | Type | Size | Downloads | Description |
---|---|---|---|---|
MSI-dataset.zip | data | (79 MB) | 70 | Dataset |
Ground-truth estimation in multispectral representation space: Application to degraded document image binarization.pdf | article | (168 KB) | 27 | ICDAR Conference paper |
Historical document image restoration using multispectral imaging system.pdf | article | (2 MB) | 19 | Pattern recognition journal paper |
dates.pdf | other | (12 KB) | 20 | dates |
References
[1] Hedjam, R., & Cheriet, M. (2013). Historical document image restoration using multispectral imaging system. Pattern Recognition, 46(8), 2297-2312.
[2] Hedjam, Rachid, and Mohamed Cheriet. "Ground-truth estimation in multispectral representation space: Application to degraded document image binarization." Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013.