New Database and Benchmark for Script Identification (MDIW-13)

2022-03-21 (v. 1)

Contact author

Moises Diaz

Universidad de Las Palmas de Gran Canaria, Spain

moises.diaz@ulpgc.es

You can cite this dataset as: Moises Diaz, New Database and Benchmark for Script Identification (MDIW-13), 1, ID: MDIW-13_1, URL: https://tc11.cvc.uab.es/datasets/MDIW-13_1

Dataset Information

Keywords

script identification, printed text, handwritten text

Description

Script identification is a necessary step in some document analysis applications that operate in multi-script and multi-language environments. This paper provides a new database for benchmarking script identification algorithms. It contains both printed and handwritten documents collected from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents: scans of local newspapers and of handwritten letters and notes from different native writers. These documents are further segmented into lines and words, yielding a total of 13,979 lines and 86,655 words. Easy-to-use benchmarks are proposed with handcrafted and deep learning methods. The benchmarks report results at the document, line, and word levels for printed and handwritten documents. Results of script identification independent of the document/line/word level and independent of the printed/handwritten type are also given.

The database can be freely downloaded for research purposes at:

https://www.dropbox.com/sh/bzd15dkp1990kfk/AABEY1bvZwLEcJ1SBY_Gep9da?dl=0

Please cite our work if you find the database useful:

M. A. Ferrer, A. Das, M. Diaz, A. Morales, C. Carmona-Duarte, U. Pal (2022), "MDIW-13: New Database and Benchmark for Script Identification", Multimedia Tools and Applications, Pages 1-14. Accepted

A. Das, M. A. Ferrer, A. Morales, M. Diaz, U. Pal, et al. "SIW 2021: ICDAR Competition on Script Identification in the Wild". 16th International Conference on Document Analysis and Recognition (ICDAR 2021). Lecture Notes in Computer Science, vol 12824. Springer. Sep. 5-10, 2021, Lausanne, Switzerland, pp. 738-753. doi: 10.1007/978-3-030-86337-1_49
