Datasets per Topic

AcTiV (Arabic Text in Video) is the first publicly accessible annotated dataset designed to assess the performance of different Arabic video OCR systems. The challenges addressed by the AcTiV database are the variability of text patterns and the presence of complex backgrounds containing objects that resemble text characters. AcTiV enables users to test their systems' abilities to locate, track and read text objects in videos. The current version of the dataset includes 80 videos collected from 4 different Arabic news channels. Two types of video stream were chosen: Standard Definition (720x576, 25 fps) and High Definition (1920x1080, 25 fps). The focus is mainly on text displayed as an overlay in news video, which can be classified into two types: static text and dynamic text.

http://www.sage-eniso.org/content/fr/20/activ-data-base.html

Two sub-datasets are derived from the AcTiV database: AcTiV-D (D for detection) and AcTiV-R (R for recognition). AcTiV-D is a sub-dataset of non-redundant frames used to measure the performance of single-frame methods at detecting/localizing text regions in still HD/SD images. AcTiV-R is a sub-dataset of cropped images used to measure the performance of Arabic OCR systems at reading text in video frames.
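As an illustration of how single-frame detection output can be scored against such annotations, the sketch below matches detected and ground-truth text boxes by intersection-over-union. It is a minimal, generic example, not the official AcTiV evaluation protocol, and the (x, y, width, height) box format is an assumption.

# Minimal detection-scoring sketch (generic IoU matching, not the official AcTiV protocol).
# Boxes are assumed to be (x, y, width, height) tuples in pixels.

def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def precision_recall(detections, ground_truth, threshold=0.5):
    """Greedy one-to-one matching of detections to ground-truth boxes."""
    matched, used = 0, set()
    for det in detections:
        for i, gt in enumerate(ground_truth):
            if i not in used and iou(det, gt) >= threshold:
                matched += 1
                used.add(i)
                break
    precision = matched / len(detections) if detections else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    return precision, recall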

Typical video frames from the proposed dataset. Top Sub-figures: examples of Russia Today and ElWataniya1 frames. Bottom Sub-figures: examples of Aljazeera HD and France 24 frames.

 

A Dataset for Arabic Text Detection, Tracking and Recognition in News Videos - AcTiV, 16-03-2016 (v. 1), by Oussama Zayene
Ground Truth: Global XML file, 16-03-2016 (v. 1), by Oussama Zayene
Ground Truth: Detection Ground-truth files, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Detection in Arabic News Video Frames, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Tracking in Arabic News Video, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Recognition in Arabic News Video Frames, 16-03-2016 (v. 1), by Oussama Zayene

The collection contains offline signature samples. The signatures were collected under the supervision of Bryan Found and Doug Rogers in 2002 and 2006, respectively. The images were scanned at 600 dpi resolution and cropped at the Netherlands Forensic Institute for the purpose of the 4NSigComp2010 signature verification competition.

Sample signatures from the 4NSigComp2010 dataset.

The signature collection for training contains 209 images. The signatures comprise 9 reference signatures by the same writer "A" and 200 questioned signatures. The 200 questioned signatures comprise 76 genuine signatures written by the reference writer in his/her normal signature style, 104 simulated/forged signatures (written by 27 forgers freehand copying the signature characteristics of the reference writer), and 20 disguised signatures written by the reference writer. The disguise process comprises an attempt by the reference writer to purposefully alter his/her signature in order to avoid being identified or to be able to deny having written it. The simulation/forgery process comprises an attempt by a writer to imitate the signature characteristics of the genuine author.

The signature collection for testing contains 125 signatures. The signatures comprise 25 reference signatures by the same writer "B" and 100 questioned signatures. The 100 questioned signatures comprise 3 genuine signatures written by the reference writer in his/her normal signature style, 90 simulated signatures (written by 34 forgers freehand copying the signature characteristics of the reference writer), and 7 disguised signatures written by the reference writer. All writings were made using the same make of ballpoint pen on the same make of paper.

Any use of these data must cite the following reference.

M. Liwicki, C. E. van den Heuvel, B. Found and M. I. Malik, "Forensic Signature Verification Competition 4NSigComp2010 - Detection of Simulated and Disguised Signatures", Proceedings of ICFHR 2010, pp. 715-720.

 

ICFHR 2010 Signature Verification Competition (4NSigComp2010), 23-02-2015 (v. 1), by Muhammad Imran Malik
Ground Truth: Writer ID Information for the 4NSigComp2010 dataset, 06-03-2015 (v. 1), by Muhammad Imran Malik
Task: Signature Verification, 23-02-2015 (v. 1), by Muhammad Imran Malik

Objectives

The objective of this competition is to allow researchers and practitioners from academia and industry to compare the performance of their signature verification systems on new, unpublished forensic-like datasets (Dutch, Japanese). Skilled forgeries and genuine signatures were collected while writing on paper that, in some cases, was attached to a digitizing tablet. The collected signature data are available in offline format, and some signatures are also available in online format. Participants can choose to compete on the online data only, on the offline data only, or on both data formats combined.

Similar to the previous ICDAR competition, our aim is to compare different signature verification algorithms systematically for the forensic community, with the objective of establishing a benchmark of the performance of such methods (by providing new, unpublished forensic-like datasets with authentic and skilled forgeries in both on- and offline formats).

Background

The Forensic Handwriting Examiner (FHE) weighs the likelihood of the observations given (at least) two hypotheses:

H1: The questioned signature is an authentic signature of the reference writer.
H2: The questioned signature was written by a writer other than the reference writer.

The interpretation of the observed similarities/differences in signature analysis is not as straightforward as in other forensic disciplines such as DNA or fingerprint evidence, because signatures are the product of a behavioral process that can be manipulated by the reference writer himself or by another person. In this competition, we ask participants to produce two probability scores: the probability of observing the evidence (e.g. a certain similarity score) given that H1 is true, and the probability of observing the evidence given that H2 is true. In this competition only those cases of H2 occur in which the forger is not the reference writer. This continues the successful competitions held at ICDAR 2009 and 2011.
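These two scores are the numerator and denominator of the likelihood ratio commonly reported in forensic casework; writing E for the observed evidence (e.g. a similarity score produced by a verification system),

LR = P(E | H1) / P(E | H2),

so values above 1 support H1 and values below 1 support H2. Note that the competition asks for the two probabilities separately rather than for the ratio itself.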

Signature Verification and Writer Identification Competitions for On- and Offline Skilled Forgeries, 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigDutch (Dutch Signatures: Off-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigJapanese (Japanese Signatures: Off-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigJapanese (Japanese Signatures: On-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task Wi (Writer Identification & Retrieval), 24-02-2015 (v. 1), by Muhammad Imran Malik

Description:

This collection contains table structure ground-truth data (rows, columns, cells, etc.) for document images containing tables in the UNLV and UW3 datasets.

The ground truth that we provide is stored in XML format and records row and column boundaries, bounding boxes of cells, and additional attributes such as row-spanning and column-spanning cells. The XML ground-truth files have the same basename as the corresponding image in the respective dataset.

These XML files can then be used to generate color-encoded ground-truth images in PNG format, which can be used directly by the pixel-accurate benchmarking framework described in [1]. Generating the 16-bit color-encoded ground-truth images requires the ground-truth XML file and the word bounding-box OCR results file. We provide these OCR result files for all images in the dataset; each file has the same name as the basename of the image file in the original dataset.
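As a rough illustration of how such an XML file might be consumed, the sketch below collects per-cell records; the tag and attribute names (Cell, startRow, colSpan, bbox, ...) are assumptions and should be adapted to the actual schema produced by T-Truth.

# Hypothetical reader for the table-structure ground-truth XML.
# Tag/attribute names are assumptions; adjust them to the real schema.
import xml.etree.ElementTree as ET

def load_cells(xml_path):
    cells = []
    for cell in ET.parse(xml_path).getroot().iter("Cell"):
        cells.append({
            "start_row": int(cell.get("startRow", 0)),
            "start_col": int(cell.get("startCol", 0)),
            "row_span": int(cell.get("rowSpan", 1)),
            "col_span": int(cell.get("colSpan", 1)),
            # Bounding box of the cell in image coordinates, assumed as "x0 y0 x1 y1".
            "bbox": tuple(int(v) for v in cell.get("bbox", "0 0 0 0").split()),
        })
    return cells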

We used the T-Truth tool, also provided below, to prepare the ground-truth information. The tool is easy to use and is described in [1]. We trained a user to operate the T-Truth tool and asked him to prepare the ground truth for the target images from the above datasets. The ground truth for each image is stored in an XML file. The ground truth was manually validated by another expert using the preview edit mode of the T-Truth tool, and improper ground truth was corrected. Several such iterations were carried out to ensure the accuracy of the ground truth.

Tables in UNLV dataset:

The original dataset contains 2889 pages of scanned document images from a variety of sources (magazines, newspapers, business letters, annual reports, etc.). The scanned images are provided at 200 and 300 DPI resolution in bitonal, grey and fax formats. Ground-truth data provided alongside the original dataset contains manually marked zones; zone types are provided in text format.

Closer examination of the dataset reveals that there are no marked table zones in the fax images, so this subset is not considered here. The grey images are all also present in bitonal format, so we concentrated on bitonal documents at a resolution of 300 dpi for the preparation of the ground truth. We selected those images for which table zones have been marked in the ground truth; there are around 427 such images. We provide table-structure ground truth for these document images.

Tables in UW3 Dataset:

The original dataset consists of 1600 skew-corrected English document images with manually edited ground-truth of entity bounding boxes. These bounding boxes enclose page frame, text and non-text zones, textlines, and words. The type of each zone (text, math, table, half-tone, ...) is also marked. There are around 120 document images containing at least one marked table zone. We provide table structure ground truth for these document images.

[1] Asif Shahab, Faisal Shafait, Thomas Kieninger and Andreas Dengel, "An Open Approach towards the Benchmarking of Table Structure Recognition Systems", Proceedings of DAS 2010, pp. 113-120, June 9-11, 2010, Boston, MA, USA.

Table Ground Truth for the UW3 and UNLV datasets, 24-06-2014 (v. 1), by Asif Shahab
Ground Truth: Table structure and OCR GT dataset for UW3 and UNLV datasets, 24-06-2014 (v. 1), by Asif Shahab

Complex Text Containers

Scene Text

Figure 1. Typical images from MSRA-TD500. The red rectangles indicate text regions labelled as difficult (due to blur or occlusion).

The MSRA Text Detection 500 Database (MSRA-TD500) was collected and released publicly as a benchmark for evaluating text detection algorithms, with the purpose of tracking recent progress in the field of text detection in natural images, especially advances in detecting text of arbitrary orientations.

The MSRA Text Detection 500 Database (MSRA-TD500) contains 500 natural images, taken in indoor (office and mall) and outdoor (street) scenes using a pocket camera. The indoor images mainly show signs, doorplates and caution plates, while the outdoor images are mostly guide boards and billboards against complex backgrounds. The resolutions of the images vary from 1296x864 to 1920x1280.

The dataset is challenging because of both the diversity of the text and the complexity of the backgrounds. The text may be in different languages (Chinese, English or a mixture of both), fonts, sizes, colors and orientations. The background may contain vegetation (e.g. trees and bushes) and repeated patterns (e.g. windows and bricks), which are not easily distinguishable from text.

The dataset is divided into two parts: a training set and a test set. The training set contains 300 images randomly selected from the original dataset, and the remaining 200 images constitute the test set. All images in this dataset are fully annotated. The basic annotation unit in this dataset is the text line (see Figure 1) rather than the word used in the ICDAR datasets, because it is hard to partition Chinese text lines into individual words based on their spacing; even for English text lines, word partitioning is non-trivial without high-level information.
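The ground-truth files are usually described as plain text with one text line per row: an index, a difficult flag, the (x, y, w, h) of an axis-aligned box, and a rotation angle in radians about the box centre. The sketch below parses that layout and recovers the four corners of each rotated box; verify the field order against the downloaded files before relying on it.

# Parser sketch for one MSRA-TD500 ground-truth file.
# Assumed line format: index difficult x y w h angle(radians); check against the data.
import math

def parse_gt(gt_path):
    boxes = []
    with open(gt_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 7:
                continue
            difficult = int(parts[1])
            x, y, w, h, angle = map(float, parts[2:])
            # Rotate the box corners about its centre to get the oriented quadrilateral.
            cx, cy = x + w / 2.0, y + h / 2.0
            cos_a, sin_a = math.cos(angle), math.sin(angle)
            corners = [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
                       for dx, dy in [(-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2)]]
            boxes.append({"difficult": bool(difficult), "corners": corners})
    return boxes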

MSRA Text Detection 500 Database, 13-01-2014 (v. 1), by Cong Yao
Ground Truth: GT for MSRA Text Detection 500 Database, 13-01-2014 (v. 1), by Cong Yao
Task: Text Detection in Natural Images, 13-01-2014 (v. 1), by Cong Yao

Electronic Documents

Tables

SOURCE WEBSITES:

Statistics Canada  http://www.statcan.gc.ca/start-debut-eng.html

The World Bank,   http://www.worldbank.org/

Statistics Norway,    https://www.google.com/?gws_rd=ssl#q=Statistics+Norway

Statistics Finland, http://www.stat.fi/index_en.html

US Department of Justice,    https://www.justice.gov/

Geohive,   http://www.geohive.com/

US Energy Information Administration,    https://www.eia.gov/

US Census Bureau. http://www.census.gov/

DATA COLLECTION: About 1000 tables were collected from international statistical websites by DocLab graduate students in 2009-2010. A perceptually random subset of 200 of these tables was converted (mostly from HTML) via Excel into CSV files and stored at DocLab. The original HTML files were not retained, and their URLs were kept somewhat haphazardly. Many of these tables can still be found by a web search, but others have been modified, corrected, or updated on the original sites. The dataset now consists only of the 200 CSV-file representations of the 200 web tables.

GROUND TRUTH: The four critical cells (top-left and bottom-right of the stub header, and top-left and bottom-right of the data region) were entered with an interactive tool (VeriClick) in 2011. Several ground-truth errors were subsequently found during segmentation experiments and corrected. Researchers may still disagree on 2-3 tables over whether a particular row or column (unusual contents or units) belongs to the data region. The ground truth for all 200 tables is a single CSV file, which makes it easy to extend and to verify the results of a header segmentation experiment.
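As an illustration of how the four critical cells pin down a header segmentation, the sketch below slices a table (loaded from its CSV file) into stub header, column header, row header and data regions. The zero-based (row, column) indexing and the region layout are assumptions about the ground-truth conventions.

# Sketch: segment a table using the four critical cells.
# cc1/cc2 bound the stub header, cc3/cc4 bound the data region, all as (row, col)
# pairs; zero-based indexing is an assumption.
import csv

def segment_table(csv_path, cc1, cc2, cc3, cc4):
    with open(csv_path, newline="") as f:
        grid = list(csv.reader(f))
    stub       = [r[cc1[1]:cc2[1] + 1] for r in grid[cc1[0]:cc2[0] + 1]]
    col_header = [r[cc3[1]:cc4[1] + 1] for r in grid[cc1[0]:cc2[0] + 1]]
    row_header = [r[cc1[1]:cc2[1] + 1] for r in grid[cc3[0]:cc4[0] + 1]]
    data       = [r[cc3[1]:cc4[1] + 1] for r in grid[cc3[0]:cc4[0] + 1]]
    return stub, col_header, row_header, data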

 

TANGO-DocLab web tables from international statistical sites, 16-03-2016 (v. 1), by George Nagy
Ground Truth: Critical cells for table header segmentation, 16-03-2016 (v. 1), by George Nagy
Task: Table Segmentation, 16-03-2016 (v. 1), by George Nagy

Handwritten Documents

The dataset provides more than 11,000 expressions handwritten by hundreds of writers from different countries, merging the datasets from 4 CROHME competitions. Writers were asked to copy printed expressions from a corpus of expressions. The corpus was designed to cover the diversity required by the different tasks and was drawn from an existing math corpus and from expressions embedded in Wikipedia pages. Different devices were used (different digital pen technologies, a whiteboard input device, tablets with touch-sensitive screens), so the data span different scales and resolutions. The dataset provides only the on-line signal.
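The CROHME on-line signal is distributed as InkML files; a minimal reader for the pen strokes might look like the sketch below. It assumes the files declare the standard InkML namespace and that the first two channels of each sample are x and y; any extra channels (e.g. time) are ignored here.

# Minimal InkML stroke reader (assumes the standard InkML namespace is declared).
import xml.etree.ElementTree as ET

INKML = "{http://www.w3.org/2003/InkML}"

def load_strokes(inkml_path):
    strokes = []
    for trace in ET.parse(inkml_path).getroot().iter(INKML + "trace"):
        points = []
        for sample in trace.text.strip().split(","):
            values = sample.split()
            if len(values) >= 2:
                points.append((float(values[0]), float(values[1])))
        strokes.append(points)
    return strokes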

In the CROHME 2013 competition, the test part is completely original while the training part draws on 5 existing datasets:

MathBrush (University of Waterloo), HAMEX (University of Nantes), MfrDB (Czech Technical University), ExpressMatch (University of Sao Paulo) and the KAIST dataset.

In CROHME 2014 a new test set was created with 987 new expressions, and 2 new tasks were added: isolated symbol recognition and matrix recognition. Training and test files, as well as the evaluation scripts for these new tasks, are provided. For the isolated symbol dataset, elements are extracted from the full expressions of the existing datasets, which also includes segmentation errors. For the matrix recognition task, 380 new expressions were labelled and split into training and test sets.

Furthermore, 6 participants in the 2012 competition provided their recognized expressions for the 2012 test part. These data allow research on decision fusion and evaluation metrics.

ICFHR 2014 CROHME: Fourth International Competition on Recognition of Online Handwritten Mathematical Expressions, 16-02-2015 (v. 2), by Harold Mouchère
Ground Truth: CROHME, 16-02-2015 (v. 1), by Harold Mouchère
Task: Mathematical Expression Recognition, 16-02-2015 (v. 1), by Harold Mouchère
Task: Isolated Mathematical Symbol Recognition, 16-02-2015 (v. 1), by Harold Mouchère
Task: Matrix Recognition, 16-02-2015 (v. 1), by Harold Mouchère

Off-line

This is the dataset of the ICDAR 2013 - Gender Identification from Handwriting competition. If you use this database, please consider citing it as in [1].

This dataset is a subset of the QUWI dataset [2]. In total, 475 writers each produced 4 handwritten documents: the first page contains an Arabic handwritten text that varies from one writer to another, the second page contains an Arabic handwritten text that is the same for all writers, the third page contains an English handwritten text that varies from one writer to another, and the fourth page contains an English handwritten text that is the same for all writers.

Images were acquired using an EPSON GT-S80 scanner at 300 DPI resolution and provided in uncompressed JPG format. The training set consists of the first 282 writers, for whom the genders are provided. Participants were asked to predict the gender of the remaining 193 writers.

In addition to the images, features extracted from the data were also provided in order to make the competition accessible to people without image processing skills. Those features are described in [3].
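A minimal baseline on those precomputed features could look like the sketch below; the file names, the comma-separated layout and the gender label in the last column are purely hypothetical, so adapt the loading code to the actual feature files.

# Hypothetical baseline on the provided feature vectors (file names and layout invented).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

train = np.loadtxt("train_features.csv", delimiter=",")   # assumed: features + 0/1 gender label
test = np.loadtxt("test_features.csv", delimiter=",")     # assumed: features only

X_train, y_train = train[:, :-1], train[:, -1]
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
predicted_genders = clf.predict(test)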

[1] Hassaïne, Abdelâali, et al. "ICDAR 2013 Competition on Gender Prediction from Handwriting." Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013.

[2] Hassaïne, Abdelaali, et al. "The ICDAR2011 Arabic writer identification contest." Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011.

[3] Al Maadeed, Somaya, and Abdelaali Hassaine. "Automatic prediction of age, gender, and nationality in offline handwriting." EURASIP Journal on Image and Video Processing 2014.1 (2014): 1-10.

 

ICDAR 2013 - Gender Identification Competition Dataset, 25-01-2015 (v. 1), by Abdelaali Hassaine
Ground Truth: Genders of all writers, 25-01-2015 (v. 1), by Abdelaali Hassaine
Task: Gender identification using all documents, 25-01-2015 (v. 1), by Abdelaali Hassaine

On-line

ICFHR 2016 Competition on Recognition of On-line Handwritten Mathematical Expressions, 18-07-2017 (v. 1), by Harold Mouchère
Ground Truth: CROHME, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Formulas, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Symbols, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Structure, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Matrices, 18-07-2017 (v. 1), by Harold Mouchère

Machine-printed Documents

Video-Captured

Modern smartphones have had a revolutionary impact on the way people digitize paper documents. The wide ownership of smartphones and their ease of use for digitizing paper documents have resulted in a massive amount of imagery data from digitized paper documents. The goal of digitizing paper documents is not only to archive them for sharing but also, most of the time, to process them with automated document image processing systems, which extract the content of the document images in order to recognize it, index it, verify it, compare it with a database, etc. However, it is a known fact that smartphone cameras are optimized for capturing natural scene images. Taking a simple photo of a paper document does not ensure that its content will be exploitable by automated document image processing systems. This can be due to the lighting conditions, the resolution of the image, camera noise, perspective distortion, physical distortions (folds, etc.) of the paper, out-of-focus blur and/or motion blur during capture. To ensure that the content of a captured document image is exploitable by automated systems, it is important to assess the quality of a captured document image automatically and in real time; otherwise it is often impossible to re-capture the document image later on, because the original document is no longer available. Assessing the quality of a captured document image is also required in situations where the captured document images are to be transmitted for further processing.

 

The quality assessment step is an important part of both the acquisition and the digitization processes. Assessing document quality can aid users during the capture process or help improve image enhancement methods after a document has been captured. The current state of the art lacks databases in the field of document image quality assessment.

 

In order to provide a baseline benchmark for quality assessment methods for mobile-captured documents, we present a database for quality assessment that contains both singly- and multiply-distorted document images.
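As a toy example of the kind of real-time check such a benchmark enables, the sketch below scores sharpness with the variance of the Laplacian; this is a common no-reference blur indicator, not the metric used by the benchmark.

# Simple no-reference sharpness score (variance of the Laplacian); illustration only.
import cv2

def sharpness_score(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# Lower scores indicate stronger blur; a threshold tuned on the dataset could be used
# to prompt the user to re-capture the document immediately.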

 

Figures: the capture system; two sample document images; a magnified view of a blurry document.

 

University of La Rochelle - SmartDoc QA, 11-01-2017 (v. 1), by Nibal Nayef
Ground Truth: Text Transcriptions of SmartDoc-QA Dataset, 22-01-2017 (v. 1), by Nibal Nayef
Task: OCR for SmartDoc-QA, 22-01-2017 (v. 1), by Nibal Nayef