Datasets per Topic

AcTiV (Arabic Text in Video) is the first publicly accessible annotated dataset designed to assess the performance of different Arabic video OCR systems. The challenges addressed by the AcTiV database are the variability of text patterns and the presence of complex backgrounds containing objects that resemble text characters. AcTiV enables users to test their systems' abilities to locate, track and read text objects in videos. The current version of the dataset includes 80 videos collected from 4 different Arabic news channels. Two types of video stream were chosen for the present work: Standard-Definition (720x576, 25 fps) and High-Definition (1920x1080, 25 fps). We mainly focus on text displayed as overlays in news video, which can be classified into two types: static text and dynamic text.

Two sub-datasets are created from the AcTiV database: AcTiV-D (D for detection) and AcTiV-R (R for recognition). AcTiV-D is a sub-dataset of non-redundant frames used to measure the performance of single-frame-based methods that detect/localize text regions in still HD/SD images. AcTiV-R is a sub-dataset of cropped images used to measure the performance of Arabic OCR systems that read text in video frames.

Typical video frames from the proposed dataset. Top sub-figures: examples of Russia Today and ElWataniya1 frames. Bottom sub-figures: examples of Aljazeera HD and France 24 frames.

Please find hereafter the download links of the detection and recognition datasets.

AcTiV-D: https://drive.google.com/file/d/1RLmdVlb0hGaeML9ZgtTvK93aLHWuJDAJ/view
AcTiV-R: https://drive.google.com/file/d/1TY7jont7oJHVSmERTr6nJ9JEbfm4ARyT/view
A Dataset for Arabic Text Detection, Tracking and Recognition in News Videos - AcTiV, 16-03-2016 (v. 1), by Oussama Zayene
Ground Truth: Global XML file, 16-03-2016 (v. 1), by Oussama Zayene
Ground Truth: Detection ground-truth files, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Detection in Arabic News Video Frames, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Tracking in Arabic News Video, 16-03-2016 (v. 1), by Oussama Zayene
Task: Text Recognition in Arabic News Video Frames, 16-03-2016 (v. 1), by Oussama Zayene

The collection contains offline signature samples. The signatures were collected under the supervision of Bryan Found and Doug Rogers in the years 2002 and 2006, respectively. The images were scanned at 600 dpi resolution and cropped at the Netherlands Forensic Institute for the purpose of the 4NSigComp2010 signature verification competition.

Sample signatures from the 4NSigComp2010 dataset.

The signature collection for training contains 209 images: 9 reference signatures by the same writer "A" and 200 questioned signatures. The 200 questioned signatures comprise 76 genuine signatures written by the reference writer in his/her normal signature style; 104 simulated/forged signatures (written by 27 forgers freehand copying the signature characteristics of the reference writer); and 20 disguised signatures written by the reference writer. The disguise process comprises an attempt by the reference writer to purposefully alter his/her signature in order to avoid being identified or to deny writing the signature. The simulation/forgery process comprises an attempt by a writer to imitate the signature characteristics of the genuine author.

The signature collection for testing contains 125 signatures: 25 reference signatures by the same writer "B" and 100 questioned signatures. The 100 questioned signatures comprise 3 genuine signatures written by the reference writer in his/her normal signature style; 90 simulated signatures (written by 34 forgers freehand copying the signature characteristics of the reference writer); and 7 disguised signatures written by the reference writer. All writings were made using the same make of ballpoint pen and the same make of paper.

Any use of these data must cite the following reference.

M. Liwicki, C. E. van den Heuvel, B. Found, and M. I. Malik, "Forensic signature verification competition 4NSigComp2010: Detection of simulated and disguised signatures," in Proc. ICFHR, 2010, pp. 715-720.

ICFHR 2010 Signature Verification Competition (4NSigComp2010), 23-02-2015 (v. 1), by Muhammad Imran Malik
Ground Truth: Writer ID information for the 4NSigComp2010 dataset, 06-03-2015 (v. 1), by Muhammad Imran Malik
Task: Signature Verification, 23-02-2015 (v. 1), by Muhammad Imran Malik

Objectives

The objective of this competition is to allow researchers and practitioners from academia and industry to compare the performance of their signature verification systems on new, unpublished forensic-like datasets (Dutch, Japanese). Skilled forgeries and genuine signatures were collected while writing on paper that, in some cases, was attached to a digitizing tablet. The collected signature data are available in offline format, and some signatures are also available in online format. Participants can choose to compete on the online data only, the offline data only, or both data formats combined.

Similar to the last ICDAR competition, our aim is to compare different signature verification algorithms systematically for the forensic community, with the objective of establishing a benchmark for the performance of such methods (providing new, unpublished forensic-like datasets with authentic and skilled forgeries in both on- and offline format).

Background

The Forensic Handwriting Examiner (FHE) weighs the likelihood of the observations given (at least) two hypotheses:

H1: The questioned signature is an authentic signature of the reference writer.
H2: The questioned signature is written by a writer other than the reference writer.

The interpretation of the observed similarities/differences in signature analysis is not as straightforward as in other forensic disciplines such as DNA or fingerprint evidence, because signatures are the product of a behavioral process that can be manipulated by the reference writer himself or by another person. In this competition, we ask participants to produce two probability scores: the probability of observing the evidence (e.g. a certain similarity score) given that H1 is true, and the probability of observing the evidence given that H2 is true. Only cases of H2 in which the forger is not the reference writer occur in this competition. This continues the successful competitions held at ICDAR 2009 and 2011.
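
As a minimal illustration of how the two requested scores relate, the Python sketch below combines them into a likelihood ratio, the quantity an FHE would typically report; the score values are hypothetical placeholders, not from the competition data.

    # Minimal sketch: combining the two required probability scores into a
    # likelihood ratio (LR). The numeric values below are hypothetical.
    def likelihood_ratio(p_evidence_given_h1: float, p_evidence_given_h2: float) -> float:
        """LR > 1 supports H1 (authentic); LR < 1 supports H2 (forged)."""
        if p_evidence_given_h2 == 0:
            raise ValueError("P(E|H2) must be non-zero to form a ratio")
        return p_evidence_given_h1 / p_evidence_given_h2

    # Example: a similarity score deemed much more probable under H1.
    lr = likelihood_ratio(0.85, 0.10)
    print(f"LR = {lr:.1f}")  # LR = 8.5 -> evidence favours authenticity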

Signature Verification and Writer Identification Competitions for On- and Offline Skilled Forgeries, 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigDutch (Dutch Signatures: Off-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigJapanese (Japanese Signatures: Off-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task SigJapanese (Japanese Signatures: On-line), 24-02-2015 (v. 1), by Muhammad Imran Malik
Task: Task Wi (Writer Identification & Retrieval), 24-02-2015 (v. 1), by Muhammad Imran Malik

Description:

This collection contains table structure ground truth data (rows, columns, cells etc) for document images containing tables in the UNLV and UW3 datasets.

The ground truth is stored in XML format, recording row and column boundaries, bounding boxes of cells, and additional attributes such as row-spanning and column-spanning cells. Each XML ground truth file has the same basename as the corresponding image in the respective dataset.

These XML files can then be used to generate color-encoded ground truth images in PNG format, which can be used directly by the pixel-accurate benchmarking framework described in [1]. Generating the 16-bit color-encoded ground truth images requires the ground truth XML file and the word bounding box OCR results file. We provide these OCR result files for all images in the dataset; each file has the same name as the basename of the image file in the original dataset.
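
A minimal sketch of reading one such XML file with Python's standard library is shown below. The element and attribute names (Table, Cell, x0, rowSpan, ...) are illustrative assumptions, so consult the released files for the exact schema.

    # Hedged sketch: iterating over tables and cells in one ground truth file.
    # Element/attribute names are assumed for illustration only.
    import xml.etree.ElementTree as ET

    tree = ET.parse("0001.xml")          # same basename as the page image
    for table in tree.getroot().iter("Table"):
        for cell in table.iter("Cell"):
            bbox = tuple(int(cell.get(k)) for k in ("x0", "y0", "x1", "y1"))
            spans = (cell.get("rowSpan", "1"), cell.get("colSpan", "1"))
            print(bbox, spans)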

We used the T-Truth tool, also provided below, to prepare the ground truth information. The tool is easy to use and is described in [1]. We trained a user to operate T-Truth and asked him to prepare the ground truth for the target images from the above datasets. The ground truth for each image is stored in an XML file. The ground truth was then manually validated by another expert using the preview edit mode of T-Truth, and improper ground truth was corrected. Several such iterations were made to ensure the accuracy of the ground truth.

Tables in UNLV dataset:

The original dataset contains 2889 pages of scanned document images from a variety of sources (magazines, newspapers, business letters, annual reports, etc.). The scanned images are provided at 200 and 300 DPI resolution in bitonal, grey and fax formats. Ground truth data provided alongside the original dataset contains manually marked zones; zone types are provided in text format.

Closer examination of the dataset reveals that there are no marked table zones in the fax images, so this subset is not considered here. The grey images are all also present in bitonal format, so we concentrated on bitonal documents at a resolution of 300 DPI for the preparation of ground truth. We selected those images for which table zones have been marked in the ground truth; there are around 427 such images. We provide table structure ground truth for these document images.

Tables in UW3 Dataset:

The original dataset consists of 1600 skew-corrected English document images with manually edited ground-truth of entity bounding boxes. These bounding boxes enclose page frame, text and non-text zones, textlines, and words. The type of each zone (text, math, table, half-tone, ...) is also marked. There are around 120 document images containing at least one marked table zone. We provide table structure ground truth for these document images.

[1] Asif Shahab, Faisal Shafait, Thomas Kieninger and Andreas Dengel, "An open approach towards the benchmarking of table structure recognition systems," in Proc. DAS 2010, pp. 113-120, June 9-11, 2010, Boston, MA, USA.

2019-09-17 Update

New version of the T-Truth utility (file named "t-truth_v20190917.tar.xz") with the following changes:

- Removed dependencies on older versions of Java; the tool is now compatible with the latest version of Java and has been tested on both Ubuntu and Windows 10.
- Updated the UX to make table tagging easier.
- Fixed some bugs and removed some annoying features such as "Evaluate Cells".
- Added a new feature to mark the table flow: we introduce the terms horizontal tables (where data headers are aligned to the left) and vertical tables (where data headers are on top).

Table Ground Truth for the UW3 and UNLV datasets, 24-06-2014 (v. 1), by Prof. Dr. Faisal Shafait
Ground Truth: Table structure and OCR GT dataset for UW3 and UNLV datasets, 24-06-2014 (v. 1), by Asif Shahab

Charts

Complex Text Containers

Tobacco800 Complex Document Image Database and Groundtruth

Tobacco800 Document Image Database

A publicly accessible document image collection with realistic scope and complexity is important to the document image analysis and search community. Tobacco800 is a public subset of the Complex Document Image Processing (CDIP) test collection constructed by the Illinois Institute of Technology, assembled from 42 million pages of documents (in 7 million multi-page TIFF images) released by tobacco companies under the Master Settlement Agreement and originally hosted at UCSF.

Tobacco800, composed of 1290 document images, is a realistic database for document image analysis research, as these documents were collected and scanned using a wide variety of equipment over time. In addition, a significant percentage of Tobacco800 consists of consecutively numbered multi-page business documents, making it a valuable testbed for various content-based document image retrieval approaches. Resolutions of documents in Tobacco800 vary significantly, from 150 to 300 DPI, and the dimensions of images range from 1200 by 1600 to 2500 by 3200 pixels.

Please include the following reference(s) when citing the Tobacco800 database:

D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, "Building a test collection for complex document information processing," in Proc. 29th Annual Int. ACM SIGIR Conference (SIGIR 2006), pp. 665-666, 2006.

G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, The Complex Document Image Processing (CDIP) test collection project, Illinois Institute of Technology, 2006. http://ir.iit.edu/projects/CDIP.html

The Legacy Tobacco Document Library (LTDL), University of California, San Francisco, 2007. http://legacy.library.ucsf.edu/

Overview of Tobacco800 Groundtruth v2.0

The groundtruth of the Tobacco800 document image database was created by the Language and Media Processing Laboratory, University of Maryland. This new release includes the groundtruth information on both signatures and logos in this large complex document image collection.

In addition to the location and dimensions of each visual entity, the XML groundtruth v2.0 contains the true identity of each signature instance, enabling evaluation of signature matching and authorship attribution.
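
The hedged Python sketch below illustrates pulling signature and logo zones out of one groundtruth file. The GEDI-style element and attribute names (DL_ZONE, gedi_type, col, row, width, height, AuthorID) are assumptions to be verified against the released XML.

    # Hedged sketch: extracting signature/logo bounding boxes from one
    # groundtruth file. All element/attribute names are assumed.
    import xml.etree.ElementTree as ET

    root = ET.parse("doc.xml").getroot()
    for zone in root.iter("DL_ZONE"):
        kind = zone.get("gedi_type")          # e.g. "DLSignature" or "DLLogo"
        if kind in ("DLSignature", "DLLogo"):
            x, y = int(zone.get("col")), int(zone.get("row"))
            w, h = int(zone.get("width")), int(zone.get("height"))
            print(kind, zone.get("AuthorID"), (x, y, w, h))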

Please include the following reference(s) for the Tobacco800 groundtruth:

Guangyu Zhu, Yefeng Zheng, David Doermann, and Stefan Jaeger, "Multi-scale Structural Saliency for Signature Detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2007), pp. 1-8, 2007.

Guangyu Zhu and David Doermann, "Automatic Document Logo Detection," in Proc. 9th Int. Conf. Document Analysis and Recognition (ICDAR 2007), pp. 864-868, 2007.

The BibTex entries:

@inproceedings{SignatureDetection-CVPR07,
  AUTHOR    = {Guangyu Zhu and Yefeng Zheng and David Doermann and Stefan Jaeger},
  TITLE     = {Multi-scale Structural Saliency for Signature Detection},
  BOOKTITLE = {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2007)},
  PAGES     = {1--8},
  YEAR      = {2007},
}

@inproceedings{LogoDetection-ICDAR07,
  AUTHOR    = {Guangyu Zhu and David Doermann},
  TITLE     = {Automatic Document Logo Detection},
  BOOKTITLE = {Proc. 9th Int. Conf. Document Analysis and Recognition (ICDAR 2007)},
  PAGES     = {864--868},
  YEAR      = {2007},
}

Tobacco 800 Dataset, 13-09-2019 (v. 1), by David Doermann

Scene Text

Figure 1. Typical images from MSRA-TD500. The red rectangles indicate text regions labelled as difficult (due to blur or occlusion).

Please check the Task page to download the dataset.

The MSRA Text Detection 500 Database (MSRA-TD500) was collected and released publicly as a benchmark for evaluating text detection algorithms, with the purpose of tracking recent progress in text detection in natural images, especially advances in detecting text of arbitrary orientations.

MSRA-TD500 contains 500 natural images, taken in indoor (office and mall) and outdoor (street) scenes using a pocket camera. The indoor images are mainly signs, doorplates and caution plates, while the outdoor images are mostly guide boards and billboards against complex backgrounds. The resolutions of the images vary from 1296x864 to 1920x1280.

The dataset is challenging because of both the diversity of the text and the complexity of the backgrounds. The text may be in different languages (Chinese, English, or a mixture of both), fonts, sizes, colors and orientations. The background may contain vegetation (e.g. trees and bushes) and repeated patterns (e.g. windows and bricks), which can be hard to distinguish from text.

The dataset is divided into two parts: a training set and a test set. The training set contains 300 images randomly selected from the original dataset; the remaining 200 images constitute the test set. All images in this dataset are fully annotated. The basic annotation unit is the text line (see Figure 1) rather than the word used in the ICDAR datasets, because it is hard to partition Chinese text lines into individual words based on their spacing; even for English text lines, word partitioning is non-trivial without high-level information.
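
A small Python reader for one annotation file is sketched below, assuming the commonly described per-line format of index, difficulty flag, x, y, width, height, and rotation angle in radians; the file name is a placeholder, so verify both against the released ground truth files.

    # Sketch: parsing one MSRA-TD500 annotation file (assumed format).
    def read_gt(path):
        boxes = []
        with open(path) as f:
            for line in f:
                idx, difficult, x, y, w, h, theta = line.split()
                boxes.append({
                    "difficult": difficult == "1",   # red rectangles in Figure 1
                    "bbox": (float(x), float(y), float(w), float(h)),
                    "angle": float(theta),           # rotation about the box centre
                })
        return boxes

    for box in read_gt("IMG_0059.gt"):
        print(box)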

MSRA Text Detection 500 Database, 13-01-2014 (v. 1), by Cong Yao
Ground Truth: GT for MSRA Text Detection 500 Database, 13-01-2014 (v. 1), by Cong Yao
Task: Text Detection in Natural Images, 13-01-2014 (v. 1), by Cong Yao

Electronic Documents

Tables

SOURCE WEBSITES:

Statistics Canada, http://www.statcan.gc.ca/start-debut-eng.html
The World Bank, http://www.worldbank.org/
Statistics Norway, https://www.ssb.no/en/
Statistics Finland, http://www.stat.fi/index_en.html
US Department of Justice, https://www.justice.gov/
Geohive, http://www.geohive.com/
US Energy Information Administration, https://www.eia.gov/
US Census Bureau, http://www.census.gov/

DATA COLLECTION: About 1000 tables were collected from international statistical websites by DocLab graduate students in 2009-2010. A perceptually random subset of 200 of these tables was converted (mostly from HTML) via Excel into CSV files and stored at DocLab. The original HTML files were not retained, and their URLs were kept somewhat haphazardly. Many of these tables can still be found by a web search, but others have been modified, corrected, or updated on the original sites. The dataset now consists only of the 200 CSV file representations of the 200 web tables.

GROUND TRUTH: The four critical cells (top-left and bottom-right of the stub header, and top-left and bottom-right of the data region) were entered with an interactive tool (VeriClick) in 2011. Several GT errors were subsequently found during segmentation experiments and corrected. Researchers may still disagree on 2-3 tables as to whether a particular row or column (with unusual contents or units) belongs to the data region. The GT for all 200 tables is a single CSV file, which allows adding to it and verifying the results of a header segmentation experiment.
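
The Python sketch below illustrates how four critical cells delimit the stub header and data region of a table loaded from its CSV file. The coordinate values and file name are hypothetical placeholders, since the exact layout of the GT file may differ.

    # Illustration: using four critical (row, col) cells to slice a table grid.
    import csv

    with open("table_042.csv") as f:          # placeholder file name
        grid = list(csv.reader(f))

    stub_tl, stub_br = (0, 0), (1, 0)          # stub header: top-left .. bottom-right
    data_tl, data_br = (2, 1), (len(grid) - 1, len(grid[0]) - 1)

    data = [row[data_tl[1]:data_br[1] + 1] for row in grid[data_tl[0]:data_br[0] + 1]]
    print(f"data region: {len(data)} rows x {len(data[0])} columns")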

 

TANGO-DocLab web tables from international statistical sites, 16-03-2016 (v. 1), by George Nagy
Ground Truth: Critical cells for table header segmentation, 16-03-2016 (v. 1), by George Nagy
Task: Table segmentation, 16-03-2016 (v. 1), by George Nagy

Handwritten Documents

The dataset provides more than 11,000 expressions handwritten by hundreds of writers from different countries, merging the datasets from 4 CROHME competitions. Writers were asked to copy printed expressions from a corpus of expressions. The corpus was designed to cover the diversity required by the different tasks and was chosen from an existing math corpus and from expressions embedded in Wikipedia pages. Different devices were used (different digital pen technologies, whiteboard input devices, tablets with touch-sensitive screens), so different scales and resolutions appear in the data. The dataset provides only the on-line signal.

In the last competition, CROHME 2013, the test part is completely original, and the training part draws on 5 existing datasets:

MathBrush (University of Waterloo), HAMEX (University of Nantes), MfrDB (Czech Technical University), ExpressMatch (University of Sao Paulo), and the KAIST dataset.

In CROHME 2014, a new test set was created with 987 new expressions, and 2 new tasks were added: isolated symbol recognition and matrix recognition. Training and test files, as well as the evaluation scripts for these new tasks, are provided. For the isolated symbol dataset, elements are extracted from full expressions in the existing datasets, which also include segmentation errors. For the matrix recognition task, 380 new expressions have been labelled and split into training and test sets.

Furthermore, 6 participants of the 2012 competition provide their recognized expressions for the 2012 test part. These data allow research on decision fusion and evaluation metrics.
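
The CROHME on-line signal is distributed as InkML; the minimal Python sketch below extracts the strokes of one expression using only the standard library. The file name is a placeholder, and some files carry an extra time coordinate per point, so only the first two values are kept.

    # Sketch: reading the on-line strokes from one CROHME InkML file.
    import xml.etree.ElementTree as ET

    NS = {"ink": "http://www.w3.org/2003/InkML"}
    root = ET.parse("formula_001.inkml").getroot()

    strokes = []
    for trace in root.findall("ink:trace", NS):
        pts = [tuple(float(v) for v in p.split()[:2])
               for p in trace.text.strip().split(",")]
        strokes.append(pts)

    print(f"{len(strokes)} strokes, first point: {strokes[0][0]}")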

ICFHR 2014 CROHME: Fourth International Competition on Recognition of Online Handwritten Mathematical Expressions, 16-02-2015 (v. 2), by Harold Mouchère
Ground Truth: CROHME, 16-02-2015 (v. 1), by Harold Mouchère
Task: Mathematical Expression Recognition, 16-02-2015 (v. 1), by Harold Mouchère
Task: Isolated Mathematical Symbol Recognition, 16-02-2015 (v. 1), by Harold Mouchère
Task: Matrix Recognition, 16-02-2015 (v. 1), by Harold Mouchère

Off-line

This is the dataset of the ICDAR 2013 - Gender Identification from Handwriting competition. If you use this database, please consider citing it as in [1].

This dataset is a subset of the QUWI dataset [2]. In total, 475 writers each produced 4 handwritten documents: the first page contains an Arabic handwritten text that varies from one writer to another, the second page contains an Arabic handwritten text that is the same for all writers, the third page contains an English handwritten text that varies from one writer to another, and the fourth page contains an English handwritten text that is the same for all writers.

Images were acquired using an EPSON GT-S80 scanner at 300 DPI resolution and provided in uncompressed JPG format. The training set consists of the first 282 writers, for whom the genders are provided. Participants were asked to predict the gender of the remaining 193 writers.

In addition to the images, features extracted from the data were also provided in order to make the competition accessible to people without image processing skills. Those features are described in [3].
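
A hedged sketch of the intended workflow with the pre-extracted features follows: train on the 282 labelled writers and predict the remaining 193. The file names and column layout (one row per document, last column holding the gender label) are hypothetical and must be adapted to the released feature files.

    # Hedged sketch: gender prediction from the provided feature vectors.
    # File names and column layout are assumptions.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    train = np.loadtxt("train_features.csv", delimiter=",")
    test = np.loadtxt("test_features.csv", delimiter=",")

    X_train, y_train = train[:, :-1], train[:, -1]   # last column: gender label
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.predict(test)[:10])                    # predicted genders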

[1] Hassaïne, Abdelâali, et al. "ICDAR 2013 Competition on Gender Prediction from Handwriting." Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013.

[2] Hassaïne, Abdelaali, et al. "The ICDAR2011 Arabic writer identification contest." Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011.

[3] Al Maadeed, Somaya, and Abdelaali Hassaine. "Automatic prediction of age, gender, and nationality in offline handwriting." EURASIP Journal on Image and Video Processing 2014.1 (2014): 1-10.

 

ICDAR 2013 - Gender Identification Competition Dataset, 25-01-2015 (v. 1), by Abdelaali Hassaine
Ground Truth: Genders of all writers, 25-01-2015 (v. 1), by Abdelaali Hassaine
Task: Gender identification using all documents, 25-01-2015 (v. 1), by Abdelaali Hassaine
ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records, 29-08-2019 (v. 1), by Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki
Ground Truth: Handwritten Character Recognition on extracted textlines, 29-08-2019 (v. 1), by Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki
Ground Truth: Layout Analysis on structured historical document images, 29-08-2019 (v. 1), by Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki
Ground Truth: Complete, integrated textline detection and recognition on a large dataset, 29-08-2019 (v. 1), by Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki
Task: Handwritten Character Recognition on extracted textlines, 29-08-2019 (v. 1), by Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki
Task: Layout Analysis on structured historical document images, 29-08-2019 (v. 1), by Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki
Task: Complete, integrated textline detection and recognition on a large dataset, 29-08-2019 (v. 1), by Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki

Before starting handwritten data collection, the Malayalam character classes were decided based on the unique orthographic structures of the Malayalam script. 85 Malayalam character classes, representing vowels, consonants, half-consonants, vowel modifiers, consonant modifiers and conjunct characters that are frequently used in writing, were considered for database creation. For collecting character images, the writers were instructed to write the considered Malayalam character classes on pages five times using ballpoint pens, paying attention to the space between each written character. No restriction was placed on the type or quality of the paper or the ballpoint pen used for writing.

The handwritten data were collected from 77 native Malayalam writers (60 female and 17 male) between 20 and 55 years of age, all of whom hold at least a graduate qualification. The learning and testing data are divided by writer rather than by collected image. Of the 77 writers, the handwritten data from 59 were used to create the learning (training and validation) data, while the data from the remaining 18 were used to create the testing data.

A fast global minimization algorithm for active contour models (ACM-FGM) was employed to detect the character objects in the collected document images. Otsu's global image thresholding algorithm was used to convert the resulting images to binary representations. Each image was then resized to 32x32 pixels.
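
The last two preprocessing steps can be sketched as follows in Python with OpenCV; the ACM-FGM detection step is omitted here, and the file name is a placeholder.

    # Sketch: Otsu binarization followed by resizing to 32x32, as described above.
    import cv2

    img = cv2.imread("char_0001.png", cv2.IMREAD_GRAYSCALE)   # placeholder name
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    sample = cv2.resize(binary, (32, 32), interpolation=cv2.INTER_AREA)
    cv2.imwrite("char_0001_32x32.png", sample)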

 

Malayalam Character Image Database, 21-07-2019 (v. 1), by Manjusha K
Ground Truth: Labels for the Character Images, 21-07-2019 (v. 1), by Manjusha K
Task: Character Recognition for Malayalam Document Images, 21-07-2019 (v. 1), by Manjusha K

This dataset contains 15 historical and old manuscript images collected from the historical records at the Documents and Old Manuscripts Treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation, including bleed-through, faded ink, and blur. The dataset is the first in a series that provides document images and their ground truth as a contribution to the document image analysis and recognition community. It is planned to enlarge the dataset and to extend it to document understanding tasks in the near future.
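
Binarization of such degraded pages is commonly approached with locally adaptive thresholding. The sketch below is a generic Sauvola baseline in Python using scikit-image, not the phase-based method of [Ziaei2012], and the file name is a placeholder.

    # Generic adaptive-binarization baseline for degraded historical pages.
    from skimage import io, filters, img_as_ubyte

    page = io.imread("phibd_page_01.png", as_gray=True)   # placeholder name
    thresh = filters.threshold_sauvola(page, window_size=25)
    binary = page > thresh                    # True = background (ink is darker)
    io.imsave("phibd_page_01_bin.png", img_as_ubyte(binary))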

References

[Ziaei2013] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, and Mohamed Cheriet, "Persian historical document dataset with introduction to PhaseGT: A ground truthing application," to be submitted to ICDAR'13.

[Ziaei2012] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, and Mohamed Cheriet, "Historical document binarization based on phase information of images," in ACCV'12 Workshop on e-Heritage, Daejeon, South Korea, Nov. 5-10, 2012.

[Farrahi2009] Reza Farrahi Moghaddam and Mohamed Cheriet, "RSLDI: Restoration of single-sided low-quality document images," Pattern Recognition, vol. 42, no. 12, pp. 3355-3364, 2009. DOI: 10.1016/j.patcog.2008.10.021

[Farrahi2010] Reza Farrahi Moghaddam and Mohamed Cheriet, "A multi-scale framework for adaptive binarization of degraded document images," Pattern Recognition, vol. 43, no. 6, pp. 2186-2198, 2010. DOI: 10.1016/j.patcog.2009.12.024

[Cheriet2012] Mohamed Cheriet, Reza Farrahi Moghaddam, and Rachid Hedjam, "A learning framework for the optimization and automation of document binarization methods," Computer Vision and Image Understanding, 2012 (accepted). DOI: 10.1016/j.cviu.2012.11.003

Persian Heritage Image Binarization Dataset (PHIBD 2012), 18-07-2017 (v. 1), by Hossein Ziaie Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet
Ground Truth: Binarized images for PHIBD 2012 dataset, 03-08-2018 (v. 1), by Hossein Ziaie Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet
Task: Binarization of PHIBD 2012 dataset, 03-08-2018 (v. 1), by Hossein Ziaie Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet

On-line

ICFHR 2016 Competition on Recognition of On-line Handwritten Mathematical Expressions, 18-07-2017 (v. 1), by Harold Mouchère
Ground Truth: CROHME, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Formulas, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Symbols, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Structure, 18-07-2017 (v. 1), by Harold Mouchère
Task: CROHME2016-Matrices, 18-07-2017 (v. 1), by Harold Mouchère

On-line and Off-line

Keywords

Forensic analysis, signatures, handwriting, off-line, on-line, verification, identification, evaluation, likelihood ratios

Description

 

Objectives

The objective of this competition is to allow researchers and practitioners from academia and industry to compare the performance of their signature verification systems on new, unpublished forensic-like datasets (Bengali, German, and Italian). Skilled forgeries and genuine signatures were collected while writing on paper that, in some cases, was attached to a digitizing tablet. The collected signature data are available in offline format, and some signatures are also available in online format. Participants can choose to compete on the online data only, the offline data only, or both data formats combined.

Similar to the last ICDAR competition, our aim is to compare different signature verification algorithms systematically for the forensic community, with the objective of establishing a benchmark for the performance of such methods (providing new, unpublished forensic-like datasets with authentic and skilled forgeries in both on- and offline format).

Background

The Forensic Handwriting Examiner (FHE) weighs the likelihood of the observations given (at least) two hypotheses:

H1: The questioned signature is an authentic signature of the reference writer.
H2: The questioned signature is written by a writer other than the reference writer.

The interpretation of the observed similarities/differences in signature analysis is not as straightforward as in other forensic disciplines such as DNA or fingerprint evidence, because signatures are the product of a behavioral process that can be manipulated by the reference writer himself or by another person. In this competition, we ask participants to produce two probability scores: the probability of observing the evidence (e.g. a certain similarity score) given that H1 is true, and the probability of observing the evidence given that H2 is true. Only cases of H2 in which the forger is not the reference writer occur in this competition. This continues the successful competitions held at ICDAR 2009, 2011, and 2013.

ICDAR2015 Competition on Signature Verification and Writer Identification for On- and Off-line Skilled Forgeries, 04-12-2017 (v. 1), by Muhammad Imran Malik

Mixed Content Documents