VML-HD: The Historical Arabic Documents Dataset for Recognition Systems (VML-HD)

2018-06-12 (v. 1)

Majeed Kassis

Ben-Gurion University



This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
You can cite this dataset as: Majeed Kassis, VML-HD: The Historical Arabic Documents Dataset for Recognition Systems (VML-HD), 1, ID: VML-HD_1, URL: http://tc11.cvc.uab.es/datasets/VML-HD_1

Dataset Information

arabic, subwords, segmentation-based, segmentation-free, word-spotting, recognition


VML-HD is a new database of handwritten Arabic script. It is based on five books written by different writers from the years 1088-1451. We took 668 pages from these five books and fully annotated them at the sub-word level. For each page, we manually placed bounding boxes around the individual sub-words and annotated their character sequences. The database contains 159,149 sub-word occurrences, comprising 326,289 characters, drawn from a vocabulary of 5,509 sub-word forms. It is described in detail and is designed for training and testing recognition systems for handwritten Arabic sub-words. The database is available for research purposes, and we encourage researchers to develop and test new methods with it.

For tutorials on the annotation format, extraction, and segmentation of this dataset, please refer here. For more information about the dataset, click here; to read about the basic tracks for conducting research on it, click here.

Technical Details

Hadara Format

In the database, each of the five manuscripts is stored in its own folder. Inside each manuscript folder, a file named docElementsXml.ashx contains the complete annotation information for that manuscript.

The HadaraXML format has two main tags: the image tag and the content tag. The image tag holds the bounding box information for each manuscript and its corresponding pages. The content tag holds the labeling information for each of the bounding boxes found under the image tag.

Image Tag

  • <image id="781" src="0003-1"></image>
    • src denotes the image file name. The image format is PNG.
    • all annotated subwords of the image named by src are found under this tag.
  • <zone id="113804"> </zone>
    • each zone contains the bounding box of one subword and is identified by its id.
  • <polygon></polygon>
    • each zone has one polygon inside it.
    • each polygon holds the four corner coordinates of the bounding box, each denoted by a point tag.
  • <point y="324" x="764" />
    • y denotes the row coordinate of the subword bounding box.
    • x denotes the column coordinate of the subword bounding box.
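
The four corner points of a zone can be reduced to a top-left corner plus a width and a height. The following is a minimal sketch using Python's standard library; the helper name `zone_to_bbox` is hypothetical, and computing width and height as the difference between the extreme coordinates is an assumption about how the box extent should be measured.

```python
import xml.etree.ElementTree as ET

# One zone copied from the dataset's example annotation.
ZONE_XML = """
<zone id="113804">
  <polygon>
    <point y="324" x="764" />
    <point y="324" x="821" />
    <point y="391" x="821" />
    <point y="391" x="764" />
  </polygon>
</zone>
"""


def zone_to_bbox(zone):
    """Return (x, y, width, height) spanned by the zone's polygon points.

    Hypothetical helper: width and height are taken as max - min,
    which is an assumption about the intended box extent.
    """
    points = zone.findall("./polygon/point")
    xs = [int(p.get("x")) for p in points]  # column coordinates
    ys = [int(p.get("y")) for p in points]  # row coordinates
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


zone = ET.fromstring(ZONE_XML)
bbox = zone_to_bbox(zone)
print(zone.get("id"), bbox)
```

Taking the minimum and maximum over all points, rather than relying on point order, keeps the sketch independent of which corner is listed first.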

Content Tag

  • <content image_id="781"></content>
    • each image_id value corresponds to a value of id for the image tag mentioned above.
    • each bounding box found under the image tag with the same id value has one segment containing its transcription.
  • <section type="page"></section>
    • inside this section, all segments can be found.
  • <segment id="113804" ref_id="113804">
    • each segment has a ref_id that corresponds to one zone id value.
    • inside this tag, the transcription information is found.
  • <transcriptionInfo id="113804"/>
    • together with this tag, a transcription tag holds the labeling information of the bounding box (in the example XML it follows transcriptionInfo inside the same segment).

Example XML

<HADARA>
  <document nbpages="196" id="61">
    <image id="781" src="0003-1">
      <page>
        <zone id="113804">
          <polygon>
            <point y="324" x="764" />
            <point y="324" x="821" />
            <point y="391" x="821" />
            <point y="391" x="764" />
          </polygon>
        </zone>
        <zone id="113805">
          <polygon>
            <point y="332" x="831" />
            <point y="332" x="839" />
            <point y="374" x="839" />
            <point y="374" x="831" />
          </polygon>
        </zone>
        . . .
        <zone id="113808">
          <polygon>
            <point y="318" x="717" />
            <point y="318" x="744" />
            <point y="384" x="744" />
            <point y="384" x="717" />
          </polygon>
        </zone>
      </page>
    </image>
    <content image_id="781">
      <section type="page">
        <segment id="113804" ref_id="113804">
          <transcriptionInfo id="113804"/>
          <transcription>لم</transcription>
        </segment>
        <segment id="113805" ref_id="113805">
          <transcriptionInfo id="113805"/>
          <transcription>ا</transcription>
        </segment>
        . . .
        <segment id="113808" ref_id="113808">
          <transcriptionInfo id="113808"/>
          <transcription>ذ</transcription>
        </segment>
      </section>
    </content>
  </document>
</HADARA>
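
The link between zones and segments described above (segment ref_id pointing at a zone id, content image_id pointing at an image id) can be sketched in Python as follows. This is an illustration under stated assumptions, not the dataset's official tooling: the function name `read_hadara` and the record layout are hypothetical, and the sample string is a shortened copy of the example with the elided zones left out.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example annotation (two zones, two segments).
HADARA_SAMPLE = """<HADARA>
  <document nbpages="196" id="61">
    <image id="781" src="0003-1">
      <page>
        <zone id="113804"><polygon>
          <point y="324" x="764" /><point y="324" x="821" />
          <point y="391" x="821" /><point y="391" x="764" />
        </polygon></zone>
        <zone id="113805"><polygon>
          <point y="332" x="831" /><point y="332" x="839" />
          <point y="374" x="839" /><point y="374" x="831" />
        </polygon></zone>
      </page>
    </image>
    <content image_id="781">
      <section type="page">
        <segment id="113804" ref_id="113804">
          <transcriptionInfo id="113804"/>
          <transcription>لم</transcription>
        </segment>
        <segment id="113805" ref_id="113805">
          <transcriptionInfo id="113805"/>
          <transcription>ا</transcription>
        </segment>
      </section>
    </content>
  </document>
</HADARA>"""


def read_hadara(root):
    """Join each segment to its zone and return one record per subword.

    Hypothetical helper; the record keys are chosen for illustration.
    """
    records = []
    for image in root.iter("image"):
        # Polygon points of every zone on this page, keyed by zone id.
        zones = {
            zone.get("id"): [(int(p.get("x")), int(p.get("y")))
                             for p in zone.iter("point")]
            for zone in image.iter("zone")
        }
        image_id = image.get("id")
        for content in root.iter("content"):
            if content.get("image_id") != image_id:
                continue
            for segment in content.iter("segment"):
                points = zones.get(segment.get("ref_id"))
                if points is None:
                    continue  # segment without a matching zone
                records.append({
                    "page": image.get("src"),
                    "zone_id": segment.get("ref_id"),
                    "points": points,
                    # ".//" finds the transcription whether it is a direct
                    # child of the segment or nested one level deeper.
                    "text": segment.findtext(".//transcription", default=""),
                })
    return records


root = ET.fromstring(HADARA_SAMPLE)
records = read_hadara(root)
for record in records:
    print(record["page"], record["zone_id"], record["points"][0], record["text"])
```

For a full manuscript file, `ET.parse(path).getroot()` would replace `ET.fromstring`; whether docElementsXml.ashx parses without further preprocessing has not been verified here.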

XML Format

The XML format is simple. For each page PNG in a manuscript, there is one XML file with the same name.

It is essentially an array of elements, with the information for each subword stored under one element.
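
Because each page image has an XML file of the same name, pages and annotations can be paired by file stem. The sketch below assumes lowercase `.png` and `.xml` extensions and uses a hypothetical helper name, `pair_pages`; the demo folder and its file names are made up for illustration.

```python
from pathlib import Path
import tempfile


def pair_pages(manuscript_dir):
    """Return (png_path, xml_path) pairs that share a file stem.

    Hypothetical helper; assumes lowercase .png and .xml extensions.
    """
    folder = Path(manuscript_dir)
    pairs = []
    for png in sorted(folder.glob("*.png")):
        xml = png.with_suffix(".xml")
        if xml.exists():  # skip pages that have no annotation file
            pairs.append((png, xml))
    return pairs


# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0003-1.png", "0003-1.xml", "0003-2.png"):
        (Path(tmp) / name).touch()
    demo_pairs = [(p.name, x.name) for p, x in pair_pages(tmp)]

print(demo_pairs)
```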

Important Tags

  • <ArrayOfDocumentElement>
    • main tag, contains all bounding box information.
  • <DocumentElement>
    • one bounding box is stored under one element.
  • <X>764</X>
    • top-left column coordinate.
  • <Y>324</Y>
    • top-left row coordinate.
  • <Width>57</Width>
    • width of bounding box.
  • <Height>67</Height>
    • height of bounding box.
  • <Transcript>لم</Transcript>
    • transcription value for the bounding box.
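
Given X, Y, Width, and Height, a subword patch can be cut out of a page stored in row-major order. The following sketch uses a small synthetic page instead of a real image so that it runs without any imaging library; the `crop` helper is hypothetical, and treating the right and bottom edges as exclusive is an assumption, since the format description does not say whether they are inclusive.

```python
def crop(image_rows, x, y, width, height):
    """Cut a box out of a row-major image given as a list of pixel rows.

    Hypothetical helper: rows y .. y+height-1 and columns x .. x+width-1
    are kept, i.e. the right and bottom edges are treated as exclusive.
    """
    return [row[x:x + width] for row in image_rows[y:y + height]]


# Tiny synthetic page of 6 rows by 8 columns; each pixel stores its own
# (row, column) so the result is easy to check by eye.
page = [[(r, c) for c in range(8)] for r in range(6)]

patch = crop(page, x=2, y=1, width=3, height=2)
for row in patch:
    print(row)
```

With an array library, the same selection would be the slice `page[y:y + height, x:x + width]`, with rows indexed first.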

Example XML

<ArrayOfDocumentElement>
  <DocumentElement>
    <ID>113804</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>764</X>
    <Y>324</Y>
    <Width>57</Width>
    <Height>67</Height>
    <Transcript>لم</Transcript>
    <Threshold>100</Threshold>
    <OriginX>806</OriginX>
    <OriginY>377</OriginY>
  </DocumentElement>
  <DocumentElement>
    <ID>113805</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>831</X>
    <Y>332</Y>
    <Width>8</Width>
    <Height>42</Height>
    <Transcript>ا</Transcript>
    <Threshold>100</Threshold>
    <OriginX>835</OriginX>
    <OriginY>350</OriginY>
  </DocumentElement>
  . . .
  <DocumentElement>
    <ID>113808</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>717</X>
    <Y>318</Y>
    <Width>27</Width>
    <Height>66</Height>
    <Transcript>ذ</Transcript>
    <Threshold>100</Threshold>
    <OriginX>736</OriginX>
    <OriginY>375</OriginY>
  </DocumentElement>
</ArrayOfDocumentElement>
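
A per-page file in this format can be read with Python's standard library, as in the minimal sketch below. Two points are assumptions rather than facts from the dataset: the sample declares the `xsi` namespace on the root element, because a conforming parser rejects `xsi:nil` without it and the excerpt above does not show the declaration; and the function name `read_elements` and its record layout are hypothetical. If the real files also declare a default namespace, the tag names passed to `iter` and `findtext` would need that namespace prefix.

```python
import xml.etree.ElementTree as ET

# Two elements copied from the example. The xmlns:xsi declaration is an
# assumption added so that the xsi:nil attribute parses.
SAMPLE = """<ArrayOfDocumentElement
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <DocumentElement>
    <ID>113804</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>764</X> <Y>324</Y> <Width>57</Width> <Height>67</Height>
    <Transcript>لم</Transcript>
    <Threshold>100</Threshold>
    <OriginX>806</OriginX> <OriginY>377</OriginY>
  </DocumentElement>
  <DocumentElement>
    <ID>113805</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>831</X> <Y>332</Y> <Width>8</Width> <Height>42</Height>
    <Transcript>ا</Transcript>
    <Threshold>100</Threshold>
    <OriginX>835</OriginX> <OriginY>350</OriginY>
  </DocumentElement>
</ArrayOfDocumentElement>"""


def read_elements(root):
    """Return one dict per DocumentElement (hypothetical record layout)."""
    elements = []
    for node in root.iter("DocumentElement"):
        elements.append({
            "id": node.findtext("ID"),
            "type": node.findtext("ElementType"),
            # (X, Y, Width, Height) as integers.
            "box": tuple(int(node.findtext(tag))
                         for tag in ("X", "Y", "Width", "Height")),
            "text": node.findtext("Transcript", default=""),
        })
    return elements


elements = read_elements(ET.fromstring(SAMPLE))
for element in elements:
    print(element["id"], element["box"], element["text"])
```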


