VML-HD: The Historical Arabic Documents Dataset for Recognition Systems (VML-HD)

Ground Truth

Subword level annotation for the VML-HD Dataset

2018-06-12 (v. 1)

Contact author

Majeed Kassis

Ben-Gurion University



This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.


recognition, word spotting, detection, subword


Hadara Format

In the database, each of the five manuscripts is stored in its own folder. Inside each manuscript folder, a file named docElementsXml.ashx contains the complete annotation information for that manuscript.

The HadaraXML format contains two main tags: the image tag and the content tag. The image tag holds the bounding box information for each manuscript and its pages. The content tag holds the labeling information for each of the bounding boxes found under the image tag.

Image Tag

  • <image id="781" src="0003-1"></image>
    • src denotes the image file name. The image format is PNG.
    • all annotated subwords of the image src are found under this tag.
  • <zone id="113804"> </zone>
    • each zone contains the bounding box of one subword, identified by id.
  • <polygon></polygon>
    • each zone contains exactly one polygon.
    • each polygon contains the four corner coordinates of the bounding box, each given by a point tag.
  • <point y="324" x="764" />
    • y denotes the row coordinate of the subword bounding box.
    • x denotes the column coordinate of the subword bounding box.
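
The four point tags of a polygon can be reduced to a single box. The following is a minimal sketch, assuming the points are the corners of an axis-aligned rectangle as in the example on this page; the helper name polygon_to_box is hypothetical and not part of the dataset.

```python
import xml.etree.ElementTree as ET


def polygon_to_box(polygon):
    """Return (x, y, width, height) for a <polygon> holding <point> tags."""
    points = polygon.findall("point")
    xs = [int(p.get("x")) for p in points]  # column coordinates
    ys = [int(p.get("y")) for p in points]  # row coordinates
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


# Zone 113804, copied from the example on this page.
zone = ET.fromstring(
    '<zone id="113804"><polygon>'
    '<point y="324" x="764" /><point y="324" x="821" />'
    '<point y="391" x="821" /><point y="391" x="764" />'
    "</polygon></zone>"
)
box = polygon_to_box(zone.find("polygon"))
print(box)  # (764, 324, 57, 67)
```

For this zone, the result agrees with the X, Y, Width and Height values listed for element 113804 in the XML format example.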

Content Tag

  • <content image_id="781"></content>
    • each image_id value corresponds to the id value of an image tag described above.
    • each bounding box under the image tag with the same id value has one segment containing its transcription.
  • <section type="page"></section>
    • inside this section, all segments can be found.
  • <segment id="113804" ref_id="113804">
    • each segment has a ref_id that corresponds to one zone id value.
    • inside this tag, the transcription information is found.
  • <transcriptionInfo id="113804"/>
    • this tag is accompanied by a transcription tag that holds the label of the bounding box. In the example below, transcription follows transcriptionInfo inside the same segment.

Example XML

<HADARA>
  <document nbpages="196" id="61">
    <image id="781" src="0003-1">
      <page>
        <zone id="113804">
          <polygon>
            <point y="324" x="764" />
            <point y="324" x="821" />
            <point y="391" x="821" />
            <point y="391" x="764" />
          </polygon>
        </zone>
        <zone id="113805">
          <polygon>
            <point y="332" x="831" />
            <point y="332" x="839" />
            <point y="374" x="839" />
            <point y="374" x="831" />
          </polygon>
        </zone>
        . . .
        <zone id="113808">
          <polygon>
            <point y="318" x="717" />
            <point y="318" x="744" />
            <point y="384" x="744" />
            <point y="384" x="717" />
          </polygon>
        </zone>
      </page>
    </image>
    <content image_id="781">
      <section type="page">
        <segment id="113804" ref_id="113804">
          <transcriptionInfo id="113804"/>
          <transcription>لم</transcription>
        </segment>
        <segment id="113805" ref_id="113805">
          <transcriptionInfo id="113805"/>
          <transcription>ا</transcription>
        </segment>
        . . .
        <segment id="113808" ref_id="113808">
          <transcriptionInfo id="113808"/>
          <transcription>ذ</transcription>
        </segment>
      </section>
    </content>
  </document>
</HADARA>
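
A file in this format can be read by joining each segment to its zone through ref_id. The sketch below uses Python's standard xml.etree.ElementTree on a shortened copy of the example. The function name read_hadara and the record layout are hypothetical, and the code assumes that zone ids are unique within a document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above (two zones, two segments).
SAMPLE = """<HADARA><document nbpages="196" id="61">
<image id="781" src="0003-1"><page>
<zone id="113804"><polygon>
<point y="324" x="764" /><point y="324" x="821" />
<point y="391" x="821" /><point y="391" x="764" />
</polygon></zone>
<zone id="113805"><polygon>
<point y="332" x="831" /><point y="332" x="839" />
<point y="374" x="839" /><point y="374" x="831" />
</polygon></zone>
</page></image>
<content image_id="781"><section type="page">
<segment id="113804" ref_id="113804">
<transcriptionInfo id="113804"/><transcription>لم</transcription>
</segment>
<segment id="113805" ref_id="113805">
<transcriptionInfo id="113805"/><transcription>ا</transcription>
</segment>
</section></content>
</document></HADARA>"""


def read_hadara(root):
    """Return one record per segment: image src, zone id, points, text."""
    # Index every zone by its id, remembering which image it belongs to.
    zones = {}
    for image in root.iter("image"):
        for zone in image.iter("zone"):
            points = [(int(p.get("x")), int(p.get("y")))
                      for p in zone.iter("point")]
            zones[zone.get("id")] = (image.get("src"), points)

    records = []
    for segment in root.iter("segment"):
        ref_id = segment.get("ref_id")
        if ref_id not in zones:
            continue  # segment without a matching zone; skipped here
        src, points = zones[ref_id]
        # Search at any depth, so nesting under transcriptionInfo also works.
        node = segment.find(".//transcription")
        text = node.text if node is not None else None
        records.append({"src": src, "zone_id": ref_id,
                        "points": points, "text": text})
    return records


records = read_hadara(ET.fromstring(SAMPLE))
for record in records:
    print(record["src"], record["zone_id"], record["points"][0], record["text"])
```

For a real annotation file, ET.parse(path).getroot() can replace ET.fromstring(SAMPLE).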

XML Format

The XML format is simple: for each page PNG in a manuscript, there is one XML file with the same name.

Each file is essentially an array of elements, with the information for each subword stored under one element.
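
Because a page image and its annotation share a name, the two can be paired by file stem. This is a minimal sketch, assuming the files sit in the same folder and use the .png and .xml extensions; the function name pair_pages and the folder name in the usage comment are hypothetical.

```python
from pathlib import Path


def pair_pages(folder):
    """Yield (png_path, xml_path) for each page image with a same-named XML file."""
    for png in sorted(Path(folder).glob("*.png")):
        xml = png.with_suffix(".xml")
        if xml.exists():  # images without an annotation file are skipped
            yield png, xml


# Usage, with a placeholder folder name:
# for png, xml in pair_pages("manuscript_folder"):
#     print(png.name, xml.name)
```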

Important Tags

  • <ArrayOfDocumentElement>
    • main tag, contains all bounding box information.
  • <DocumentElement>
    • one bounding box is stored under one element.
  • <X>764</X>
    • top left column coordinate.
  • <Y>324</Y>
    • top left row coordinate.
  • <Width>57</Width>
    • width of bounding box.
  • <Height>67</Height>
    • height of bounding box.
  • <Transcript>لم</Transcript>
    • transcription value for the bounding box.

Example XML

<ArrayOfDocumentElement>
  <DocumentElement>
    <ID>113804</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>764</X>
    <Y>324</Y>
    <Width>57</Width>
    <Height>67</Height>
    <Transcript>لم</Transcript>
    <Threshold>100</Threshold>
    <OriginX>806</OriginX>
    <OriginY>377</OriginY>
  </DocumentElement>
  <DocumentElement>
    <ID>113805</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>831</X>
    <Y>332</Y>
    <Width>8</Width>
    <Height>42</Height>
    <Transcript>ا</Transcript>
    <Threshold>100</Threshold>
    <OriginX>835</OriginX>
    <OriginY>350</OriginY>
  </DocumentElement>
  . . .
  <DocumentElement>
    <ID>113808</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>717</X>
    <Y>318</Y>
    <Width>27</Width>
    <Height>66</Height>
    <Transcript>ذ</Transcript>
    <Threshold>100</Threshold>
    <OriginX>736</OriginX>
    <OriginY>375</OriginY>
  </DocumentElement>
</ArrayOfDocumentElement>
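
A per-page file of this kind can be read with Python's standard xml.etree.ElementTree. The sketch below makes two assumptions that this page does not confirm: the root tag of the real files declares the xsi prefix (added to the sample here so that it parses), and the files use no default namespace. The function name read_elements and the record layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# One element from the example above; the xmlns:xsi declaration is assumed.
SAMPLE = """<ArrayOfDocumentElement
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <DocumentElement>
    <ID>113804</ID>
    <ParentID xsi:nil="true" />
    <ElementType>PartOfWord</ElementType>
    <X>764</X>
    <Y>324</Y>
    <Width>57</Width>
    <Height>67</Height>
    <Transcript>لم</Transcript>
  </DocumentElement>
</ArrayOfDocumentElement>"""


def read_elements(root):
    """Return one record per DocumentElement with its box corners and text."""
    records = []
    for element in root.iter("DocumentElement"):
        x = int(element.findtext("X"))       # top left column
        y = int(element.findtext("Y"))       # top left row
        width = int(element.findtext("Width"))
        height = int(element.findtext("Height"))
        records.append({
            "id": element.findtext("ID"),
            # (left, top, right, bottom) in pixel coordinates
            "box": (x, y, x + width, y + height),
            "text": element.findtext("Transcript"),
        })
    return records


elements = read_elements(ET.fromstring(SAMPLE))
print(elements[0]["id"], elements[0]["box"])  # 113804 (764, 324, 821, 391)
```

The (left, top, right, bottom) tuple is the argument order expected by common image cropping functions, so each subword can be cut from the page image directly.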

