Bullinger Dataset for Writer Adaptation (BullingerDB)

Dataset Information
Keywords
Handwriting Recognition, Writer Adaptation, Historical Documents, Handwritten Letters
Description
The Bullinger dataset is composed of a comprehensive letter correspondence of Heinrich Bullinger (1504-1575), an important Swiss Reformer. He wrote about 2,000 letters and received around 10,000 letters from over 1,000 persons. This correspondence is one of the largest from the 16th century. The letters are written mainly in Latin, but parts are also in Early New High German. Transcriptions are available for 8,600 letters, each of which can have one or more pages. Automatic line segmentation and transcription alignment were performed using the Transkribus platform. In general, the quality of the alignment is high, and so is the quality of the ground truth for handwriting recognition. However, errors may arise due to word breaks, especially at the beginning and at the end of the text lines. Furthermore, the transcription is not necessarily character-accurate, e.g. abbreviations are often written out in full. Certain writers, including Bullinger, exhibit writing styles that are very difficult to read, even for human experts. We can observe a mix of Latin and Early New High German phrases, abbreviations, and words that are very difficult to decipher without intimate knowledge of the handwriting.
For downloading the text line images, please follow the instructions provided here: https://github.com/pstroe/bullinger-htr/blob/main/README.md
To study the impact of writer adaptation, we consider text line images from a subset of 3,622 letters by 306 writers with automatically aligned transcriptions, which are used as ground truth for the handwriting recognition experiments. The dataset contains 8,393 pages with 155,246 lines and 1,241,714 words. The dataset is split into two main categories: frequent writers (who wrote at least 5 letters) and non-frequent writers (who wrote fewer than 5 letters). For the frequent writers, we use 876,003 words for training, 122,211 words for validation (optimization of hyper-parameters), and 115,289 words for testing. Furthermore, we selected 200 non-frequent writers to compose a second test set of similar size. The dataset splits contain distinct letters, i.e. no line in the test set comes from a letter that is also present in the training set. In this experimental setup, the test set for frequent writers estimates how well HTR performs for known writers, for whom several letters have been transcribed for training, and the test set for non-frequent writers estimates how well HTR performs for unknown writers, whose writing styles are not seen during training. This scenario reflects the real situation in the Bullinger project, where the transcription efforts are directed towards the most important (most frequent) writers.
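The frequent/non-frequent rule described above can be illustrated with a short sketch. This is not the authors' code: the function name `categorize_writers` and the input format (one writer identifier per letter) are hypothetical and chosen only for illustration. Only the threshold of 5 letters comes from the description.

```python
from collections import Counter

# Threshold stated in the description: frequent writers wrote at least 5 letters.
FREQUENT_MIN_LETTERS = 5


def categorize_writers(letter_writers):
    """Split writers into frequent and non-frequent sets.

    letter_writers: iterable with one writer identifier per letter
    (hypothetical input format, used here for illustration only).
    """
    letters_per_writer = Counter(letter_writers)
    frequent = {
        writer
        for writer, count in letters_per_writer.items()
        if count >= FREQUENT_MIN_LETTERS
    }
    nonfrequent = set(letters_per_writer) - frequent
    return frequent, nonfrequent
```

Applied to the full subset, such a rule should yield the 106 frequent writers (w0 to w105) found in the file structure, with the 200 writers of the second test set drawn from the non-frequent group.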
If you use the Bullinger dataset for writer adaptation in your research, please cite the following paper:
Anna Scius-Bertrand, Phillip Ströbel, Martin Volk, Tobias Hodel, and Andreas Fischer. The Bullinger Dataset: A Writer Adaptation Challenge. ICDAR 2023.
Technical Details
The dataset has the following structure:
characters.tsv:
- all characters present in the dataset (78 characters in total)
test:
- test_nonfrequent.tsv
  - id_line TAB TAB text_line_transcription (all test lines of non-frequent writers)
- test_frequent.tsv
  - id_line TAB TAB text_line_transcription (all test lines of frequent writers)
- nonfrequent_writers
  - nonfreq_w0.tsv
    - id_line TAB TAB text_line_transcription (all test lines for non-frequent writer 0)
  - nonfreq_w1.tsv
  - ...
  - nonfreq_w199.tsv
- frequent_writers
  - test_w0.tsv
    - id_line TAB TAB text_line_transcription (all test lines for frequent writer 0)
  - test_w1.tsv
  - ...
  - test_w105.tsv
train:
- train_frequent.tsv
  - id_line TAB TAB text_line_transcription (all train lines for frequent writers)
- frequent_writers
  - train_w0.tsv
    - id_line TAB TAB text_line_transcription
  - train_w1.tsv
  - ...
  - train_w105.tsv
valid:
- valid_frequent.tsv
  - id_line TAB TAB text_line_transcription (all valid lines for frequent writers)
- frequent_writers
  - valid_w0.tsv
    - id_line TAB TAB text_line_transcription
  - valid_w1.tsv
  - ...
  - valid_w105.tsv
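The split files listed above could be read with a few lines of Python. The sketch below assumes that each row holds the id_line in its first tab-separated field and the transcription in its last field, so that any empty field in between is ignored. The helper name `read_split` is hypothetical, and the exact delimiter layout should be checked against the downloaded files.

```python
def read_split(lines):
    """Yield (id_line, transcription) pairs from the rows of a split file.

    Assumption: fields are tab-separated, the first field is id_line and the
    last field is the transcription. Blank rows are skipped.
    """
    for raw in lines:
        row = raw.rstrip("\r\n")
        if not row.strip():
            continue  # skip blank rows
        fields = row.split("\t")
        id_line = fields[0]
        # A row without any tab carries no transcription.
        transcription = fields[-1] if len(fields) > 1 else ""
        yield id_line, transcription


# Possible usage (file path taken from the structure above, UTF-8 assumed):
# with open("train/train_frequent.tsv", encoding="utf-8") as handle:
#     pairs = list(read_split(handle))
```

Taking the first and last fields keeps the reader tolerant of one or two tab characters between the columns.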
An example for id_line: 10001-12000-out/la/11212_00_r2l37.png
10001-12000: range of the letter number
out/: output (no significance, always the same)
la/: language: la = Latin; ge = German (automatic language detection result)
11212: number of the letter
00: number of the page in the letter
r: text region
2: number of the text region
l: line
37: number of the text line
.png: type of the image
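The components of id_line listed above can be extracted with a regular expression. The following is a minimal sketch whose pattern is generalized from the single example given here, so identifiers that deviate from this shape will raise an error. The names `ID_LINE_PATTERN`, `LineId`, `parse_id_line`, and `letter_numbers` are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern derived from the example 10001-12000-out/la/11212_00_r2l37.png
# (assumption: all identifiers follow exactly this shape).
ID_LINE_PATTERN = re.compile(
    r"^(?P<range_start>\d+)-(?P<range_end>\d+)-out/"
    r"(?P<language>[a-z]+)/"
    r"(?P<letter>\d+)_(?P<page>\d+)_r(?P<region>\d+)l(?P<line>\d+)\.png$"
)


class LineId(NamedTuple):
    letter_range: tuple  # (first, last) letter number of the range
    language: str        # "la" or "ge" (automatic language detection result)
    letter: int
    page: int
    region: int
    line: int


def parse_id_line(id_line):
    """Split an id_line string into its documented components."""
    match = ID_LINE_PATTERN.match(id_line)
    if match is None:
        raise ValueError(f"unexpected id_line format: {id_line!r}")
    return LineId(
        letter_range=(int(match["range_start"]), int(match["range_end"])),
        language=match["language"],
        letter=int(match["letter"]),
        page=int(match["page"]),
        region=int(match["region"]),
        line=int(match["line"]),
    )


def letter_numbers(id_lines):
    """Return the set of letter numbers referenced by the given id_line values."""
    return {parse_id_line(id_line).letter for id_line in id_lines}
```

Since the splits contain distinct letters, the sets returned by `letter_numbers` for the training identifiers and for the test identifiers would be expected to have an empty intersection.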
For more details on the distribution of writers, letters, lines, and words, we refer to our paper.