A Dataset of French Trade Directories from the 19th Century (FTC)

2022-05-17 (v. 1)

Contact author

Joseph Chazalon

EPITA R&D Lab. (LRDE), France




You can cite this dataset as: Joseph Chazalon, A Dataset of French Trade Directories from the 19th Century (FTC), version 1, ID: FTC_1, URL: https://tc11.cvc.uab.es/datasets/FTC_1

Dataset Information

Dataset URL





Download URL: https://zenodo.org/record/6394464
This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861.


The purpose of this dataset is to evaluate the performance of Optical Character Recognition (OCR) and Named Entity Recognition (NER) on 19th century French documents.
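
As an illustration of the OCR side of such an evaluation, the sketch below computes a character error rate (CER), one common OCR metric, between a reference transcription and an OCR prediction. This page does not say which metrics the authors use, so the choice of CER, the function names and the sample strings are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, prediction: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, prediction) / len(reference)


# Hypothetical directory entry and a noisy OCR reading of it (not taken from the dataset).
reference = "Dupont, rue de Rivoli, 12"
prediction = "Dupont, ruc de Rivoli, 12"
print(f"CER = {character_error_rate(reference, prediction):.2%}")
```

A CER of 0 means the prediction matches the reference exactly; values can exceed 1 when the prediction is much longer than the reference.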

This dataset is divided into two parts:

  1. A labeled dataset, which contains 8765 manually corrected entries from 78 pages (18 different directories), and which is designed for supervised training.
  2. An unlabeled dataset, containing 1058196 raw entries from 6887 pages (13 different directories), and which is designed for self-supervised pre-training.
For the labeled dataset, we provide:
  • Original pages and cropped images
  • Human-corrected positions, transcriptions and entity tagging for each entry
  • OCR predictions from three systems (Tesseract v4, PERO OCR v2020 and Kraken)
  • Projected NER reference from clean text to OCR predictions, making it suitable to evaluate the performance of NER systems on real, noisy OCR predictions.
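
To make the idea of projecting a NER reference onto OCR output more concrete, here is a minimal sketch that aligns a clean transcription with a noisy OCR string and carries the entity character spans across. It is not the authors' projection procedure, which this page does not describe: it substitutes a simple character alignment from Python's standard `difflib`. The function name, the span format `(start, end, label)`, the label names and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def project_spans(clean, ocr, spans):
    """Project (start, end, label) character spans from `clean` onto `ocr`."""
    # char_map[i] is the OCR index aligned with clean[i], or None if that
    # character has no counterpart in the OCR text.
    char_map = [None] * len(clean)
    matcher = SequenceMatcher(None, clean, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                char_map[i1 + k] = j1 + k
        elif tag == "replace":
            # Map substituted characters positionally, clamped to the OCR block.
            for k in range(i2 - i1):
                char_map[i1 + k] = min(j1 + k, j2 - 1)
        # "delete": clean characters stay unmapped; "insert": nothing to map.

    projected = []
    for start, end, label in spans:
        targets = [char_map[i] for i in range(start, end) if char_map[i] is not None]
        if targets:  # drop entities that vanished entirely in the OCR text
            projected.append((min(targets), max(targets) + 1, label))
    return projected


# Hypothetical entry: the OCR misreads "rue" as "ruc" and drops a comma.
clean = "Dupont, rue de Rivoli, 12"
ocr = "Dupont, ruc de Rivoli 12"
spans = [(0, 6, "PER"), (8, 21, "LOC"), (23, 25, "CARDINAL")]

for start, end, label in project_spans(clean, ocr, spans):
    print(label, repr(ocr[start:end]))
```

With projected spans of this kind, a NER system run on the OCR text can be scored against entity boundaries expressed in the same noisy character coordinates.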

For the unlabeled dataset, we provide:

  • Automatically detected positions for each entry (these are noisy)
  • OCR predictions for each entry (PERO OCR engine)
How to cite this dataset
Please cite this dataset as:
N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.
author = {Abadie, Nathalie and
Baciocchi, St{\'e}phane and
Carlinet, Edwin and
Chazalon, Joseph and
Cristofoli, Pascal and
Dum{\'e}nieu, Bertrand and
Perret, Julien},
title = {{A} {D}ataset of {F}rench {T}rade {D}irectories from the 19th {C}entury ({FTD})},
month = mar,
year = 2022,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.6394464},
url = {https://doi.org/10.5281/zenodo.6394464}
You may also be interested in our paper presented at DAS 2022 (15th IAPR International Workshop on Document Analysis Systems), which compares the performance of OCR and NER systems on this dataset:
N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu, A Benchmark of Named Entity Recognition Approaches in Historical Documents — Application to 19th Century French Directories, May 2022, La Rochelle, France, Springer.
author = {Abadie, Nathalie and
Carlinet, Edwin and
Chazalon, Joseph and
Dum{\'e}nieu, Bertrand},
title = {{A} {B}enchmark of {N}amed {E}ntity {R}ecognition {A}pproaches in {H}istorical {D}ocuments — {A}pplication to 19th {C}entury {F}rench {D}irectories},
month = may,
year = 2022,
publisher = {Springer},
place = {La Rochelle, France}
Copyright and License
The images were extracted from the original source https://gallica.bnf.fr, owned by the Bibliothèque nationale de France (French national library). Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.
Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.
Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.
The original contents were significantly transformed before being included in this dataset.
All derived content is licensed under the permissive Creative Commons Attribution 4.0 International license.
