Ground Truth dataset for French typewritten OCR
154 pairs of images and PAGE XML files divided into 9 sub-corpora.
- (3): images are blurred (low quality), but readable nonetheless.
Unfortunately, I did not keep track of the URLs of the images taken at random from Europeana...
Underscored words are preceded with _
For example, "This is an example" (with the first three words underscored) is transcribed as "_This _is _an example".
Portions of text that are superscripted are preceded with ^
For example, "1er" is transcribed as "1^er". If several words are superscripted, each word starts with a "^".
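As an illustration of the two markers above, here is a minimal Python sketch that reads them back out of a transcribed line. The function names `parse_token` and `strip_markers` are hypothetical and not part of this dataset. The sketch assumes that tokens are separated by whitespace and that "_" and "^" never occur as literal characters in the source documents, which may not hold for every page.

```python
def parse_token(token: str) -> dict:
    """Split one whitespace-delimited token into plain text and style flags.

    Assumption: a leading "_" always means "underscored" and a "^" always
    introduces a superscripted portion (hypothetical helper, not dataset code).
    """
    underscored = token.startswith("_")
    if underscored:
        token = token[1:]
    # Everything after the first "^" is treated as the superscripted part.
    base, sep, superscript = token.partition("^")
    return {
        "text": base + superscript,
        "underscored": underscored,
        "superscript": superscript if sep else "",
    }


def strip_markers(line: str) -> str:
    """Return the line with the "_" and "^" markers removed."""
    return " ".join(parse_token(token)["text"] for token in line.split())


if __name__ == "__main__":
    print(strip_markers("_This _is _an example"))
    print(parse_token("1^er"))
```

Stripping the markers this way can be useful when a model should be trained or evaluated on plain text only, while `parse_token` keeps the styling information available for those who need it.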
Crossed-out words are not marked as such:
- words that can be read under the stroke are transcribed as if they were not crossed out;
- words that cannot be read under the stroke are transcribed like any portion of text that is not typewritten.
Any portion of text that is not typewritten is transcribed as a series of ~~~ (always three tildes). There are as many repetitions of ~~~ (with a space between each instance) as there are words.
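The placeholder rule above can be sketched in Python as follows. The helper names `count_non_typewritten` and `typewritten_only` are hypothetical, and the sketch assumes that each placeholder stands alone as a whitespace-delimited token (that is, never glued to punctuation), which should be checked against the actual transcriptions.

```python
PLACEHOLDER = "~~~"  # one placeholder per non-typewritten (or unreadable) word


def count_non_typewritten(line: str) -> int:
    """Count the words replaced by a placeholder in a transcribed line."""
    return sum(1 for token in line.split() if token == PLACEHOLDER)


def typewritten_only(line: str) -> str:
    """Return the line without placeholders (assumes exact "~~~" tokens)."""
    return " ".join(token for token in line.split() if token != PLACEHOLDER)


if __name__ == "__main__":
    sample = "Paris, le ~~~ ~~~ 1921"  # made-up line, not taken from the corpus
    print(count_non_typewritten(sample))
    print(typewritten_only(sample))
```

Counting placeholders per line gives a rough idea of how much handwritten or unreadable material a page contains, which may help when selecting pages for training.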
This dataset was built and is maintained by Alix Chagué (@alix-tz). The original works and their digitization are all copyright-free, but properly annotating a corpus takes time and is a task that should be recognized. If you use any item from this corpus of ground truth, cite the dataset using the following information:
name: 'Tapuscorpus'
url: 'https://github.com/HTR-United/tapuscorpus'
author: 'Alix Chagué'
month: 'janv'
year: '2021'
version: '{any version}'
description: 'Ground Truth dataset for French typewritten OCR (20th century documents)'
language: 'French'
time: '1900-1999'
hands: '30'
license:
- {name: 'CC-BY 4.0', url: 'https://creativecommons.org/licenses/by/4.0/'}
format: PAGE-XML
volume:
- {count: "150", metric: pages}
This work is licensed under a Creative Commons Attribution 4.0 International License.