Training an HTR model to recognize 19th century German Kurrent script with Calamari.

MGJamJam/calamari_kurrent_model


Calamari Kurrent Model

This repository contains the practical part of my Bachelor thesis. The goal is to train and evaluate a Handwritten Text Recognition (HTR) model, using the OCR engine Calamari, that recognizes 19th-century German Kurrent script.

Ground Truth Data

Protokolle des Akademischen Senats (1799-1847)

The Senatsprotokolle folder contains transcriptions of the "Protokolle des Akademischen Senats" from Eberhard Karls Universität Tübingen during the period of 1799-1847.

The data was sourced from https://github.com/ubtue/Ground-Truth/tree/main/Senatsprotokolle

Folder Structure

  • AnnotatedImages: Contains JPEG images with the textline layout outlined in red, created using the LineExtractor tool.
  • extracted_lines_UAT_047_15.zip: Contains all 813 textline images along with their respective PAGE XML files, extracted from the original images and PAGE XML files using the LineExtractor tool.
  • extracted_lines_UAT_047_19.zip: Contains all 696 textline images along with their respective PAGE XML files, extracted from the original images and PAGE XML files using the LineExtractor tool.
  • extracted_lines_UAT_047_19.zip: Contains all 715 textline images along with their respective PAGE XML files, extracted from the original images and PAGE XML files using the LineExtractor tool.
  • extracted_lines_UAT_047_22.zip: Contains all 692 textline images along with their respective PAGE XML files, extracted from the original images and PAGE XML files using the LineExtractor tool.
  • extracted_lines_UAT_047_24.zip: Contains all 1085 textline images along with their respective PAGE XML files, extracted from the original images and PAGE XML files using the LineExtractor tool.
  • extracted_lines_UAT_047_25.zip: Contains all 854 textline images along with their respective PAGE XML files, extracted from the original images and PAGE XML files using the LineExtractor tool.
  • extracted_lines_UAT_047_28_1.zip: Contains all 1799 textline images along with their respective PAGE XML files, extracted from the original images and PAGE XML files using the LineExtractor tool.
  • extracted_lines_UAT_047_28_2.zip: Contains all 1862 textline images along with their respective PAGE XML files, extracted from the original images and PAGE XML files using the LineExtractor tool.
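Each archive pairs a textline image with a PAGE XML file that carries its transcription. The sketch below shows one way to read the line text from such a file using only the Python standard library. The sample XML is a hypothetical, minimal PAGE document written for illustration (the actual files produced by the LineExtractor tool may contain more elements and a different schema version), and the helper names `local_name` and `read_line_transcriptions` are assumptions, not part of this repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal PAGE XML for a single extracted textline.
# Real files in the archives may differ in schema version and detail.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="line_0001.jpg" imageWidth="1200" imageHeight="90">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="0,0 1200,0 1200,90 0,90"/>
        <TextEquiv>
          <Unicode>Beispielzeile</Unicode>
        </TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def read_line_transcriptions(xml_text):
    """Return a dict mapping each TextLine id to its line-level transcription."""
    root = ET.fromstring(xml_text.strip())
    lines = {}
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only look at direct children, so word-level TextEquiv is ignored.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for grandchild in child:
                if local_name(grandchild.tag) == "Unicode":
                    lines[elem.get("id")] = grandchild.text or ""
    return lines


transcriptions = read_line_transcriptions(SAMPLE_PAGE_XML)
print(transcriptions)
```

Matching on the local tag name rather than a fixed namespace URI keeps the sketch independent of the PAGE schema version used when the lines were exported.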

Dataset Overview

  • Pages: 229 pages from seven volumes.
  • Textlines: 8,516 textlines.
  • Words: 37,313 words.
  • Characters: 235,612 characters.
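As a quick consistency check, the per-archive textline counts given under Folder Structure can be summed and compared with the totals above. The short sketch below does that and derives a few averages; the variable names are illustrative only.

```python
# Textline counts per archive, in the order listed under Folder Structure.
line_counts = [813, 696, 715, 692, 1085, 854, 1799, 1862]

# Totals reported in the dataset overview.
TEXTLINES = 8516
WORDS = 37313
CHARACTERS = 235612

total_lines = sum(line_counts)
words_per_line = WORDS / TEXTLINES
chars_per_line = CHARACTERS / TEXTLINES
chars_per_word = CHARACTERS / WORDS

print(f"sum of archive counts: {total_lines} (reported: {TEXTLINES})")
print(f"words per line:      {words_per_line:.2f}")
print(f"characters per line: {chars_per_line:.2f}")
print(f"characters per word: {chars_per_word:.2f}")
```

The archive counts add up to exactly 8,516 lines, with an average of roughly 4.4 words and 27.7 characters per line.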

Script and Layout

  • Script: Mostly Kurrent, written by several different scribes.
  • Layout: Text regions and baselines are manually corrected.

Transcription guidelines

All transcriptions were created using Transkribus. The transcription rules are based on Level 2 of the OCR-D transcription guidelines.

Sources

The transcriptions are based on digitized material available through OpenDigi from the University Library of Tübingen. Below are the specific volumes referenced:

Test Set: Minutes of the Swiss Federal Council (1848-1903)

  • Description: The data set consists of extracts of the minutes of the Swiss Federal Council.
  • Citation: Hodel, T., & Schoch, D. (2021). Handwritten Text Recognition Test Set: Minutes of the Swiss Federal Council (1848-1903) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4746342
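A held-out test set like this is typically scored by comparing predicted lines against their reference transcriptions. The sketch below assumes character error rate (CER) as the metric, computed as the total Levenshtein edit distance divided by the total number of reference characters. This is a common choice for HTR, but it is an assumption here and not necessarily the evaluation procedure used in the thesis. The function names and the example strings are hypothetical.

```python
def levenshtein(reference, prediction):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, pred_char in enumerate(prediction, start=1):
            substitution_cost = 0 if ref_char == pred_char else 1
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + substitution_cost,   # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(references, predictions):
    """Total edit distance over all line pairs, divided by total reference length."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(
        levenshtein(ref, pred) for ref, pred in zip(references, predictions)
    )
    return total_edits / total_chars


# Hypothetical example: one reference line and one model prediction.
cer = character_error_rate(["Bundesrath"], ["Bundesrat"])
print(f"CER: {cer:.3f}")
```

Summing edits and characters over all lines before dividing weights long lines more heavily than short ones, which avoids very short lines dominating the score.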
