poke1024/origami_models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

What is this?

This is Origami's default OCR model, trained to perform OCR on the Berliner Börsen-Zeitung. It was trained on about 10,000 lines of ground truth using Origami's line extraction, which produces dewarped, non-binarized lines. These lines tend to be somewhat wider than lines extracted with other OCR pipelines, because Origami scales lines only vertically, not horizontally.

The model was trained to read both Antiqua and Fraktur and to detect two formatting styles: bold text is marked with square brackets [], and wide text (Sperrtext) is marked with curly brackets {}. The model was also trained to recognize integer fractions (to some degree). Integer fractions are encoded as <n/m>, where n is the numerator and m is the denominator.
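For downstream processing, the markers described above can be split back into styled segments. The following is a minimal sketch, not part of Origami or Calamari: the function name `parse_line` and the segment labels are hypothetical, and it assumes markers are not nested and that unbalanced brackets should be left as plain text.

```python
import re

# Assumed marker syntax, taken from the description above:
#   [ ... ]  bold,  { ... }  wide (Sperrtext),  <n/m>  integer fraction
TOKEN = re.compile(
    r"\[(?P<bold>[^\]]*)\]|\{(?P<wide>[^}]*)\}|<(?P<num>\d+)/(?P<den>\d+)>"
)

def parse_line(text):
    """Split a transcription into (kind, value) segments.

    Hypothetical helper: nested markers are not handled, and a bracket
    without a partner stays inside a plain segment.
    """
    segments = []
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            segments.append(("plain", text[pos:m.start()]))
        if m.group("bold") is not None:
            segments.append(("bold", m.group("bold")))
        elif m.group("wide") is not None:
            segments.append(("wide", m.group("wide")))
        else:
            fraction = (int(m.group("num")), int(m.group("den")))
            segments.append(("fraction", fraction))
        pos = m.end()
    if pos < len(text):
        segments.append(("plain", text[pos:]))
    return segments
```

Calling `parse_line("Abg. {Hasenclever}: Es zeigt sich auch dies-")` yields a plain segment, a wide segment containing `Hasenclever`, and a trailing plain segment.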

Based on a limited evaluation, we found the character error rate (CER) to be about 0.5% and the word error rate (WER) to be about 1.7%.
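The README does not state how these rates were computed. As an illustration only, the sketch below uses the common definition: Levenshtein edit distance between reference and prediction, divided by the reference length, summed over all lines. The function names are hypothetical, and whether style markers count as characters in the reported figures is not known.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]

def error_rate(refs, hyps, unit="char"):
    """Total edit distance divided by total reference length.

    unit="char" gives a CER, unit="word" a WER (whitespace tokenization,
    which is an assumption rather than the evaluation's documented choice).
    """
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return errors / total if total else 0.0
```

Aggregating errors over all lines before dividing weights long lines more heavily than averaging per-line rates would; both conventions exist, so compare numbers only when the convention matches.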

See the examples below for how style markers are encoded.

Details of the training process are described in the paper On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation.

Ground Truth

The ground truth that was used to train these models can be found at

https://www.dropbox.com/sh/mgsopnami242i8u/AAByAKVmdMACiQK72jhLLQ2Ka?dl=0

The data above contains page images, region and segmentation information (produced using Origami), and an annotations.db with transcriptions (also for use with Origami).

To generate pairs of line images and texts, use Origami's export command (you can specify a custom line height and the binarization method for the export). Alternatively, you can use the following example export:

https://www.dropbox.com/s/0tvsgkh1a12xpxl/annotations-v19-h56.zip?dl=0

Model Files

The Calamari model files are too large for this repository. They can be downloaded from

https://www.dropbox.com/sh/3e5y9vb41dggttj/AAAE2DrNfpWNz6ygIlICX3cMa?dl=0

Examples

Antiqua with wide style

Example Line

Abg. {Hasenclever}: Es zeigt sich auch dies-

Antiqua with bold style

Example Line

[Konstantinopel], 2. Februar. (H. T. B.)

Fraktur with both styles

Example Line

[Dortmund], 3. April. (Privat-Depesche der}

Integer Fractions

Example Line

schen Gesellschaft für Eisenbahn-Betriebs-Material 1<7/8> %; die Actien der
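In the line above, the fraction marker directly follows a whole number, which reads as the mixed number 1 7/8. The following sketch converts such occurrences into exact values. It is an assumption, not documented model behavior, that a digit run immediately before the marker is the whole part; the name `mixed_numbers` is hypothetical.

```python
import re
from fractions import Fraction

# Optional whole part, then the <n/m> marker (assumed to be adjacent).
MIXED = re.compile(r"(?P<whole>\d+)?<(?P<num>\d+)/(?P<den>\d+)>")

def mixed_numbers(text):
    """Return every <n/m> occurrence in text as an exact Fraction."""
    values = []
    for m in MIXED.finditer(text):
        whole = int(m.group("whole") or 0)
        part = Fraction(int(m.group("num")), int(m.group("den")))
        values.append(whole + part)
    return values
```

For the example line, `mixed_numbers` returns a single value equal to `Fraction(15, 8)`. A recognized denominator of zero would raise `ZeroDivisionError`, so OCR output should be validated before conversion.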

About

Models that were trained for the Origami BBZ project.
