What is this?

This is Origami's default OCR model, trained to perform OCR on the Berliner Börsen-Zeitung. It was trained on about 10,000 lines of ground truth, using Origami's line extraction, which produces dewarped, non-binarized lines. These lines tend to be somewhat wider than lines extracted with other OCR pipelines, because Origami scales lines only vertically, not horizontally.

The model was trained to recognize both Antiqua and Fraktur and to detect two formatting styles: bold text is marked with square brackets [], and wide text (Sperrtext) is marked with curly brackets {}. The model was also trained to recognize integer fractions (to some degree). Integer fractions are encoded as <n/m>, where n is the numerator and m is the denominator.
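The marker scheme described above can be decoded with a small parser. The following is a minimal sketch, not part of Origami or Calamari: the function name `parse_line` and the segment labels are hypothetical. It assumes that markers are never nested and that literal brackets do not occur in the transcribed text.

```python
import re

# Assumed marker syntax, taken from the description above:
#   [bold text]   {wide text (Sperrtext)}   <n/m> integer fraction
# Assumption: markers are not nested inside one another.
TOKEN = re.compile(
    r"\[(?P<bold>[^\]]*)\]"
    r"|\{(?P<wide>[^}]*)\}"
    r"|<(?P<num>\d+)/(?P<den>\d+)>"
)


def parse_line(line):
    """Split one OCR output line into (kind, value) segments.

    kind is "plain", "bold", "wide" or "fraction"; a fraction's value
    is the tuple (numerator, denominator), kept exactly as recognized.
    """
    segments = []
    pos = 0
    for match in TOKEN.finditer(line):
        if match.start() > pos:
            # Unmarked text between two markers.
            segments.append(("plain", line[pos:match.start()]))
        if match.group("bold") is not None:
            segments.append(("bold", match.group("bold")))
        elif match.group("wide") is not None:
            segments.append(("wide", match.group("wide")))
        else:
            fraction = (int(match.group("num")), int(match.group("den")))
            segments.append(("fraction", fraction))
        pos = match.end()
    if pos < len(line):
        segments.append(("plain", line[pos:]))
    return segments


if __name__ == "__main__":
    print(parse_line("Abg. {Hasenclever}: Es zeigt sich auch dies-"))
```

An unmatched bracket or brace is left untouched as plain text, so partially recognized markers do not raise errors.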

Based on a limited evaluation, we found a character error rate (CER) of about 0.5% and a word error rate (WER) of about 1.7%.
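For readers who want to compute comparable figures on their own data, here is a sketch of the usual definitions: the error rate is the Levenshtein distance between reference and prediction, summed over all lines and divided by the total reference length, counted in characters for CER and in whitespace-separated words for WER. The exact protocol behind the numbers above (text normalization, treatment of style markers) is not stated here, so the functions below are an assumption and not the evaluation code that was used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(pairs, tokenize):
    """Total edit distance divided by total reference length."""
    errors = 0
    total = 0
    for ref, hyp in pairs:
        ref_tokens = tokenize(ref)
        hyp_tokens = tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return errors / total if total else 0.0


def cer(pairs):
    """Character error rate over (reference, prediction) pairs."""
    return error_rate(pairs, list)


def wer(pairs):
    """Word error rate over (reference, prediction) pairs."""
    return error_rate(pairs, str.split)


if __name__ == "__main__":
    # Illustrative pair only, not taken from the ground truth.
    sample = [("Berlin, 3. April.", "Berlin, 8. April.")]
    print(f"CER {cer(sample):.3f}  WER {wer(sample):.3f}")
```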

See the examples below for how style markers are encoded.

Details of the training process are described in the paper On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation.

Ground Truth

The ground truth that was used to train these models can be found at

https://www.dropbox.com/sh/mgsopnami242i8u/AAByAKVmdMACiQK72jhLLQ2Ka?dl=0

The data above contains page images, region and segmentation information (produced using Origami), and an annotations.db with transcriptions (also for use with Origami).

To generate pairs of line images and texts, use Origami's export command (you can specify a custom line height and the binarization method for the export). Alternatively, you can use the following example export:

https://www.dropbox.com/s/0tvsgkh1a12xpxl/annotations-v19-h56.zip?dl=0

Model Files

The Calamari model files are too large for this repository. They can be downloaded from

https://www.dropbox.com/sh/3e5y9vb41dggttj/AAAE2DrNfpWNz6ygIlICX3cMa?dl=0

Examples

Antiqua with wide style

Abg. {Hasenclever}: Es zeigt sich auch dies-

Antiqua with bold style

[Konstantinopel], 2. Februar. (H. T. B.)

Fraktur with both styles

[Dortmund], 3. April. (Privat-Depesche der}

Integer Fractions

schen Gesellschaft für Eisenbahn-Betriebs-Material 1<7/8> %; die Actien der
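When the recognized text feeds into numeric processing, as with stock quotes like the one above, the fraction markers can be turned into exact values. The sketch below assumes that digits directly preceding a `<n/m>` marker are the whole part of a mixed number (so `1<7/8>` means 1 7/8). That reading is an interpretation of the example and not a documented rule, and `mixed_numbers` is a hypothetical helper name.

```python
import re
from fractions import Fraction

# Optional whole part directly followed by an <n/m> fraction marker.
MIXED = re.compile(r"(?P<whole>\d+)?<(?P<num>\d+)/(?P<den>\d+)>")


def mixed_numbers(line):
    """Return every fraction marker in the line as an exact Fraction.

    A recognized denominator of 0 raises ZeroDivisionError; callers
    working with noisy OCR output may want to catch that.
    """
    values = []
    for match in MIXED.finditer(line):
        whole = int(match.group("whole") or 0)
        part = Fraction(int(match.group("num")), int(match.group("den")))
        values.append(whole + part)
    return values


if __name__ == "__main__":
    line = "Eisenbahn-Betriebs-Material 1<7/8> %; die Actien der"
    for value in mixed_numbers(line):
        print(value, float(value))
```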
