This repository has been archived by the owner on Jul 26, 2024. It is now read-only.

Training

Assaf Urieli edited this page Dec 9, 2016 · 17 revisions

Training is the automatic construction of an OCR letter-guessing model from a previously annotated training corpus.

To this end, a set of Features is applied to each shape in the training corpus. These generate a formal description of the shape, which is also associated with a letter, since the corpus has already been annotated. The probabilistic classifier then applies a machine learning algorithm to iteratively find a relative weight for each possible feature result and each possible letter. The final set of weights is stored in the OCR model.

In order to train a new OCR model, you need to download and unzip the latest Jochre release.

You also need the following:

  • Java v1.8+
  • PostgreSQL server v9+
  • A backup of a training corpus database constructed using JochreWeb

First, you need to restore your backup to your local PostgreSQL database server.

If the backup was created with the following command:

pg_dump -Fc -U nlp -W --no-owner --no-privileges jochreoc > jochreoc_yyyymmdd.dump

The restore can then be performed as follows:

Create the new user (if it does not yet exist):

sudo -u postgres createuser -P nlp
Enter password for new role:
Enter it again:
Shall the new role be a superuser? (y/n) y

Create the new database (you can choose any name you like; the name "jochreoc" used below stands for Jochre and Occitan):

sudo -u postgres createdb -O nlp -E "UTF8" jochreoc

Restore the database dump:

pg_restore -U nlp -W -d jochreoc jochreoc_yyyymmdd.dump
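The restore steps above can be collected into one small script. The following is only a sketch: the variable names DB_USER, DB_NAME, DUMP_FILE and DRY_RUN and the helper functions are illustrative and not part of Jochre. By default it only prints the commands so they can be checked first; setting DRY_RUN=0 executes them.

```shell
#!/bin/sh
# Sketch of the restore steps as one script (illustrative names, not Jochre settings).
# Defaults to a dry run that only prints each command; set DRY_RUN=0 to execute.
DB_USER=${DB_USER:-nlp}
DB_NAME=${DB_NAME:-jochreoc}
DUMP_FILE=${DUMP_FILE:-jochreoc_yyyymmdd.dump}   # placeholder date, as in the text above
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

restore_corpus() {
  # createuser prompts for a password; it reports an error if the role already exists.
  run sudo -u postgres createuser -P "$DB_USER"
  run sudo -u postgres createdb -O "$DB_USER" -E UTF8 "$DB_NAME"
  run pg_restore -U "$DB_USER" -W -d "$DB_NAME" "$DUMP_FILE"
}

restore_corpus
```

To restore into a differently named database, override the variables on the command line, for example DB_NAME=jochreyi DUMP_FILE=mydump.dump DRY_RUN=0 (both values here are made up for illustration).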

You will need to edit examples/conf/jochre.conf so that it points at your new database with the new login credentials. You will also need to change the locale from "oc" to the language code of the language you are training for ("oc" is the two-letter ISO 639-1 code for Occitan).

It is assumed that some pages were marked as "Training - validated" in JochreWeb, and others as "Training - test".

We can now run training as follows:

java -Xmx2G -Dconfig.file=examples/conf/jochre.conf -jar jochre-x.x.x.jar command=train features=examples/features/letters_simple.txt letterModel=data/models/letters_oc_1.zip imageStatus=training logConfigFile=examples/conf/logback.xml

Parameters:

  • features: a file containing feature descriptors to use in training, as described in Features.
  • letterModel: the location of the OCR model to save. Note that you can choose to name the model anything you like.
  • imageStatus: which images to use for training. If you choose "training", it will include all images marked "Training - validated" on JochreWeb. If you choose "all", it will include all images marked "Training - validated", "Training - hold-out", and "Training - test". Other options are "test" and "heldOut". You can combine several options in a comma-delimited list.
  • docSet (optional): a comma-delimited list of document ids to include in the training corpus.

Note: for large databases, you may need more than 2 gigabytes of memory. In that case, change the parameter from -Xmx2G to -Xmx16G (or however much memory your system can spare).
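Since the training command is long and only a few values usually change between runs, it can help to wrap it in a script. The following is a sketch only: the variable names (JOCHRE_JAR, HEAP, CONFIG_FILE, FEATURES, LOG_CONFIG, DRY_RUN) and the function train_model are illustrative, while the Jochre parameters themselves are the ones described above. It prints the command by default; set DRY_RUN=0 to launch training.

```shell
#!/bin/sh
# Sketch: wrapper around the training command (illustrative names, not Jochre settings).
# Prints the command by default; set DRY_RUN=0 to actually run it.
JOCHRE_JAR=${JOCHRE_JAR:-jochre-x.x.x.jar}   # placeholder version, as in the text above
HEAP=${HEAP:-2G}                             # raise to e.g. 16G for large databases
CONFIG_FILE=${CONFIG_FILE:-examples/conf/jochre.conf}
FEATURES=${FEATURES:-examples/features/letters_simple.txt}
LOG_CONFIG=${LOG_CONFIG:-examples/conf/logback.xml}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1: letterModel path, $2: imageStatus value(s), $3: optional docSet list
train_model() {
  model=$1
  status=$2
  doc_set=${3:-}
  # docSet is only added to the command line when a third argument is given.
  run java "-Xmx$HEAP" "-Dconfig.file=$CONFIG_FILE" -jar "$JOCHRE_JAR" \
    command=train "features=$FEATURES" "letterModel=$model" \
    "imageStatus=$status" "logConfigFile=$LOG_CONFIG" \
    ${doc_set:+"docSet=$doc_set"}
}

train_model data/models/letters_oc_1.zip training
```

For example, train_model data/models/letters_oc_2.zip training,heldOut 4,5 would restrict training to documents 4 and 5 and include the held-out pages (the model name and document ids here are made up for illustration).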

Changing page status

If for any reason your database copy has the wrong statuses, you can update them locally as follows. Run psql for your database (or open the SQL editor in pgAdmin III).

Run the following query to see the current status of pages in a given set of documents (the document id can be seen on JochreWeb):

select doc_name, page_index, imgstatus_name from ocr_image
inner join ocr_page on image_page_id = page_id
inner join ocr_document on page_doc_id = doc_id
inner join ocr_image_status on image_imgstatus_id = imgstatus_id
where doc_id in (4,5)
order by doc_name, page_index;

Run the following query to update all pages in a given document (in this case doc id 8):

UPDATE ocr_image SET image_imgstatus_id=(SELECT imgstatus_id FROM ocr_image_status WHERE imgstatus_name='Training - validated')
WHERE image_page_id IN (SELECT page_id FROM ocr_page WHERE page_doc_id=8);

Run the following query to update a page by index (in this case to change pages 14 and 15 of doc id 8):

UPDATE ocr_image SET image_imgstatus_id=(SELECT imgstatus_id FROM ocr_image_status WHERE imgstatus_name='Training - validated')
WHERE image_page_id IN (SELECT page_id FROM ocr_page WHERE page_doc_id=8 AND page_index IN (14,15));

The image statuses that can be used are:

  • 'Training - new' : exclude from training and evaluation
  • 'Training - validated' : use for training
  • 'Training - hold-out' : use for evaluation during model tuning
  • 'Training - test' : use for final evaluation
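If you change statuses often, the UPDATE statements above can be generated rather than edited by hand. The following is a sketch under assumptions: status_update_sql is a hypothetical helper, not part of Jochre, and it relies only on the table and column names used in the queries above. It refuses status names other than the four listed and accepts only numeric ids, since the values are pasted directly into the SQL text.

```shell
#!/bin/sh
# Sketch: print the UPDATE statement for one document, optionally limited to
# certain page indexes. status_update_sql is a hypothetical helper name.
# $1: document id, $2: status name, $3: optional comma-delimited page indexes
status_update_sql() {
  doc_id=$1
  status=$2
  pages=${3:-}
  # Only the four status names listed above are accepted.
  case $status in
    'Training - new'|'Training - validated'|'Training - hold-out'|'Training - test') ;;
    *) echo "unknown status: $status" >&2; return 1 ;;
  esac
  # Values are inserted into SQL text, so allow digits (and commas) only.
  case $doc_id in
    ''|*[!0-9]*) echo "doc id must be a number" >&2; return 1 ;;
  esac
  case $pages in
    *[!0-9,]*) echo "pages must be comma-delimited numbers" >&2; return 1 ;;
  esac
  page_filter=""
  if [ -n "$pages" ]; then
    page_filter=" AND page_index IN ($pages)"
  fi
  printf "UPDATE ocr_image SET image_imgstatus_id=(SELECT imgstatus_id FROM ocr_image_status WHERE imgstatus_name='%s')\n" "$status"
  printf "WHERE image_page_id IN (SELECT page_id FROM ocr_page WHERE page_doc_id=%s%s);\n" "$doc_id" "$page_filter"
}

status_update_sql 8 'Training - validated' 14,15
```

The printed statement can be reviewed and then piped into psql, for example by appending | psql -U nlp -d jochreoc to the call (using the user and database names from the restore example).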

You can see the number of pages in each status, per document, using the following query:

select doc_name, imgstatus_name, count(*) from ocr_image
inner join ocr_page on image_page_id = page_id
inner join ocr_document on page_doc_id = doc_id
inner join ocr_image_status on image_imgstatus_id = imgstatus_id
group by doc_name, imgstatus_name
order by doc_name, imgstatus_name;