Getting error when trying to teach network on the base of the default model #227

Open
vlad-wonderkidstudio opened this issue Jun 13, 2017 · 5 comments

Comments

@vlad-wonderkidstudio

vlad-wonderkidstudio commented Jun 13, 2017

Expected Behavior

If I continue training the network from the existing model, it should work fine.

Current Behavior

If I continue training the network from the existing model, I always get the following error
(sometimes right after launching the app, sometimes after 10-30 seconds):

Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 289, in <module>
    pcs = network.trainSequence(line,cs,update=do_update,key=fname)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 902, in trainSequence
    self.targets = array(make_target(cs,self.No))
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 734, in make_target
    result[2*i+1,j] = 1.0
IndexError: index 156 is out of bounds for axis 1 with size 156

Possible Solution

Steps to Reproduce (for bugs)

  1. Run command ocropus-rtrain --load models/en-default.pyrnn.gz -o my_new_model_name ground/????/*.bin.png
  2. Get an error (sometimes right after executing the command, sometimes after 10-30 seconds)

Your Environment

  • Python version: Python 2.7.6

@harinath141

Try this command:
ocropus-rtrain --load models/en-default.pyrnn.gz -o my_new_model_name ground/????/*.bin.png -S 100 -F200

@Beckenb
Contributor

Beckenb commented Jun 19, 2017

This error occurs when you load an existing model and try to teach it characters it has never seen before. In other words, the codec (or chars.py) you are using has characters in it that are not included in the initial training of the en-default model. As far as I know there is no way to expand the codec/character set once training has started.
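
A minimal sketch of the failure, written in plain Python rather than the NumPy code in ocrolib/lstm.py. The traceback shows make_target(cs, self.No) writing result[2*i+1,j] = 1.0, so any class index j that is not smaller than the number of output classes falls outside the target array. The function below is a simplified stand-in: the list-of-lists layout and the name make_target_sketch are assumptions for illustration, not the library's implementation.

```python
def make_target_sketch(cs, nc):
    """Build a (2*len(cs)+1) x nc target table with a 1.0 for each class in cs.

    Simplified stand-in for the indexing seen in the traceback; the real
    ocrolib code uses a NumPy array and may differ in other details.
    """
    result = [[0.0] * nc for _ in range(2 * len(cs) + 1)]
    for i, j in enumerate(cs):
        # Same indexing pattern as `result[2*i+1,j] = 1.0` in the traceback.
        result[2 * i + 1][j] = 1.0
    return result


# A model with 156 outputs accepts class indices 0..155 only.
ok = make_target_sketch([5, 155], 156)

try:
    make_target_sketch([5, 156], 156)  # class 156 does not exist in the model
    overflow = False
except IndexError:
    overflow = True
```

Under this reading, "index 156 is out of bounds for axis 1 with size 156" would mean that one transcription character was encoded as class 156 while the loaded model only has classes 0 to 155.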

@jze
Contributor

jze commented Aug 14, 2017

I have been able to reproduce the problem with a character included in the existing model's codec. Try to continue training the en-default model with this image:
[image: 473 nrm]

./ocropus-rtrain --load models/en-default.pyrnn.gz -o test 29265260-aea06d62-80e0-11e7-99f3-d0e061cec2a0.png

The resulting error is IndexError: index 156 is out of bounds for axis 1 with size 156
Or doesn't the en-default model use the default codec?

@Beckenb
Contributor

Beckenb commented Oct 23, 2017

I can't say what codec was used to train the en-default model (I don't know whether there is any way to find out), but I highly doubt that they used the default codec, since it includes German and French characters.

@mcriggs

mcriggs commented May 10, 2018

I had a similar issue. To solve it, I first interpreted the "size 156" in the error message as the size of the codec with which the en-default model was trained. I understand this might not be correct. I also noticed that the default codec in my chars.py file was larger than 156 characters. I experimented with various character sets in the chars.py file and found that using the following codec of 156 characters while training on the en-default model resolved the issue:

~!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}¡¢£§©«®°¶»¿ÀÂÄÆÇÈÉÊËÎÏÔÖÙÛÜßàâäæçèéêëîïôö÷ùûüÿŒœŸ†‡•‣‹›€∙▪▫

(N.B. Other similar sets of 156 characters may seem to work, but I found that deletions or substitutions of these characters lead to remappings of characters in the ocropus-rpred output. For example, an "è" might be consistently output as an "û" or something like that.)
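
One way to picture the remapping, as a sketch under an assumption this thread does not confirm: if class indices are assigned by position in the sorted character set, then deleting one character shifts every later character down by one, and a model trained with the old numbering then emits a neighboring character. The helper build_codec and the tiny character sets are hypothetical and only illustrate the idea.

```python
def build_codec(charset):
    """Map class index -> character by position in the sorted, de-duplicated set.

    Assumption for illustration only: indices follow sorted order.
    """
    return dict(enumerate(sorted(set(charset))))


# "\u00e8" is e-grave, "\u00e9" is e-acute, "\u00fb" is u-circumflex.
trained = build_codec(u"abc\u00e8\u00e9\u00fb")  # numbering used during training
edited = build_codec(u"abc\u00e9\u00fb")         # same set with e-grave deleted

# Index the model learned for e-grave under the original numbering.
index_of_e_grave = [k for k, ch in trained.items() if ch == u"\u00e8"][0]

# Decoding that index with the edited codec yields a different character.
decoded = edited[index_of_e_grave]
```

In this toy case the index learned for "è" decodes to "é" once "è" has been removed, which is the same kind of consistent substitution described above.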

The chars.py file I now use for training on the en-default model is as follows:

digits = u"0123456789"
letters = u"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
symbols = ur"""!"#$%&'()*+,-./:;<=>?@[]^_`{|}~"""
ascii = digits+letters+symbols
xsymbols = u"""€¢£»«›‹÷©®†‡°∙•‣¶§÷¡¿▪▫"""
mychars = u"ÀÂÄÆÇÈÉÊËÎÏÔÖÙÛÜßàâäæçèéêëîïôöùûüÿŸŒœ"
default = ascii+xsymbols+mychars
european = default

Changing the chars.py file in this way resolved this issue, at least for me. I have had no further problems training new models on the en-default model since.

@jze, all the characters in your image of the text "Mückendorf 167. 4." are admissible. It is likely, however, that because of the additional characters in the default codec of the chars.py file you were using, the ü was pushed beyond the limit of the 156-character codec and thus caused the error.
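
A small sketch of that effect, again assuming (without confirmation from the ocropy sources) that class indices follow the sorted order of the character set. The function out_of_range_chars, the ten-character set, and the limit of 10 are hypothetical stand-ins for the real codec and its 156 outputs.

```python
def out_of_range_chars(text, charset, n_outputs):
    """Return the characters of text whose class index would not fit the model.

    Assumption for illustration: class indices follow the sorted order of
    charset; characters missing from charset are reported as well.
    """
    index = {ch: k for k, ch in enumerate(sorted(set(charset)))}
    return [ch for ch in text if index.get(ch, n_outputs) >= n_outputs]


# "\u00fc" is u-umlaut; it sorts last in this set and gets index 9.
small = u"Mcdefknor\u00fc"
# Adding a-acute and n-tilde, which both sort before u-umlaut, shifts it to 11.
larger = small + u"\u00e1\u00f1"

fits = out_of_range_chars(u"M\u00fcckendorf", small, 10)
overflows = out_of_range_chars(u"M\u00fcckendorf", larger, 10)
```

With the small set every character of "Mückendorf" fits below the limit. With the larger set only "ü" is reported, even though "ü" itself was never removed from the character set.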
