Add initial support for traineddata files in standard archive formats #2290

Merged: 1 commit into tesseract-ocr:master from the libarchive branch, Mar 5, 2019

stweil commented Mar 5, 2019

This requires libarchive-dev.

Tesseract can now load traineddata files in any archive format
supported by libarchive. Example of a zipped BagIt archive:

    $ unzip -l /usr/local/share/tessdata/zip.traineddata
    Archive:  /usr/local/share/tessdata/zip.traineddata
      Length      Date    Time    Name
    ---------  ---------- -----   ----
           55  2019-03-05 15:27   bagit.txt
            0  2019-03-05 15:25   data/
         1557  2019-03-05 15:28   manifest-sha256.txt
      1082890  2019-03-05 15:25   data/eng.word-dawg
      1487588  2019-03-05 15:25   data/eng.lstm
         7477  2019-03-05 15:25   data/eng.unicharset
        63346  2019-03-05 15:25   data/eng.shapetable
       976552  2019-03-05 15:25   data/eng.inttemp
        13408  2019-03-05 15:25   data/eng.normproto
         4322  2019-03-05 15:25   data/eng.punc-dawg
         4738  2019-03-05 15:25   data/eng.lstm-number-dawg
         1410  2019-03-05 15:25   data/eng.freq-dawg
          844  2019-03-05 15:25   data/eng.pffmtable
         6360  2019-03-05 15:25   data/eng.lstm-unicharset
         1012  2019-03-05 15:25   data/eng.lstm-recoder
         1047  2019-03-05 15:25   data/eng.unicharambigs
         4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
     16109842  2019-03-05 15:25   data/eng.bigram-dawg
           80  2019-03-05 15:25   data/eng.version
         6426  2019-03-05 15:25   data/eng.number-dawg
      3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
    ---------                     -------
     23468070                     21 files
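
Because the archive in the listing carries a `manifest-sha256.txt`, its payload can be checked with ordinary tooling before Tesseract ever sees it. The following is a minimal Python sketch, not part of this PR: the function name `verify_bag` is hypothetical, and it assumes the manifest uses the usual BagIt line layout of a hex digest followed by whitespace and the entry path.

```python
import hashlib
import zipfile


def verify_bag(archive_path):
    """Return the payload entries that are missing or whose SHA-256 differs
    from the digest recorded in manifest-sha256.txt (hypothetical helper)."""
    mismatches = []
    with zipfile.ZipFile(archive_path) as zf:
        manifest = zf.read("manifest-sha256.txt").decode("utf-8")
        for line in manifest.splitlines():
            if not line.strip():
                continue  # ignore blank lines
            # Assumed manifest layout: "<hex digest>  <path inside the bag>"
            digest, name = line.split(maxsplit=1)
            name = name.strip()
            try:
                payload = zf.read(name)
            except KeyError:
                # Entry is listed in the manifest but absent from the archive
                mismatches.append(name)
                continue
            if hashlib.sha256(payload).hexdigest() != digest.lower():
                mismatches.append(name)
    return mismatches
```

An empty return value means every entry listed in the manifest is present and unmodified. The sketch only handles zip containers; other formats supported by libarchive would need a different reader.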

`combine_tessdata -d` (list the components) and `combine_tessdata -u` (unpack them) also work with these archives.

The traineddata files in the new format can be generated with
standard tools like zip or tar.
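
For scripted builds, the same layout as in the listing above can also be produced without the command-line tools. This is a minimal Python sketch under stated assumptions: `build_bagit_traineddata` is a hypothetical helper, the component directory is assumed to hold already unpacked files such as `eng.lstm` and `eng.unicharset`, and the `bagit.txt` content is an assumed BagIt declaration rather than something taken from this PR. Whether Tesseract accepts an archive built this way has not been verified here.

```python
import hashlib
import zipfile
from pathlib import Path


def build_bagit_traineddata(component_dir, out_path):
    """Pack unpacked component files into a zipped BagIt-style archive
    (bagit.txt, manifest-sha256.txt, data/<component>). Hypothetical helper."""
    components = sorted(p for p in Path(component_dir).iterdir() if p.is_file())
    manifest_lines = []
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Assumed BagIt declaration; adjust the version to the spec in use
        zf.writestr(
            "bagit.txt",
            "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        )
        for path in components:
            payload = path.read_bytes()
            name = f"data/{path.name}"
            zf.writestr(name, payload)
            # One manifest line per payload file: "<sha256>  <path>"
            manifest_lines.append(f"{hashlib.sha256(payload).hexdigest()}  {name}\n")
        zf.writestr("manifest-sha256.txt", "".join(manifest_lines))
    # Return the payload entry names that were written
    return [f"data/{p.name}" for p in components]
```

A call such as `build_bagit_traineddata("unpacked/", "zip.traineddata")` (both paths are placeholders) writes one `data/` entry per component file plus the two BagIt tag files.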

More work is needed for the other training tools and for big-endian support.

Signed-off-by: Stefan Weil [email protected]

stweil commented Mar 5, 2019

Support for the CMake-based build is still missing, so build with autoconf to try the new feature.

zdenop merged commit 868a623 into tesseract-ocr:master on Mar 5, 2019
ghost removed the review label on Mar 5, 2019
stweil deleted the libarchive branch on March 5, 2019 18:09