Segfault on using -psm 0 when using fast eng.traineddata #1167

Closed
jbarlow83 opened this issue Oct 9, 2017 · 7 comments
Labels
OSD Orientation and Script Detection

@jbarlow83

jbarlow83 commented Oct 9, 2017

Environment

  • Tesseract Version: 4.00 alpha
  • Commit Number: 1b0379c
  • Platform: Presumably all - found on Ubuntu 17.04, confirmed on macOS Sierra

Current Behavior:

With eng.traineddata from tessdata_fast, Tesseract segfaults in -psm 0 mode on every input file. Example:

b095cb28b6c868b99d19e1c64b48a626bc4cb944  osd.traineddata
31abd495e0f719db4f524c447e9d855124a0b0d6  eng.traineddata
$ tesseract -psm 0 testing/phototest.tif stdout
Segmentation fault

Behaviour is the same using tessdata_best.

After replacing with tessdata/eng.traineddata, OSD works fine:

b095cb28b6c868b99d19e1c64b48a626bc4cb944  osd.traineddata
cdcfae0c5c272b5b2f0406cc91ac5d022f7df7f4 eng.traineddata
$ tesseract -psm 0 testing/phototest.tif stdout
Page 1
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 15.98
Script: Latin
Script confidence: 460.00
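
For scripts that consume this OSD report, the "Key: value" lines can be turned into a dictionary. This is a minimal sketch that assumes the text layout shown above; `parse_osd_output` is a hypothetical helper name, not part of Tesseract.

```python
def parse_osd_output(text):
    """Parse 'Key: value' lines of Tesseract OSD text output into a dict.

    Assumes the layout shown in this report; lines without a colon
    (such as the "Page 1" header) are skipped.
    """
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so not a key/value line
        key, value = key.strip(), value.strip()
        # Try int first, then float; otherwise keep the raw string.
        for convert in (int, float):
            try:
                result[key] = convert(value)
                break
            except ValueError:
                continue
        else:
            result[key] = value
    return result


# Sample copied from the working run in this report.
sample = """Page 1
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 15.98
Script: Latin
Script confidence: 460.00
"""

osd = parse_osd_output(sample)
print(osd["Rotate"], osd["Script"])  # prints "0 Latin"
```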

I discovered this in the Ubuntu 17.04 PPA (ppa:alex-p/tesseract-ocr) and replicated it on macOS with Tesseract built from source.
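
To check which eng.traineddata is installed, the two SHA-1 digests quoted above can be compared against the file on disk. The sketch below assumes only those two digests; the function names and the `KNOWN_ENG_SHA1` table are hypothetical, and any other file is reported as "unknown".

```python
import hashlib

# Digests copied from this report; labels name the repository each came from.
KNOWN_ENG_SHA1 = {
    "31abd495e0f719db4f524c447e9d855124a0b0d6": "tessdata_fast (crashes with -psm 0)",
    "cdcfae0c5c272b5b2f0406cc91ac5d022f7df7f4": "tessdata (OSD works)",
}


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify_eng_traineddata(path):
    """Map a traineddata file to a label from this report, or 'unknown'."""
    return KNOWN_ENG_SHA1.get(sha1_of_file(path), "unknown")
```

A call such as `identify_eng_traineddata("/path/to/tessdata/eng.traineddata")` would then return one of the two labels or "unknown"; the path here is a placeholder, since the tessdata directory differs between installations.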

Stack trace (Linux version)

#0  0x00007f5e14a5494d in tesseract::Classify::CharNormClassifier(TBLOB*, tesseract::TrainingSample const&, ADAPT_RESULTS*) () from /usr/lib/libtesseract.so.4
#1  0x00007f5e14a55885 in tesseract::Classify::DoAdaptiveMatch(TBLOB*, ADAPT_RESULTS*) () from /usr/lib/libtesseract.so.4
#2  0x00007f5e14a55f89 in tesseract::Classify::AdaptiveClassifier(TBLOB*, BLOB_CHOICE_LIST*) () from /usr/lib/libtesseract.so.4
#3  0x00007f5e14973c82 in os_detect_blob(BLOBNBOX*, OrientationDetector*, ScriptDetector*, OSResults*, tesseract::Tesseract*) ()
   from /usr/lib/libtesseract.so.4
#4  0x00007f5e1497413b in os_detect_blobs(GenericVector<int> const*, BLOBNBOX_CLIST*, OSResults*, tesseract::Tesseract*) () from /usr/lib/libtesseract.so.4
#5  0x00007f5e1497453d in os_detect(TO_BLOCK_LIST*, OSResults*, tesseract::Tesseract*) () from /usr/lib/libtesseract.so.4
#6  0x00007f5e14974782 in orientation_and_script_detection(STRING&, OSResults*, tesseract::Tesseract*) () from /usr/lib/libtesseract.so.4
#7  0x00007f5e14943f8f in tesseract::TessBaseAPI::DetectOS(OSResults*) ()
   from /usr/lib/libtesseract.so.4
#8  0x00007f5e149440b9 in tesseract::TessBaseAPI::DetectOrientationScript(int*, float*, char const**, float*) () from /usr/lib/libtesseract.so.4
#9  0x00007f5e149441b1 in tesseract::TessBaseAPI::GetOsdText(int) ()
   from /usr/lib/libtesseract.so.4
#10 0x00007f5e1494bcd4 in tesseract::TessOsdRenderer::AddImageHandler(tesseract::TessBaseAPI*) () from /usr/lib/libtesseract.so.4
@amitdo
Collaborator

amitdo commented Oct 12, 2017

tesseract -psm 0 testing/phototest.tif stdout

eng is the default language even when you use --psm 0.

The OSD module is based on the legacy engine. The 'fast' and 'best' traineddata files were trained for the LSTM engine only.

@amitdo
Collaborator

amitdo commented Oct 12, 2017

Try this:

tesseract --psm 0 -l osd testing/phototest.tif stdout
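
For callers that run Tesseract from a script, the same workaround can be wrapped so that `-l osd` is always passed for OSD-only runs. This is a sketch under assumptions: `build_osd_command` and `run_osd` are hypothetical helper names, the binary is assumed to be on `PATH` as `tesseract`, and the negative return code check relies on POSIX behaviour of Python's `subprocess`.

```python
import subprocess


def build_osd_command(image_path, tesseract_bin="tesseract"):
    """Build the argv for an OSD-only run that explicitly selects osd.traineddata."""
    return [tesseract_bin, "--psm", "0", "-l", "osd", str(image_path), "stdout"]


def run_osd(image_path, tesseract_bin="tesseract"):
    """Run OSD and return its text output, raising if the process fails."""
    proc = subprocess.run(
        build_osd_command(image_path, tesseract_bin),
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX a negative return code means the child was killed by a
        # signal, for example -11 for a segmentation fault.
        raise RuntimeError(f"tesseract killed by signal {-proc.returncode}")
    proc.check_returncode()  # raises CalledProcessError on a non-zero exit
    return proc.stdout


print(build_osd_command("testing/phototest.tif"))
```

Treating a signal as a distinct error keeps a crash like the one in this report from being mistaken for an ordinary non-zero exit.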

@jbarlow83
Author

The explicit -l osd works around the issue.

Still, it shouldn't crash even when the arguments are invalid.

@brlin-tw
Contributor

brlin-tw commented Jan 8, 2018

Here's a more complete stack trace from issue #1258, taken at commit 000d027 with debugging enabled:

#0  tesseract::Classify::CharNormClassifier (this=0x7ffff7fd1010, blob=0x5d3f080, sample=..., 
    adapt_results=0x5cc7630) at adaptmatch.cpp:1349
#1  0x00007ffff77052a8 in tesseract::Classify::DoAdaptiveMatch (this=0x7ffff7fd1010, Blob=0x5d3f080, 
    Results=0x5cc7630) at adaptmatch.cpp:1581
#2  0x00007ffff76fff89 in tesseract::Classify::AdaptiveClassifier (this=0x7ffff7fd1010, Blob=0x5d3f080, 
    Choices=0x7fffffffc0d0) at adaptmatch.cpp:192
#3  0x00007ffff75feb32 in os_detect_blob (bbox=0x5cd6870, o=0x7fffffffc170, s=0x7fffffffc180, 
    osr=0x7fffffffcc50, tess=0x7ffff7fd1010) at osdetect.cpp:354
#4  0x00007ffff75fe756 in os_detect_blobs (allowed_scripts=0x0, blob_list=0x7fffffffca00, 
    osr=0x7fffffffcc50, tess=0x7ffff7fd1010) at osdetect.cpp:305
#5  0x00007ffff75fe490 in os_detect (port_blocks=0x7fffffffcb90, osr=0x7fffffffcc50, tess=0x7ffff7fd1010)
    at osdetect.cpp:264
#6  0x00007ffff75fe0b1 in orientation_and_script_detection (filename=..., osr=0x7fffffffcc50, 
    tess=0x7ffff7fd1010) at osdetect.cpp:225
#7  0x00007ffff75c047e in tesseract::TessBaseAPI::DetectOS (this=0x607360 <main::api>, osr=0x7fffffffcc50)
    at baseapi.cpp:2382
#8  0x00007ffff75be7db in tesseract::TessBaseAPI::DetectOrientationScript (this=0x607360 <main::api>, 
    orient_deg=0x7fffffffd420, orient_conf=0x7fffffffd424, script_name=0x7fffffffd438, 
    script_conf=0x7fffffffd428) at baseapi.cpp:1896
#9  0x00007ffff75be8fd in tesseract::TessBaseAPI::GetOsdText (this=0x607360 <main::api>, page_number=0)
    at baseapi.cpp:1928
#10 0x00007ffff75cb8bc in tesseract::TessOsdRenderer::AddImageHandler (this=0x81c890, 
    api=0x607360 <main::api>) at renderer.cpp:268
#11 0x00007ffff75cafe5 in tesseract::TessResultRenderer::AddImage (this=0x81c890, api=0x607360 <main::api>)
    at renderer.cpp:86
#12 0x00007ffff75bbd11 in tesseract::TessBaseAPI::ProcessPage (this=0x607360 <main::api>, pix=0x40d8140, 
    page_index=0, filename=0x7fffffffde53 "/tmp/com.github.ocrmypdf.ec7wbvyw/000001.ocr.png", 
    retry_config=0x0, timeout_millisec=0, renderer=0x81c890) at baseapi.cpp:1224
#13 0x00007ffff75bb973 in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x607360 <main::api>, 
    filename=0x7fffffffde53 "/tmp/com.github.ocrmypdf.ec7wbvyw/000001.ocr.png", retry_config=0x0, 
    timeout_millisec=0, renderer=0x81c890) at baseapi.cpp:1156
#14 0x00007ffff75bb385 in tesseract::TessBaseAPI::ProcessPages (this=0x607360 <main::api>, 
    filename=0x7fffffffde53 "/tmp/com.github.ocrmypdf.ec7wbvyw/000001.ocr.png", retry_config=0x0, 
    timeout_millisec=0, renderer=0x81c890) at baseapi.cpp:1056
#15 0x0000000000403ae6 in main (argc=11, argv=0x7fffffffda28) at tesseractmain.cpp:529

@Shreeshrii
Collaborator

@stweil Should the program use -l osd by default internally for 4.0.0 when --psm 0 is used?

@amitdo
Collaborator

amitdo commented Apr 30, 2018

Should the program use -l osd by default internally for 4.0.0 when --psm 0 is used?

Yes, IMO.
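
The proposal amounts to a small change in how the default language is chosen. The sketch below only illustrates that idea and is not the code from any Tesseract commit; `effective_language` and the constants are hypothetical names, with 0 standing for the OSD-only page segmentation mode.

```python
DEFAULT_LANG = "eng"  # current default, as noted earlier in this thread
OSD_LANG = "osd"
PSM_OSD_ONLY = 0      # the mode selected by --psm 0


def effective_language(psm, lang=None):
    """Pick the language to load: an explicit -l wins, else osd for PSM 0."""
    if lang:
        return lang
    return OSD_LANG if psm == PSM_OSD_ONLY else DEFAULT_LANG


print(effective_language(0))         # prints "osd"
print(effective_language(3))         # prints "eng"
print(effective_language(0, "eng"))  # prints "eng"
```

An explicit `-l` is still honoured in this sketch, so passing a language without legacy data alongside `--psm 0` would need separate validation to avoid the crash.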

@stweil
Contributor

stweil commented Sep 20, 2018

I think this was fixed in commit 27ce472, so the issue can be closed.

@zdenop zdenop closed this as completed Sep 20, 2018
@amitdo amitdo added the OSD Orientation and Script Detection label May 14, 2020