This repository has been archived by the owner on Sep 28, 2020. It is now read-only.

cis-ocropy-segment crashes without error log #7

Closed
beckstefan opened this issue Sep 16, 2020 · 10 comments

@beckstefan

With an up-to-date Docker image, I get the following when running

'olena-binarize -I OCR-D-IMG -O OCR-D-BIN' \
'cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW' \
'anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP' \
'cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page' \
'tesserocr-recognize -I OCR-D-PAGE-SEG -O OCR-D-OCR -P model Fraktur'

(The workflow is an attempt to get the three columns recognized correctly in http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24)

2020-09-16 11:33:03,918.918 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP -p '{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, "operation_level": "page"}''
2020-09-16 11:33:03,920.920 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -p '{"level-of-operation": "page", "dpi": 0, "maxcolseps": 20, "maxseps": 20, "maximages": 10, "csminheight": 4, "hlminwidth": 10, "gap_height": 0.01, "gap_width": 1.5, "overwrite_order": true, "overwrite_separators": true, "overwrite_regions": true, "overwrite_lines": true, "spread": 2.4}''
Traceback (most recent call last):
  File "/usr/bin/ocrd", line 33, in <module>
    sys.exit(load_entry_point('ocrd', 'console_scripts', 'ocrd')())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/build/core/ocrd/ocrd/cli/process.py", line 28, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/build/core/ocrd/ocrd/task_sequence.py", line 149, in run_tasks
    raise Exception("%s exited with non-zero return value %s. STDOUT:\n%s\nSTDERR:\n%s" % (task.executable, returncode, out, err))
Exception: ocrd-cis-ocropy-segment exited with non-zero return value -9. STDOUT:

STDERR:


Continuing manually to get the error:

docker run --rm -v /home/ocrd/workspace/gue-11660-24-e-1/:/data -- ocrd/all:maximum ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page -l DEBUG

This resulted in nothing happening (no output to the terminal, no folder OCR-D-PAGE-SEG), except for an exit code of 137.

The source images are relatively big (10MB, jpeg), but I can provide them in case of need as well.

@kba
Member

kba commented Sep 16, 2020

the missing STDERR was reported this week, it's a bug in core I will try to fix asap.

no idea about the exit code. but I generally discourage ocrd_ocropy, there is a much better version in ocrd_cis.

@beckstefan changed the title from "ocropy-segment crashes without error log" to "cis-ocropy-segment crashes without error log" on Sep 16, 2020
@beckstefan
Author

I wasn't really aware that there is ocrd-ocropy and ocrd-cis(-ocropy?), so the title was maybe misleading. I used ocrd-cis-ocropy-segment.

About the STDERR I am not sure, because running directly gives neither STDERR nor STDOUT. Or does ocrd-cis-ocropy-segment use ocrd_core?

@bertsky
Collaborator

bertsky commented Sep 16, 2020

> the missing STDERR was reported this week, it's a bug in core I will try to fix asap.

@kba you mean OCR-D/core#592?

> I wasn't really aware that there is ocrd-ocropy and ocrd-cis(-ocropy?), so the title was maybe misleading. I used ocrd-cis-ocropy-segment.

@beckstefan The main work on wrapping Ocropy for OCR-D and improving it was done in ocrd_cis, whereas ocrd_ocropy does not offer anything useful yet and is currently inactive.

(I have no rights to transfer the issue to ocrd_cis, but also I am not sure it does belong there, as the problem seems to be in core's ocrd process.)

> About the STDERR I am not sure, because running directly gives neither STDERR nor STDOUT. Or does ocrd-cis-ocropy-segment use ocrd_core?

All OCR-D wrappers (Python and bash based) use OCR-D/core. What @kba was saying was that the missing log messages are a problem specific to ocrd process (which is part of core).

So, could you please run your workflow directly, by calling the individual processor CLIs instead? (So we at least know what led up to the exit -9?)

For your workflow, that'll be:

ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN
ocrd-cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW
ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP
ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page
ocrd-tesserocr-recognize -I OCR-D-PAGE-SEG -O OCR-D-OCR -P model Fraktur
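
The mapping from `ocrd process` task strings to standalone commands can be sketched in a few lines. This is only an illustration of the naming pattern visible in this thread (each task's executable gets an `ocrd-` prefix); the helper name `task_to_argv` is hypothetical and not part of OCR-D/core.

```python
import shlex
from typing import List


def task_to_argv(task: str) -> List[str]:
    """Split an `ocrd process` task string and prefix its executable with 'ocrd-'.

    Hypothetical helper: it only mirrors the naming pattern seen in this thread.
    """
    argv = shlex.split(task)
    argv[0] = "ocrd-" + argv[0]
    return argv


tasks = [
    "olena-binarize -I OCR-D-IMG -O OCR-D-BIN",
    "cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW",
    "anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP",
    "cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page",
    "tesserocr-recognize -I OCR-D-PAGE-SEG -O OCR-D-OCR -P model Fraktur",
]

# Print one standalone command per task, in workflow order.
for task in tasks:
    print(shlex.join(task_to_argv(task)))
```

Running the printed commands one at a time makes it easier to see which step fails and with which exit code.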

@beckstefan
Author

beckstefan commented Sep 17, 2020

docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN
2020-09-17 10:33:39,691.691 INFO ocrd-olena-binarize - processing image/jpeg input file OCR-D-IMG-0001.jpg (img-0001.jpg)
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
2020-09-17 10:33:51,541.541 INFO ocrd-olena-binarize - processing image/jpeg input file OCR-D-IMG-0002.jpg (img-0002.jpg)
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW -P level-of-operation page -l DEBUG
2020-09-17 10:36:14,290.290 INFO ocrd.process.profile - Executing processor 'ocrd-cis-ocropy-deskew' took 92.366838s [--input-file-grp='OCR-D-BIN' --output-file-grp='OCR-D-DESKEW' --parameter='{"level-of-operation": "page", "maxskew": 5.0}'
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-w545rw0z because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2020-09-17 10:38:34,867.867 INFO ocrd.process.profile - Executing processor 'ocrd-anybaseocr-crop' took 32.927905s [--input-file-grp='OCR-D-DESKEW' --output-file-grp='OCR-D-CROP' --parameter='{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, "operation_level": "page"}']
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page
ocrd@ocrd:~$ echo $?
137

@bertsky
Collaborator

bertsky commented Sep 17, 2020

Thanks @beckstefan, we are getting there... it looks like core does not even start up the processor (as there's no message from the profile logger). Could you please run the last step with -l DEBUG?

@beckstefan
Author

No change in output.

(Incidentally, based on the above workflow, I also tried tesser-ocr-segment and anybaseocr-block-segmentation and got very poor results, i.e. almost no segmentation was done, but that's a separate issue.)

@bertsky
Collaborator

bertsky commented Sep 17, 2020

> No change in output.

I see. Well turns out I was wrong, the profile message only appears after the processor ran – unless it crashed.

Exit 137 could mean your container went out of memory for some reason, and ocrd-cis-ocropy-segment might be inefficient with large images. None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)
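
For reference, the two numbers seen in this thread (-9 from `ocrd process`, 137 from the shell) can be decoded the same way. The sketch below assumes the usual conventions: Python's `subprocess` reports death by signal N as -N, and shells report it as 128 + N. The function name `describe_exit` is made up for this example, and a program can in principle also return a value above 128 on its own, so the result is a hint rather than proof.

```python
import signal


def describe_exit(code: int) -> str:
    """Name the signal behind a return code, or report a plain exit status."""
    if code < 0:
        # subprocess convention: -N means "terminated by signal N"
        signum = -code
    elif code > 128:
        # shell convention: 128 + N means "terminated by signal N"
        signum = code - 128
    else:
        return f"exit {code}"
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform
        return f"signal {signum}"


print(describe_exit(-9))   # value reported by ocrd process above
print(describe_exit(137))  # value reported by `echo $?` above
```

Both values point to signal 9 (SIGKILL), which is what the Linux out-of-memory killer sends.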

We really need to get our hands on the DEBUG level messages here. I'm afraid -l DEBUG being ineffective is a result of OCR-D/core#597. As a workaround, you could set all loggers to that level in your ocrd_logging.conf. For your Docker installation, that would entail:

  1. create a text file with the following content:
[loggers]
keys=root
[handlers]
keys=consoleHandler
[formatters]
keys=defaultFormatter
[logger_root]
level=DEBUG
handlers=consoleHandler
[handler_consoleHandler]
class=StreamHandler
formatter=defaultFormatter
args=(sys.stdout,)
[formatter_defaultFormatter]
format=%(levelname)s %(name)s - %(message)s
datefmt=%H:%M:%S
  2. spin up the container for the processor with additional options mounting that file:
docker run ... --mount type=bind,source=ocrd_logging.conf,destination=/etc/ocrd_logging.conf ...
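
Before mounting such a file, it can be checked locally. The sketch below assumes the file is consumed by Python's standard `logging.config.fileConfig` (the format suggests this, but it is an assumption about how core reads it). The helper name `load_debug_logging` is hypothetical. Note that the formatter key understood by the standard library is `datefmt`.

```python
import logging
import logging.config
import os
import tempfile

CONFIG = """\
[loggers]
keys=root
[handlers]
keys=consoleHandler
[formatters]
keys=defaultFormatter
[logger_root]
level=DEBUG
handlers=consoleHandler
[handler_consoleHandler]
class=StreamHandler
formatter=defaultFormatter
args=(sys.stdout,)
[formatter_defaultFormatter]
format=%(levelname)s %(name)s - %(message)s
datefmt=%H:%M:%S
"""


def load_debug_logging(config_text: str = CONFIG) -> logging.Logger:
    """Write the config to a temporary file, load it, and return the root logger."""
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as handle:
        handle.write(config_text)
        path = handle.name
    try:
        logging.config.fileConfig(path, disable_existing_loggers=False)
    finally:
        os.unlink(path)
    return logging.getLogger()


root = load_debug_logging()
# With the root logger at DEBUG, messages from any named logger are emitted.
logging.getLogger("demo").debug("visible at DEBUG level")
```

If the file has a syntax problem, `fileConfig` raises an exception here rather than inside the container.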

@beckstefan
Author

> Exit 137 could mean your container went out of memory for some reason, and ocrd-cis-ocropy-segment might be inefficient with large images. None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)

Shrugs. I should have looked up 137 myself. Indeed, memory consumption is enormous. The image has a size of 4987x6199px at 400dpi. I downscaled to 300dpi and the process finished.
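
As a rough illustration of what that downscaling means in numbers: going from 400 to 300 dpi scales each side by 0.75, so the pixel count drops to about (300/400)² = 0.5625 of the original. The sketch below only does the arithmetic; `scaled_size` is a hypothetical helper, and the 8 bytes per pixel is an assumed figure for a single float64 working copy, not a measurement of any processor.

```python
def scaled_size(width: int, height: int, src_dpi: float, dst_dpi: float):
    """Return the (width, height) in pixels after resampling to dst_dpi."""
    factor = dst_dpi / src_dpi
    return round(width * factor), round(height * factor)


width, height = 4987, 6199            # image size reported above, at 400 dpi
new_width, new_height = scaled_size(width, height, 400, 300)

pixels_before = width * height
pixels_after = new_width * new_height
pixel_ratio = pixels_after / pixels_before

# Assumption for illustration only: one float64 array (8 bytes per pixel).
BYTES_PER_PIXEL = 8
megabytes_before = pixels_before * BYTES_PER_PIXEL / 1e6
megabytes_after = pixels_after * BYTES_PER_PIXEL / 1e6

print(new_width, new_height)
print(f"pixel ratio: {pixel_ratio:.3f}")
print(f"one float64 copy: {megabytes_before:.0f} MB -> {megabytes_after:.0f} MB")
```

A processor typically holds several such arrays at once, so the actual footprint is some multiple of this figure.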

> None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)

Generally speaking, our standard is 400dpi, and newspapers especially tend to be big. Can you roughly estimate:

  • Memory consumption for a given image resolution
  • How downscaling affects recognition quality

I know that there are no strict answers, but a rough tendency would be nice, knowing that it won't apply in particular cases.

And for completeness the output:

docker run --rm -u $(id -u) --mount type=bind,source=/home/ocrd/ocrd_logging.conf,destination=/etc/ocrd_logging.conf -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page -l DEBUG
DEBUG ocrd.resolver.download_to_directory - directory=|/data| url=|/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/mets.xml' (url: '/data/mets.xml')
DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 65536
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-2493.5 -3099.5]
DEBUG ocrd_utils.coords.rotate_coordinates - rotating coordinates by 0.50° around [2493.5 3099.5]
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [2520.45295194 3121.1415968 ]
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-59 -80]

@bertsky
Collaborator

bertsky commented Sep 17, 2020

> Generally speaking, our standard is 400dpi, and newspapers especially tend to be big

Indeed. OCR-D of course was developed mainly focussing on printed books. Existing processors don't downscale by themselves, and we have not yet allowed making downscaled annotations as a preprocessing step. (That's because we first need PAGE-XML to support representing scale.)

I don't think it's necessary to change the functional model to support crop-based partial processing though. Machines will become more powerful, while newspapers don't grow.

> can you roughly estimate
>
> * Memory consumption for a given image resolution

Phew, that's a tough (but good) question. We've made some runtime performance statistics, but without varying/factoring DPI and without looking at memory consumption yet.

Generally, most rule-based processors will use algorithms of at least O(n²) in pixel resolution. But for the constant of proportionality I don't even have a ballpark number. I could give you anecdotal measurements, but I have no statistics yet. Maybe we'll start gathering this though.
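
One way to collect such anecdotal measurements is to record the peak resident memory of a processor run. The sketch below uses only the Python standard library on Unix; `run_and_measure` is a hypothetical helper. It assumes the processor runs as a direct child process (natively or inside the container), since wrapping `docker run` would measure the Docker client rather than the processor.

```python
import resource
import subprocess
import sys
from typing import List, Tuple


def run_and_measure(argv: List[str]) -> Tuple[int, int]:
    """Run a command and return (returncode, peak RSS of waited-for children).

    ru_maxrss is the maximum over all children waited for so far, so use a
    fresh Python process per measurement. Units: kilobytes on Linux.
    """
    completed = subprocess.run(argv)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak


# Stand-in workload: a child process that allocates about 50 MB.
code, peak = run_and_measure(
    [sys.executable, "-c", "buf = bytearray(50 * 1024 * 1024)"]
)
print(code, peak)
```

Replacing the stand-in command with a processor call, and repeating it for the same page at several resolutions, would give the kind of rough tendency asked about.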

> * How downscaling affects recognition quality

300 DPI should always be good enough. Some processors (esp. for preprocessing and segmentation) may even perform suboptimally at higher (> 500 DPI) resolutions (if they are badly written, with fixed parameters assuming a certain density).

@bertsky
Collaborator

bertsky commented Sep 17, 2020

Maybe you should open an issue on OCR-D/ocrd-website for documenting (rough estimates of) resource requirements.

Can we close?
