implement AlternativeImage-based processing #48

bertsky · 2019-06-28T09:38:17Z

Fixes #26 (superseding #27) and #33 – the latter of course leaves some open points to be discussed in the spec. The idea and the functions in common (for adoption in core) are the same as for ocropy.

This also supersedes #34 (which this branch is based on; I was not allowed to force-push there after a clean rebase).

Via `AnalyseLayout` and `OSD`, tesseract can determine the skew angle for images. The new wrapper applies this to pages and regions. It is not clear yet how to save the estimated skew angle in PAGE XML. Cf. PRImA-Research-Lab/PAGE-XML#9
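
A minimal sketch of the angle handling, for illustration only: it assumes the estimate arrives in radians with counter-clockwise as positive, and that the annotated value should be in clockwise degrees within (-180, 180]. The function name `page_angle_from_skew` and its `orientation` argument are hypothetical and not taken from this PR's code.

```python
import math


def page_angle_from_skew(skew_radians, orientation=0):
    """Convert a skew estimate into a normalized angle in degrees.

    Assumptions (not confirmed by the PR): skew_radians is positive
    counter-clockwise; the result is positive clockwise; orientation
    is an extra multiple of 90 degrees to add (0, 90, 180 or 270).
    """
    # Flip the sign to switch from counter-clockwise to clockwise.
    angle = -math.degrees(skew_radians)
    # Add the coarse page/region orientation.
    angle += orientation
    # Map into the half-open interval (-180, 180].
    return 180 - (180 - angle) % 360


if __name__ == "__main__":
    print(page_angle_from_skew(math.radians(2.0)))
    print(page_angle_from_skew(0.0, orientation=270))
```

To rotate the image itself by the same amount, Pillow's `Image.rotate` takes counter-clockwise degrees, so the sign would have to be flipped back before calling it.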

- base all processors on AlternativeImage
- make all processors create a parameter MetadataItem
- make all processors create output file names from the input files, and use .xml extension for PAGE
- introduce a `common` module along the lines of the ocropy wrapper (but without ocropy-specific segmentation), i.e. functions to be moved into core:
  - polygon_mask
  - rotate_polygon
  - image_from_page
  - image_from_region
  - image_from_line
  - image_from_word
  - image_from_glyph
  - save_image_file
  - bbox_from_points
  - points_from_bbox
  - xywh_from_bbox
  - bbox_from_xywh
  - points_from_polygon
- in crop:
  - set textord_tabfind_find_tables=0 (because with table detection, the hinge often gets confused with a table column)
  - if a Border already exists, warn that it will be overwritten
  - if TextRegions already exist, calculate their common extent and warn it will be ignored
  - use PSM.SPARSE_TEXT instead of PSM.AUTO (so no image regions creep into neighbouring pages)
  - ignore regions which are empty after binarization
  - ignore regions with tiny width or height (< 30px)
  - add a padding to the result on all sides (4px)
  - do not annotate a (wrong) Border if no regions have been found
- in deskew:
  - convert skewing angle from radians to degrees, and mind the direction (clockwise in PAGE, but mathematically positive in Pillow) and map to the numeric interval (-179,180)
  - add orientation (+90/180/270) to skewing angle
  - also rotate the raw image of the page/region (expand and fill with white) and store as file; reference in METS (under OCR-D-IMG-DESKEW) and in PAGE (as AlternativeImage, with appropriate comments)
  - annotate writing direction and textline order in PAGE too
  - use OSD (DetectOrientationScript) in addition to layout analysis (AnalyseLayout/Orientation), with confidence thresholds (>= 10): ensure that orientation is consistent between both (and in case of conflict, use the former), also annotate primary script; init appropriately (i.e. load "osd", use legacy OEM and AUTO_OSD)
  - on region level, process TableRegions as well
  - change default operation_level to region (because we still cannot annotate orientation on page level)
- in segment_region:
  - add parameter `find_tables` (default: true) to allow disabling table detection (textord_tabfind_find_tables=0), so they can be analysed into independent text/sep regions
  - add parameter `overwrite_regions` (default: true) to allow enabling removal of any existing text regions
  - unconditionally remove any existing non-text regions and reading order groups
  - cover PT.VERTICAL_TEXT (as TextRegionType) and PT.TABLE (as TableRegionType)
  - use BlockPolygon (if present) to annotate polygon outline in Coords (commented out for now, because the patch against tesserocr segfaults still awaits merge)
  - add parameter `crop_polygons` (default: false) to enable: retrieve the raw region image along the (internal) polygon outline, store image as file, and reference in METS (under OCR-D-IMG-CROP) and in PAGE (as AlternativeImage)
- in segment_line, add parameter `overwrite_lines` (default: true) to allow enabling removal of any existing text lines
- in segment_word, add parameter `overwrite_words` (default: true) to allow enabling removal of any existing words
- new processor binarize:
  - operate on page, region or line level
  - retrieve cropped, raw page/region/line image, then enter PSM.AUTO/SINGLE_BLOCK/SINGLE_LINE, run layout analysis on the image, retrieve the binary image for RIL.BLOCK/TEXTLINE, store image as file, and reference in METS (under OCR-D-IMG-BIN) and in PAGE (as AlternativeImage)
- improve docstrings
- remove redundant locale workaround from config (already in __init__)
- version 0.2.2 → 0.2.3
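
For orientation, the coordinate helpers named for the `common` module could look roughly like the sketch below. Only the function names come from the list above; the signatures, the dict layout for `xywh`, and the bodies are assumptions for illustration, not the code in this PR. The points format assumed is the PAGE convention of space-separated `x,y` pairs.

```python
def bbox_from_points(points):
    """Return (minx, miny, maxx, maxy) for a PAGE points string."""
    pairs = [tuple(int(v) for v in pair.split(',')) for pair in points.split()]
    xs, ys = zip(*pairs)
    return min(xs), min(ys), max(xs), max(ys)


def points_from_bbox(minx, miny, maxx, maxy):
    """Return a PAGE points string tracing the box clockwise from top-left."""
    return "%i,%i %i,%i %i,%i %i,%i" % (
        minx, miny, maxx, miny, maxx, maxy, minx, maxy)


def xywh_from_bbox(minx, miny, maxx, maxy):
    """Return the box as a dict with origin, width and height."""
    return {'x': minx, 'y': miny, 'w': maxx - minx, 'h': maxy - miny}


def bbox_from_xywh(xywh):
    """Inverse of xywh_from_bbox."""
    return (xywh['x'], xywh['y'],
            xywh['x'] + xywh['w'], xywh['y'] + xywh['h'])


if __name__ == "__main__":
    box = bbox_from_points("10,20 110,20 110,70 10,70")
    print(box)
    print(xywh_from_bbox(*box))
    print(points_from_bbox(*box))
```

Keeping these as pure functions over plain tuples and strings would make them easy to move into core without pulling in any Tesseract-specific dependency.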

wrznr

That's a great, great step forward. Smaller PRs would be helpful in future! Two important comments:

Let's put the "common" parts to core asap.
Let's minimize the number of constants set in the code and expose them as parameters with sane defaults.
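
One way to read the second point, as a rough sketch: collect the constants in a single defaults table and let user-supplied parameters override them. The parameter names `padding` and `min_size` are hypothetical; the values 4 and 30 are the crop constants mentioned in the PR description. This does not use the actual OCR-D parameter mechanism.

```python
# Hypothetical defaults; values taken from the crop step in the description.
DEFAULTS = {
    "padding": 4,    # pixels added on all sides of the crop result
    "min_size": 30,  # regions narrower or lower than this are ignored
}


def resolve_parameters(user_params=None):
    """Merge user-supplied parameters over the defaults.

    Unknown keys are rejected so that typos do not pass silently.
    """
    user_params = dict(user_params or {})
    unknown = set(user_params) - set(DEFAULTS)
    if unknown:
        raise ValueError("unknown parameters: %s" % sorted(unknown))
    params = dict(DEFAULTS)
    params.update(user_params)
    return params


if __name__ == "__main__":
    print(resolve_parameters())
    print(resolve_parameters({"padding": 8}))
```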

.pylintrc

ocrd_tesserocr/binarize.py

ocrd_tesserocr/common.py

ocrd_tesserocr/crop.py

ocrd_tesserocr/deskew.py

ocrd_tesserocr/recognize.py

ocrd_tesserocr/segment_region.py

setup.py

wrznr · 2019-07-03T10:13:55Z

ocrd_tesserocr/binarize.py

+
+TOOL = 'ocrd-tesserocr-binarize'
+LOG = getLogger('processor.TesserocrBinarize')
+FILEGRP_IMG = 'OCR-D-IMG-BIN'


This can only be a temporary solution (cf. OCR-D/spec#117).

wrznr and others added 2 commits June 26, 2019 23:19

[WIP] Add deskewing per tesseract

b598dee

Via `AnalyseLayout` and `OSD`, tesseract can determine the skew angle for images. The new wrapper applies this to pages and regions. It is not clear yet how to save the estimated skew angle in PAGE XML. Cf. PRImA-Research-Lab/PAGE-XML#9

bertsky mentioned this pull request Jun 28, 2019

[WIP] Add deskewing per tesseract #34

Closed

bertsky requested review from wrznr and kba June 28, 2019 09:40

This was referenced Jun 28, 2019

separate preprocessing steps and use AlternativeImage in ocropy wrappers cisocrgroup/ocrd_cis#10

Merged

more intuitive ID for output file, #26 #27

Closed

wrznr suggested changes Jul 3, 2019

wrznr mentioned this pull request Jul 3, 2019

Implement multiple output file groups per processor OCR-D/spec#117

Closed

bertsky added 3 commits July 3, 2019 12:08

change new version

3a4b8e6

remove resegmentation from common

6eaba4a

expose parameters, improve docstrings

8f6b94d

wrznr reviewed Jul 3, 2019

wrznr approved these changes Jul 3, 2019

no trace logging yet

8e3b953

bertsky merged commit 585414c into OCR-D:master Jul 3, 2019

bertsky deleted the alternative-image branch July 3, 2019 10:27
