implement AlternativeImage-based processing #48
Via `AnalyseLayout` and `OSD`, tesseract can determine the skew angle for images. The new wrapper applies this to pages and regions. It is not clear yet how to save the estimated skew angle in PAGE XML. Cf. PRImA-Research-Lab/PAGE-XML#9
- base all processors on AlternativeImage
- make all processors create a parameter MetadataItem
- make all processors create output file names from the input files, and use .xml extension for PAGE
- introduce a `common` module along the lines of the ocropy wrapper (but without ocropy-specific segmentation), i.e. functions to be moved into core:
  - polygon_mask
  - rotate_polygon
  - image_from_page
  - image_from_region
  - image_from_line
  - image_from_word
  - image_from_glyph
  - save_image_file
  - bbox_from_points
  - points_from_bbox
  - xywh_from_bbox
  - bbox_from_xywh
  - points_from_polygon
- in crop:
  - set textord_tabfind_find_tables=0 (because with table detection, the hinge often gets confused with a table column)
  - if a Border already exists, warn that it will be overwritten
  - if TextRegions already exist, calculate their common extent and warn it will be ignored
  - use PSM.SPARSE_TEXT instead of PSM.AUTO (so no image regions creep into neighbouring pages)
  - ignore regions which are empty after binarization
  - ignore regions with tiny width or height (< 30px)
  - add a padding to the result on all sides (4px)
  - do not annotate a (wrong) Border if no regions have been found
- in deskew:
  - convert skewing angle from radians to degrees, and mind the direction (clockwise in PAGE, but mathematically positive in Pillow) and map to the numeric interval (-179,180)
  - add orientation (+90/180/270) to skewing angle
  - also rotate the raw image of the page/region (expand and fill with white) and store as file; reference in METS (under OCR-D-IMG-DESKEW) and in PAGE (as AlternativeImage, with appropriate comments)
  - annotate writing direction and textline order in PAGE too
  - use OSD (DetectOrientationScript) in addition to layout analysis (AnalyseLayout/Orientation), with confidence thresholds (>= 10): ensure that orientation is consistent between both (and in case of conflict, use the former), also annotate primary script; init appropriately (i.e. load "osd", use legacy OEM and AUTO_OSD)
  - on region level, process TableRegions as well
  - change default operation_level to region (because we still cannot annotate orientation on page level)
- in segment_region:
  - add parameter `find_tables` (default: true) to allow disabling table detection (textord_tabfind_find_tables=0), so they can be analysed into independent text/sep regions
  - add parameter `overwrite_regions` (default: true) to allow enabling removal of any existing text regions
  - unconditionally remove any existing non-text regions and reading order groups
  - cover PT.VERTICAL_TEXT (as TextRegionType) and PT.TABLE (as TableRegionType)
  - use BlockPolygon (if present) to annotate polygon outline in Coords (commented out for now, because the patch against tesserocr segfaults is still awaiting merge)
  - add parameter `crop_polygons` (default: false) to enable: retrieve the raw region image along the (internal) polygon outline, store image as file, and reference in METS (under OCR-D-IMG-CROP) and in PAGE (as AlternativeImage)
- in segment_line, add parameter `overwrite_lines` (default: true) to allow enabling removal of any existing text lines
- in segment_word, add parameter `overwrite_words` (default: true) to allow enabling removal of any existing words
- new processor binarize:
  - operate on page, region or line level
  - retrieve cropped, raw page/region/line image, then enter PSM.AUTO/SINGLE_BLOCK/SINGLE_LINE, and run layout analysis on the image, retrieve the binary image for RIL.BLOCK/TEXTLINE, store image as file, and reference in METS (under OCR-D-IMG-BIN) and in PAGE (as AlternativeImage)
- improve docstrings
- remove redundant locale workaround from config (already in `__init__`)
- version 0.2.2 → 0.2.3
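The angle handling listed under deskew can be illustrated with a small, self-contained sketch. This is not the code from this PR: the function names `normalize_angle` and `combine_angles` are hypothetical, and the sketch assumes that both the skew (in radians) and the orientation (in degrees) are given in the mathematically positive (counter-clockwise) sense. It also reads the interval "(-179,180)" as the half-open range (-180, 180].

```python
import math
from typing import Tuple


def normalize_angle(degrees: float) -> float:
    """Map an angle in degrees to the half-open interval (-180, 180]."""
    angle = degrees % 360.0  # Python's % yields a value in [0, 360)
    if angle > 180.0:
        angle -= 360.0
    return angle


def combine_angles(skew_radians: float, orientation_degrees: float = 0.0) -> Tuple[float, float]:
    """Return (pillow_angle, page_angle) in degrees.

    Assumption: both inputs are counter-clockwise positive. Pillow's
    Image.rotate uses the same sense, while PAGE counts clockwise,
    so the PAGE value is the negated sum, normalized again.
    """
    total = math.degrees(skew_radians) + orientation_degrees
    pillow_angle = normalize_angle(total)
    page_angle = normalize_angle(-total)
    return pillow_angle, page_angle
```

For example, a skew of 2 degrees on a page that is additionally turned by 90 degrees yields roughly `(92.0, -92.0)`; an upside-down page without skew yields `(180.0, 180.0)`, since -180 is mapped back to 180.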
That's a great, great step forward. Smaller PRs would be helpful in future! Two important comments:
- Let's put the "common" parts to `core` asap.
- Let's minimize the number of constants set in the code and expose them as parameters with sane defaults.
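As an illustration of the second point, the crop heuristics from the commit message (ignore regions smaller than 30px, pad the result by 4px, annotate nothing if no regions remain) could be written with the constants exposed as parameters. This is only a sketch under assumptions: the function name `common_extent` and the parameter names `min_size` and `padding` are hypothetical, and clamping the padded box to the image size is an added assumption, not something stated in this PR.

```python
from typing import Iterable, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def common_extent(boxes: Iterable[BBox],
                  image_size: Tuple[int, int],
                  min_size: int = 30,
                  padding: int = 4) -> Optional[BBox]:
    """Join all sufficiently large boxes into one padded bounding box.

    Returns None if no box passes the size filter, so the caller can
    skip annotating a (wrong) Border.
    """
    # Ignore regions with tiny width or height.
    kept = [b for b in boxes
            if b[2] - b[0] >= min_size and b[3] - b[1] >= min_size]
    if not kept:
        return None
    x0 = min(b[0] for b in kept)
    y0 = min(b[1] for b in kept)
    x1 = max(b[2] for b in kept)
    y1 = max(b[3] for b in kept)
    width, height = image_size
    # Pad on all sides, but stay inside the image (assumption).
    return (max(0, x0 - padding), max(0, y0 - padding),
            min(width, x1 + padding), min(height, y1 + padding))
```

With the defaults above, `common_extent([(10, 10, 100, 200), (50, 50, 60, 400)], (500, 500))` drops the second box (only 10px wide) and returns `(6, 6, 104, 204)`. In an OCR-D processor, such values would presumably be declared in the tool's parameter description rather than as function defaults.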
TOOL = 'ocrd-tesserocr-binarize'
LOG = getLogger('processor.TesserocrBinarize')
FILEGRP_IMG = 'OCR-D-IMG-BIN'
This can only be a temporary solution (cf. OCR-D/spec#117).
Fixes #26 (superseding #27) and #33; the latter of course leaves some open points to be discussed in the spec. The idea and the functions in `common` (for adoption in core) are the same as for ocropy. This also supersedes #34 (which I based this on, but I was not allowed to force-push there after a clean rebase).
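To give an idea of what the coordinate helpers in `common` deal with, here is a minimal sketch of four of them. Only the function names are taken from the commit message; the signatures and return types are assumptions and may differ from the actual code in this PR or in core. The sketch relies on PAGE storing `Coords/@points` as space-separated `x,y` pairs.

```python
from typing import Dict, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def bbox_from_points(points: str) -> BBox:
    """Bounding box of a PAGE points string such as '10,20 110,20 110,70'."""
    xys = [tuple(int(v) for v in pair.split(',')) for pair in points.split()]
    xs = [xy[0] for xy in xys]
    ys = [xy[1] for xy in xys]
    return min(xs), min(ys), max(xs), max(ys)


def points_from_bbox(x0: int, y0: int, x1: int, y1: int) -> str:
    """Rectangle as a PAGE points string, clockwise from the top-left corner."""
    return f"{x0},{y0} {x1},{y0} {x1},{y1} {x0},{y1}"


def xywh_from_bbox(x0: int, y0: int, x1: int, y1: int) -> Dict[str, int]:
    """Convert corner coordinates to origin plus width and height."""
    return {'x': x0, 'y': y0, 'w': x1 - x0, 'h': y1 - y0}


def bbox_from_xywh(xywh: Dict[str, int]) -> BBox:
    """Inverse of xywh_from_bbox."""
    return (xywh['x'], xywh['y'],
            xywh['x'] + xywh['w'], xywh['y'] + xywh['h'])
```

The conversions are inverses of each other for rectangles, e.g. `bbox_from_points(points_from_bbox(10, 20, 110, 70))` gives back `(10, 20, 110, 70)`.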