implement AlternativeImage-based processing #48

Merged: 6 commits, Jul 3, 2019

@bertsky (Collaborator) commented Jun 28, 2019

Fixes #26 (superseding #27) and #33 – the latter of course leaves some open points to be discussed in the spec. The idea and the shared functions (intended for adoption in core) are the same as in the ocropy wrapper.

This also supersedes #34 (which I based this on, but I was not allowed to force-push there after a clean rebase).

wrznr and others added 2 commits June 26, 2019 23:19
Via `AnalyseLayout` and `OSD`, tesseract can determine the skew angle
for images. The new wrapper applies this to pages and regions. It is
not clear yet how to save the estimated skew angle in PAGE
XML. Cf. PRImA-Research-Lab/PAGE-XML#9
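
The angle handling can be sketched in a few lines. This is a minimal illustration, not the wrapper's code: the function name `page_angle_from_skew` and its parameters are hypothetical, the input is assumed to be a counter-clockwise-positive skew estimate in radians, and the target interval (-180, 180] is an assumption as well.

```python
import math


def page_angle_from_skew(skew_radians, orientation_degrees=0):
    """Convert a skew estimate into a clockwise-positive angle in degrees.

    Assumptions (not confirmed by the PR): `skew_radians` is
    counter-clockwise positive, and `orientation_degrees` is one of
    0, 90, 180 or 270.
    """
    # Flip the sign: counter-clockwise positive -> clockwise positive.
    angle = -math.degrees(skew_radians)
    # Add the coarse orientation on top of the fine skew.
    angle += orientation_degrees
    # Wrap the result into the half-open interval (-180, 180].
    return 180.0 - (180.0 - angle) % 360.0


if __name__ == "__main__":
    print(page_angle_from_skew(math.radians(2.5)))
    print(page_angle_from_skew(0.0, orientation_degrees=270))
```

Keeping sign flip, orientation offset and wrapping in one place makes it easier to check that the value written to PAGE and the value passed to the image rotation differ only in sign.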
- base all processors on AlternativeImage
- make all processors create a parameter MetadataItem
- make all processors create output file names
  from the input files, and use .xml extension for PAGE
- introduce a `common` module along the lines
  of the ocropy wrapper (but without
  ocropy-specific segmentation), i.e. functions
  to be moved into core:
  - polygon_mask
  - rotate_polygon
  - image_from_page
  - image_from_region
  - image_from_line
  - image_from_word
  - image_from_glyph
  - save_image_file
  - bbox_from_points
  - points_from_bbox
  - xywh_from_bbox
  - bbox_from_xywh
  - points_from_polygon
- in crop:
  - set textord_tabfind_find_tables=0 (because with
    table detection, the hinge often gets confused
    with a table column)
  - if a Border already exists, warn that it will
    be overwritten
  - if TextRegions already exist, calculate their
    common extent and warn it will be ignored
  - use PSM.SPARSE_TEXT instead of PSM.AUTO (so
    no image regions creep into neighbouring pages)
  - ignore regions which are empty after binarization
  - ignore regions with tiny width or height (< 30px)
  - add a padding to the result on all sides (4px)
  - do not annotate a (wrong) Border if no regions
    have been found
- in deskew:
  - convert skewing angle from radians to degrees,
    and mind the direction (clockwise in PAGE, but
    mathematically positive in Pillow) and map to
    the numeric interval (-179,180)
  - add orientation (+90/180/270) to skewing angle
  - also rotate the raw image of the page/region
    (expand and fill with white) and store as file;
    reference in METS (under OCR-D-IMG-DESKEW) and
    in PAGE (as AlternativeImage, with appropriate
    comments)
  - annotate writing direction and textline order
    in PAGE too
  - use OSD (DetectOrientationScript) in addition to
    layout analysis (AnalyseLayout/Orientation), with
    confidence thresholds (>= 10):
    ensure that orientation is consistent between both
    (and in case of conflict, use the former), also
    annotate primary script; init appropriately
    (i.e. load "osd", use legacy OEM and AUTO_OSD)
  - on region level, process TableRegions as well
  - change default operation_level to region (because
    we still cannot annotate orientation on page level)
- in segment_region:
  - add parameter `find_tables` (default: true) to allow
    disabling table detection (textord_tabfind_find_tables=0),
    so they can be analysed into independent text/sep regions
  - add parameter `overwrite_regions` (default: true) to
    allow enabling removal of any existing text regions
  - unconditionally remove any existing non-text regions
    and reading order groups
  - cover PT.VERTICAL_TEXT (as TextRegionType) and PT.TABLE
    (as TableRegionType)
  - use BlockPolygon (if present) to annotate polygon outline
    in Coords – but comment this out for now, because the tesserocr
    patch fixing the segfaults still awaits merge
  - add parameter `crop_polygons` (default: false) to enable:
    retrieve the raw region image along the (internal) polygon
    outline, store image as file, and reference in METS (under
    OCR-D-IMG-CROP) and in PAGE (as AlternativeImage)
- in segment_line, add parameter `overwrite_lines` (default: true) to
    allow enabling removal of any existing text lines
- in segment_word, add parameter `overwrite_words` (default: true) to
    allow enabling removal of any existing words
- new processor binarize:
  - operate on page, region or line level
  - retrieve cropped, raw page/region/line image, then enter
    PSM.AUTO/SINGLE_BLOCK/SINGLE_LINE, and run layout analysis
    on the image, retrieve the binary image for RIL.BLOCK/TEXTLINE
    store image as file, and reference in METS (under OCR-D-IMG-BIN)
    and in PAGE (as AlternativeImage)
- improve docstrings
- remove redundant locale workaround from config
  (already in __init__)
- version 0.2.2 → 0.2.3
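
As an illustration of the coordinate helpers named in the `common` list above, here is a minimal sketch of four of them. The function names come from the list, but the signatures and return types are assumptions, not the module's actual API; the only fixed point is PAGE's `x,y x,y ...` syntax for `Coords/@points`.

```python
def points_from_bbox(minx, miny, maxx, maxy):
    """Serialize a bounding box as a PAGE points string.

    Corners are listed clockwise (in image coordinates), starting top-left.
    """
    return "%d,%d %d,%d %d,%d %d,%d" % (
        minx, miny, maxx, miny, maxx, maxy, minx, maxy)


def bbox_from_points(points):
    """Parse a PAGE points string and return (minx, miny, maxx, maxy)."""
    polygon = [tuple(int(v) for v in pair.split(","))
               for pair in points.split()]
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def xywh_from_bbox(minx, miny, maxx, maxy):
    """Convert corner coordinates to an origin-plus-size dictionary."""
    return {"x": minx, "y": miny, "w": maxx - minx, "h": maxy - miny}


def bbox_from_xywh(xywh):
    """Convert an origin-plus-size dictionary back to corner coordinates."""
    return (xywh["x"], xywh["y"],
            xywh["x"] + xywh["w"], xywh["y"] + xywh["h"])


if __name__ == "__main__":
    box = (10, 20, 110, 70)
    print(points_from_bbox(*box))
    print(bbox_from_xywh(xywh_from_bbox(*box)) == box)
```

Each pair of functions is meant to be a mutual inverse, so a round trip through either representation should return the original box.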

@wrznr (Contributor) left a comment

That's a great, great step forward. Smaller PRs would be helpful in future! Two important comments:

  1. Let's put the "common" parts to core asap.
  2. Let's minimize the number of constants set in the code and expose them as parameters with sane defaults.
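
To illustrate the second point, here is a hedged sketch of the crop logic with its two constants (minimum region size 30px, padding 4px) turned into keyword parameters with defaults. The function `crop_extent`, its signature, and the clamping to the page bounds are assumptions made for this example, not code from the PR.

```python
def crop_extent(boxes, page_width, page_height, min_size=30, padding=4):
    """Compute a padded common extent of region boxes, or None.

    `boxes` is an iterable of (minx, miny, maxx, maxy) tuples.
    `min_size` and `padding` mirror the constants named in the commit
    message; exposing them as parameters is the reviewer's suggestion.
    """
    # Ignore regions whose width or height is below the minimum size.
    kept = [(x0, y0, x1, y1) for (x0, y0, x1, y1) in boxes
            if x1 - x0 >= min_size and y1 - y0 >= min_size]
    if not kept:
        # No usable regions: the caller should not annotate a Border.
        return None
    x0 = min(b[0] for b in kept) - padding
    y0 = min(b[1] for b in kept) - padding
    x1 = max(b[2] for b in kept) + padding
    y1 = max(b[3] for b in kept) + padding
    # Assumption: the padded box is clamped to the page bounds.
    return (max(0, x0), max(0, y0),
            min(page_width, x1), min(page_height, y1))


if __name__ == "__main__":
    regions = [(100, 100, 500, 400), (600, 50, 620, 60), (50, 300, 300, 900)]
    print(crop_extent(regions, 1000, 1000))
    print(crop_extent(regions, 1000, 1000, min_size=5, padding=0))
```

With the defaults in the signature, a processor could pass user-supplied values straight through while callers that do not care keep the current behaviour.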

.pylintrc (outdated)
ocrd_tesserocr/binarize.py (outdated)
ocrd_tesserocr/common.py (outdated)
ocrd_tesserocr/crop.py (outdated)
ocrd_tesserocr/crop.py (outdated)
ocrd_tesserocr/deskew.py (outdated)
ocrd_tesserocr/deskew.py
ocrd_tesserocr/recognize.py
ocrd_tesserocr/segment_region.py (outdated)
setup.py (outdated)

TOOL = 'ocrd-tesserocr-binarize'
LOG = getLogger('processor.TesserocrBinarize')
FILEGRP_IMG = 'OCR-D-IMG-BIN'

This can only be a temporary solution (cf. OCR-D/spec#117).

@bertsky bertsky merged commit 585414c into OCR-D:master Jul 3, 2019
@bertsky bertsky deleted the alternative-image branch July 3, 2019 10:27
Successfully merging this pull request may close these issues.

File names