PAGE validator: respect reading/textline order/direction #442

Merged
kba merged 7 commits into OCR-D:master from page-validator-text-consistency-order
May 27, 2020


Collaborator

@bertsky bertsky commented Feb 18, 2020

(in multi-level text consistency checks)

This takes existing ideas from ocrd_tesserocr.recognize.page_update_higher_textequiv_levels() into the validator (only slightly more general).

This should be considered a (late) follow-up on OCR-D/assets#16. In particular, one should verify the refined rules look better on real-world GT which contains (concatenated) subregions, non-default @textLineOrder and non-default @readingDirection, and then re-visit the existing textual consistency rules laid out in the spec.
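As a rough illustration of the idea (not the validator's actual code), the sketch below concatenates text bottom-up while honouring @textLineOrder and @readingDirection. The classes `Line` and `Region` and the functions `line_text`, `region_text` and `is_consistent` are hypothetical names. It also assumes that child elements are stored in top-to-bottom and left-to-right document order, so that the reversed orders simply flip the sequence, and that words are joined with a space and lines with a newline.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: these two values mean "iterate the children in reverse".
REVERSED = {"bottom-to-top", "right-to-left"}


@dataclass
class Line:
    """Hypothetical stand-in for a PAGE TextLine with its word texts."""
    words: List[str]
    reading_direction: str = "left-to-right"


@dataclass
class Region:
    """Hypothetical stand-in for a PAGE TextRegion with its lines."""
    lines: List[Line] = field(default_factory=list)
    text_line_order: str = "top-to-bottom"


def line_text(line: Line) -> str:
    """Join the words of a line, reversing them for reversed directions."""
    words = line.words[::-1] if line.reading_direction in REVERSED else line.words
    return " ".join(words)


def region_text(region: Region) -> str:
    """Join the lines of a region, reversing them for reversed line orders."""
    lines = region.lines[::-1] if region.text_line_order in REVERSED else region.lines
    return "\n".join(line_text(line) for line in lines)


def is_consistent(annotated: str, region: Region) -> bool:
    """Compare the region-level text with the concatenation of its lines."""
    return annotated == region_text(region)
```

With this interpretation, a region annotated with a non-default @textLineOrder is only reported as inconsistent if its text differs from the lines joined in the declared order, rather than in plain document order.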

@bertsky bertsky requested a review from kba February 18, 2020 08:11

codecov-io commented Feb 18, 2020

Codecov Report

Merging #442 into master will decrease coverage by 2.09%.
The diff coverage is 59.30%.


@@            Coverage Diff             @@
##           master     #442      +/-   ##
==========================================
- Coverage   81.93%   79.83%   -2.10%     
==========================================
  Files          39       40       +1     
  Lines        2314     2440     +126     
  Branches      426      463      +37     
==========================================
+ Hits         1896     1948      +52     
- Misses        346      408      +62     
- Partials       72       84      +12     
| Impacted Files | Coverage Δ |
|---|---|
| ..._validators/ocrd_validators/workspace_validator.py | 84.61% <ø> (ø) |
| ocrd_validators/ocrd_validators/page_validator.py | 78.07% <59.30%> (-3.18%) ⬇️ |
| ocrd_utils/ocrd_utils/logging.py | 96.15% <0.00%> (-3.85%) ⬇️ |
| ocrd_utils/ocrd_utils/__init__.py | 66.01% <0.00%> (-0.92%) ⬇️ |
| ocrd/ocrd/cli/workspace.py | 71.79% <0.00%> (-0.36%) ⬇️ |
| ocrd_utils/ocrd_utils/constants.py | 100.00% <0.00%> (ø) |
| ocrd_models/ocrd_models/ocrd_page.py | 100.00% <0.00%> (ø) |
| ocrd_models/ocrd_page_user_methods.py | 0.00% <0.00%> (ø) |
| ocrd_models/ocrd_models/ocrd_mets.py | 94.61% <0.00%> (+0.06%) ⬆️ |
| ocrd/ocrd/processor/base.py | 88.04% <0.00%> (+0.68%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d7f715f...fc31557.

Member

@kba kba left a comment


Looks great insofar as I can follow. I would really appreciate tests. We have some samples with ReadingOrder but they are very basic. Can we provide better test data? @tboenig

Collaborator Author

bertsky commented Apr 21, 2020

@kba the last commit makes validation report a negative result instead of raising an exception when coords have fewer than 4 points or baselines fewer than 2.
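A minimal sketch of such a guard follows, under stated assumptions: `check_point_count` and the two constants are hypothetical names, not the validator's API, and the thresholds are simply the ones quoted above. The point is only that too few points yield an error entry in the report rather than an exception further down in the geometry code.

```python
from typing import List

# Thresholds as quoted in the comment above (assumed, not taken from the code).
MIN_COORDS_POINTS = 4
MIN_BASELINE_POINTS = 2


def check_point_count(element_id: str, points: str, minimum: int) -> List[str]:
    """Return a list of error messages (empty if the point count suffices).

    `points` is a PAGE-style points string such as "0,0 10,0 10,10 0,10".
    Only the number of whitespace-separated pairs is checked here.
    """
    count = len(points.split())
    if count < minimum:
        return [f"{element_id}: only {count} points, need at least {minimum}"]
    return []


# Collect errors instead of raising, so validation can continue.
errors = []
errors += check_point_count("region1", "0,0 10,0 10,10", MIN_COORDS_POINTS)
errors += check_point_count("line1-baseline", "0,5 10,5", MIN_BASELINE_POINTS)
```

Here `errors` ends up with a single entry for `region1`, and the baseline passes.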

> Looks great insofar as I can follow. I would really appreciate tests. We have some samples with ReadingOrder but they are very basic. Can we provide better test data? @tboenig

IIUC the main obstacle for GT activities is the lack of OCR-D/ocrd_segment#32, which in turn is blocked by missing progress on #443. In other words: we need a way to automatically fix most of the gazillions of GT errors now reported by the existing (already merged) coordinate validator before we can move on.

@bertsky bertsky force-pushed the page-validator-text-consistency-order branch from f54526e to 8ca626f Compare May 10, 2020 16:51
Member

kba commented May 12, 2020

Looks good and PRImA-Research-Lab/PAGE-XML#23 shows the benefits. I will merge tomorrow unless @tboenig objects.

Member

kba commented May 14, 2020

@bertsky say the word and I'll merge :)

Collaborator Author

bertsky commented May 14, 2020

> say the word and I'll merge :)

I'd say it's not perfect yet, especially 6bf98d0: if PRImA-Research-Lab/PAGE-XML#23 turns out to reveal that PAGE coordinates are all "pixel-center" rather than "pixel-below-right", we would have a lot of work going over all processors. But it is much better than before anyway. So let's merge for now and revisit once a verdict on coordinate semantics has been reached.

@kba kba merged commit ca06435 into OCR-D:master May 27, 2020
@bertsky bertsky deleted the page-validator-text-consistency-order branch May 27, 2020 10:19