PAGE validator: respect reading/textline order/direction #442
…el text consistency checks
Codecov Report
@@            Coverage Diff             @@
##           master     #442      +/-   ##
==========================================
- Coverage   81.93%   79.83%   -2.10%
==========================================
  Files          39       40       +1
  Lines        2314     2440     +126
  Branches      426      463      +37
==========================================
+ Hits         1896     1948      +52
- Misses        346      408      +62
- Partials       72       84      +12
Looks great insofar as I can follow. I would really appreciate tests. We have some samples with ReadingOrder but they are very basic. Can we provide better test data? @tboenig
@kba the last commit makes validation return a negative result instead of raising an exception when coords have fewer than 4 points or baselines have fewer than 2.
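To illustrate the behaviour described in the comment above, here is a minimal sketch of a point-count check that reports a problem as a message rather than raising. The names `parse_points` and `check_points` are hypothetical and are not the validator's actual API; the thresholds are passed in by the caller.

```python
from typing import List, Tuple


def parse_points(points: str) -> List[Tuple[int, ...]]:
    """Parse a PAGE-style points string such as "0,0 10,0 10,10"."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def check_points(element_id: str, points: str, min_points: int) -> List[str]:
    """Return a list of error messages (empty if fine) instead of raising.

    Hypothetical helper: the real validator may structure its report differently.
    """
    try:
        parsed = parse_points(points)
    except ValueError:
        # Malformed input becomes a negative result, not an exception.
        return [f"{element_id}: malformed points '{points}'"]
    if len(parsed) < min_points:
        return [f"{element_id}: {len(parsed)} points, need at least {min_points}"]
    return []
```

With the thresholds mentioned in the comment, a caller would use `check_points(region_id, coords, 4)` for coords and `check_points(line_id, baseline, 2)` for baselines.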
IIUC the main obstacle for GT activities is the lack of OCR-D/ocrd_segment#32 which itself is due to missing progress on #443. In other words: We need a way to fix most of the gazillions of GT errors now reported by the existing (already merged) coordinate validator automatically before we can move on.
f54526e to 8ca626f
Looks good and PRImA-Research-Lab/PAGE-XML#23 shows the benefits. I will merge tomorrow unless @tboenig objects.
@bertsky say the word and I'll merge :)
I'd say it's not perfect now, esp. 6bf98d0 if PRImA-Research-Lab/PAGE-XML#23 does turn out to reveal that PAGE coordinates are all "pixel-center" not "pixel-below-right" (in which case we would have a lot of work going over all processors), but much better than before, anyway. So let's revisit when a verdict on coordinate semantics has been found (but merge for now).
(in multi-level text consistency checks)
This takes existing ideas from ocrd_tesserocr.recognize.page_update_higher_textequiv_levels() into the validator (only slightly more general).
This should be considered a (late) follow-up on OCR-D/assets#16. In particular, one should verify the refined rules look better on real-world GT which contains (concatenated) subregions, non-default @textLineOrder and non-default @readingDirection, and then re-visit the existing textual consistency rules laid out in the spec.
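A minimal sketch of the idea, not the code in this PR: compare each element's text with the concatenation of its children's texts, reversing the children when a non-default order is declared. The `Node` class, the `REVERSED` set, the separators and the rule that `right-to-left` and `bottom-to-top` mean "reverse the stored order" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Sequence, Tuple

# Assumption: these declared orders mean the children are stored in reverse.
REVERSED = {"right-to-left", "bottom-to-top"}


@dataclass
class Node:
    """Hypothetical stand-in for a region, line or word."""
    text: str
    children: List["Node"] = field(default_factory=list)
    order: str = ""  # stands in for @textLineOrder or @readingDirection


def expected_text(node: Node, separator: str) -> str:
    """Concatenate the children's texts in the declared order."""
    kids = node.children
    if node.order in REVERSED:
        kids = list(reversed(kids))
    return separator.join(kid.text for kid in kids)


def inconsistencies(
    node: Node, separators: Sequence[str] = ("\n", " ", ""), level: int = 0
) -> Iterator[Tuple[Node, str]]:
    """Yield (node, expected) wherever a node's text differs from its children's."""
    if not node.children:
        return
    expected = expected_text(node, separators[level])
    if node.text != expected:
        yield node, expected
    for kid in node.children:
        yield from inconsistencies(kid, separators, level + 1)


# A region whose lines are stored bottom-up but whose text reads top-down.
region = Node("first\nsecond", [Node("second"), Node("first")], order="bottom-to-top")
```

Here `list(inconsistencies(region))` is empty, whereas the same region without the declared order would be reported as inconsistent, which is the kind of false alarm that respecting the order avoids.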