Rewrite #33
Conversation
Great to see Kraken support going!
Wonder if the blla code couldn't give us the hierarchy (region-line mapping) directly...
Also, is there some (region) reading order detection anywhere in Kraken?
And here's a crop from the first example again, this time after 232a055 (enlarging regions to avoid extruding lines) and using PageViewer with PRImA-Research-Lab/prima-page-viewer#18 to show the baselines. Perhaps @mittagessen should comment on what we are seeing here. Is this what we should expect from the blla segmenter, or are these results caused by bugs in kraken or bad wrapping in ocrd_kraken?
I guess it's just (very) crappy output caused by a combination of the default model being trained only on handwritten text and the high line count on the page, which tends to cause the line merging you're seeing. We changed the postprocessing a few weeks ago; it is a bit more sensitive to low-confidence detections but solved a number of other rather annoying problems. Retraining the default model at a larger resolution than now should largely resolve the problem and is fairly high up on my todo list. Nothing wrong on your part. Try manuscripts next time ;). I'd also gladly take high line-count (>30) print datasets to incorporate into the default model training data.
Thanks @mittagessen for these explanations. That makes me wonder whether you train on (fixed-size) crops/tiles or on full images (see here for a study of which tiling options work best), and whether you account for different pixel densities (see here for how the Qurator team deals with this). (Using Mask-RCNN I used to get problems when mixing books and newspapers, esp. when DPI varied...) Could you please elaborate on your take regarding these aspects?

As to datasets, how about PubLayNet (very large, but modern/synthetic), and the datasets listed here? (EDIT: Some of these would need additional effort to get to text lines; or could you train the regions exclusively with blla?)

@kba maybe in light of this, it makes most sense to also wrap both the blla and legacy line detectors in a region2line mode (or …)
> Could you please elaborate on your take regarding these aspects?
The net is trained on full pages with a normalized page height, by default 1200px, to keep the memory consumption below 5GB for the method. As mentioned, the line merging disappears for all but the craziest material (rotuli, maps, some inscriptions, newspapers) when increasing this to 1600px (~33% more line separation at net scale). That's been a design decision since the first iteration of the method (U-Net), as even with the standard tiling techniques we never got completely rid of border effects. The current method is more or less a ReNet, which might perform even worse with the reduced context tiling provides, but I haven't evaluated it extensively. It is on our radar though, as wanting to be able to process the crazy stuff is our shtick.

We haven't encountered any issues relating to scale as described by the Qurator people. Anything between 75dpi and 600dpi+ seems to work reasonably well in the same model, even if not trained on that resolution and with different input heights of the model. I'd guess that's largely because even low-resolution scans are at worst only ~50% smaller than the 'native' input size, so resizing effects are rather modest.
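The height/line-separation trade-off can be illustrated with a bit of arithmetic (a sketch with hypothetical numbers; a dense page of 30 lines is assumed, and uniform line spacing):

```python
def line_separation(net_height: int, lines_per_page: int) -> float:
    """Approximate vertical pixels available per text line after the
    page has been resized to the network's normalized input height."""
    return net_height / lines_per_page

# A dense page with 30 lines:
sep_1200 = line_separation(1200, 30)  # 40.0 px per line at 1200px input
sep_1600 = line_separation(1600, 30)  # ~53.3 px per line at 1600px input
increase = sep_1600 / sep_1200 - 1    # 0.333... i.e. the ~33% mentioned above
```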
> As to datasets, how about [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet) (very large, but modern/synthetic)
Unfortunately, that one doesn't have the baselines necessary to train the segmenter.

I'll sift through the others in a bit, thx.
Yes, and it has more issues. Unfortunately, they did not publish the method or data of their PDF-XML alignment, so all we could do was post-process. But...

...is that even an option?

Also, there are quite a few more (not yet properly listed) under https://github.com/cneud/ocr-gt/issues
Yes, you can train only regions or only lines (or a subset of types of either). The code actually supports multi-model inference for segmentation as well, so you'd be able to mix and match models for your particular use case. Of course, with great flexibility comes great potential for blowing one's foot off.
BTW, about the reading order question above: it's a bit complicated, as the segmenter is designed to allow detection of non-textual regions such as stamps. Thus, regions are by definition unordered, but textual regions (anything that contains lines) are treated as dummy lines for the purpose of determining the reading order (L = line, R = region). That will change mid-term though, as we need a more capable reading order thingy for the Semitic abjads and parallel texts and such. That one will have a more explicit ordering with additional semantics attached.
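A minimal sketch of how such a "regions as dummy lines" ordering could work (hypothetical code, not kraken's actual API; the tuple layout and the top-edge sort key are my assumptions):

```python
def reading_order(elements):
    """elements: list of ('L', y) top-level lines or ('R', y, [child line
    y-coordinates]) textual regions.  Regions participate in the top-level
    sort as if they were single dummy lines, then expand into their own
    internally ordered lines."""
    ordered = sorted(elements, key=lambda e: e[1])  # regions sort like lines
    flat = []
    for e in ordered:
        if e[0] == 'L':
            flat.append(('L', e[1]))
        else:  # textual region: recurse into its lines
            flat.extend(('L', y) for y in sorted(e[2]))
    return flat

page = [('L', 10), ('R', 40, [60, 45]), ('L', 120)]
print(reading_order(page))  # [('L', 10), ('L', 45), ('L', 60), ('L', 120)]
```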
Ok, that's what I do as well in my Ocropy fork – only that I use recursive X-Y cut for region segmentation/grouping. So, @kba, we should try to wrap this functionality for PAGE here, too.

Am I right to assume you plan to do that with some neural modelling, @mittagessen?
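For reference, recursive X-Y cut can be sketched in a few lines (a simplified illustration, not the Ocropy fork's actual implementation; boxes are assumed to be (x0, y0, x1, y1) tuples):

```python
def xy_cut(boxes, axis=1, tried_other=False):
    """Order boxes by recursively splitting at the first whitespace gap in
    the projection profile, alternating cut axis (1 = y first, 0 = x)."""
    if len(boxes) <= 1:
        return list(boxes)
    lo, hi = (0, 2) if axis == 0 else (1, 3)
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    end = intervals[0][1]
    for s, e in intervals[1:]:
        if s > end:  # whitespace gap: partition and recurse on the other axis
            first = [b for b in boxes if b[hi] <= end]
            rest = [b for b in boxes if b[hi] > end]
            return xy_cut(first, 1 - axis) + xy_cut(rest, 1 - axis)
        end = max(end, e)
    if not tried_other:  # no gap on this axis: try the other axis once
        return xy_cut(boxes, 1 - axis, True)
    return sorted(boxes, key=lambda b: b[1])  # indivisible: top-to-bottom
```

E.g. two columns come out left column top-to-bottom, then right column, which is the usual Latin-script reading order.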
> Am I right to assume you plan to do that with some neural modelling, @mittagessen?

When you have a hammer, everything looks like a nail, so yes. I had some basic code for a graph NN orderer, but which features to actually use is quite unknown.
Co-authored-by: Robert Sachunsky <[email protected]>
This reverts commit b85e147.
I had to revert the conditional binary input for the segmenter, because it looks like binarization always produces better results for blla.mlmodel. @mittagessen could you please comment?
Do you have an example of the material you're testing on? We don't really do binarization anymore as it breaks most degraded manuscripts, so the model wasn't even evaluated (nor trained) on it. Would be good to see what exactly is happening.
Oh, in that case... I have compared with/without SBB binarization (using the latest model) on this material.
Huh, interesting. Those look fairly similar to the stuff in cBAD, but if the binarization is excellent it could give a boost in accuracy. In any case, I wouldn't force inputs to be binarized, but if there's a good one available and you get better results there's no reason not to use it.
In that case, we should make the choice dependent on the workflow – by passing an empty feature selector/filter when blla is used. I'll revert once again. @kba, if you could look into the CI permissions problem?
(Leave the selector/filter empty, so it depends on the workflow: If binarization is available, it will get used.) This reverts commit 0eecf6c.
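The effect of an empty selector/filter can be sketched as follows (a simplified stand-in for the actual image-selection logic in OCR-D core's `Workspace.image_from_page`; the function name and data layout here are hypothetical): with an empty selector, the most derived image wins, so a binarized version is used when the workflow produced one, and the raw image otherwise.

```python
def pick_image(derived_images, feature_selector='', feature_filter=''):
    """derived_images: (name, feature list) pairs, ordered raw -> most
    derived.  Return the most derived image whose features include all
    selected features and none of the filtered ones."""
    wanted = set(f for f in feature_selector.split(',') if f)
    unwanted = set(f for f in feature_filter.split(',') if f)
    best = None
    for name, features in derived_images:
        feats = set(features)
        if wanted <= feats and not (unwanted & feats):
            best = name  # later (more derived) matches override earlier ones
    return best

images = [('raw.png', []), ('bin.png', ['binarized'])]
pick_image(images)                              # -> 'bin.png'
pick_image(images, feature_filter='binarized')  # -> 'raw.png'
```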
CI is working again; there was an issue with the deployment key, and some minor typos and missing models.
I'd say this should be merged for now.
Documentation (I guess you meant README.md?) would be nice of course.
Features like reading order, better extraction of the region-line hierarchy, more efficient decoding etc. can be tracked in dedicated issues.
BTW, one thing we could also add is model URLs in the ocrd-tool.json for segmentation and for recognition (especially with the new models from UB Mannheim). (We could even make the suffix …)
@kba I did all of the above and fixed the CI again (with a workaround for this new problem in core). Now ready for merging AFAICS.
Rewrite the segmentation and add recognition with support for the upcoming kraken 3.0
TODO