More specifics on AlternativeImage processing #116
Comments
The above-mentioned issue in ocrd_tesserocr has now been closed (because a first proof-of-concept implementation has been merged there), but the discussion of the open problems, and of adding detail to the spec, should continue. @wrznr Do you think I should copy my argumentation here, or can this be continued in the closed issue?
Meanwhile, another related issue came up: Now that we have the possibility of implicit output file groups in METS-XML via the derived images referenced in PAGE-XML, no workspace engine will be able to know what side effects a processing step can have. For example, if one processor writes to Thus, these file groups must be made explicit so they can be validated (in the usual way).
This off-topic point has since resurfaced in another issue, where it was decided to indeed change the specification so that all new derived images are placed in the output fileGrp (alongside the PAGE-XML). However, the original issue above remains pressing: the spec has to reflect the problem of AlternativeImage coordinate consistency and its solution in core. This would entail:
Yet another open problem has surfaced: When a processor changes coordinates of some existing segment, it must also remove all existing derived images for that segment, because they will be invalid. This includes the following cases:
But this may also remove the only result of some previous workflow step (like binarization or denoising). Workflows would then need to redo those steps afterwards, and workflow writers must be aware of this. To that end, consuming processors should ideally fail immediately when they cannot find the derived images they expect. But that only happens when the implementors choose to use image feature selection/filtering (which is still optional). For modules like ocrd_tesserocr, it is not easy to decide which image features should be present. Some workflows might want to include a binarization step, while others might want to use Tesseract's internal binarization. Also, at the processor level, it cannot be decided whether
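The two behaviours discussed in this comment (dropping derived images when a segment's coordinates change, and failing early when an expected derived image is missing) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the OCR-D core API: `Segment`, `DerivedImage`, `set_coords` and `select_image` are hypothetical names, features are modelled as plain strings such as `binarized`, and the last entry in the list is assumed to be the most recent derived image.

```python
from dataclasses import dataclass, field


@dataclass
class DerivedImage:
    """Hypothetical stand-in for an AlternativeImage entry."""
    filename: str
    features: frozenset  # e.g. {"binarized", "cropped"}, as parsed from @comments


@dataclass
class Segment:
    """Hypothetical stand-in for a PAGE-XML segment (region, line, word, ...)."""
    id: str
    coords: list  # polygon as [(x, y), ...]
    derived: list = field(default_factory=list)  # assumed ordered oldest to newest


def set_coords(segment, new_coords):
    """Change the polygon and drop derived images, which no longer match it."""
    if new_coords != segment.coords:
        segment.coords = new_coords
        segment.derived.clear()


def select_image(segment, required=frozenset(), forbidden=frozenset()):
    """Return the newest derived image with all `required` and no `forbidden` features.

    Raises LookupError when features were required but no image qualifies,
    so that a missing earlier workflow step is reported immediately.
    Returns None when nothing was required and nothing matches.
    """
    required, forbidden = frozenset(required), frozenset(forbidden)
    for image in reversed(segment.derived):
        if required <= image.features and not (forbidden & image.features):
            return image
    if required:
        raise LookupError(
            f"segment {segment.id}: no derived image with features {sorted(required)}"
        )
    return None
```

With this sketch, a consumer that asks for `required={"binarized"}` after a resegmentation step has called `set_coords` gets a `LookupError` instead of silently working on the wrong image. Whether a given processor should require a feature at all remains a workflow-level decision, as noted above.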
The spec should be more specific about how `AlternativeImage` must be used. There are issues of coordinate reproducibility and disambiguation, and we need another `@comments` class `rescaled`. See here for a full description. (I don't want to transfer the issue here.)