More specifics on AlternativeImage processing #116

Open
bertsky opened this issue Jun 26, 2019 · 5 comments
@bertsky
Collaborator

bertsky commented Jun 26, 2019

The spec should be more specific about how AlternativeImage must be used. There are issues of coordinate reproducibility and disambiguation, and we need an additional @comments class, rescaled. See here for a full description. (I don't want to transfer the issue here.)

@bertsky
Collaborator Author

bertsky commented Jul 4, 2019

The above-mentioned issue in ocrd_tesserocr has now been closed (because a first proof-of-concept implementation has been merged there), but the discussion of the open problems, and of adding detail to the spec, should be continued.

@wrznr Do you think I should copy my argumentation here, or can this be continued in the closed issue?

@wrznr
Contributor

wrznr commented Jul 4, 2019

@bertsky Neither. We should continue the discussion here as soon as @cneud and @kba are available. But there is no need for copying.

@bertsky
Collaborator Author

bertsky commented Aug 28, 2019

Meanwhile, another related issue came up: Now that we have the possibility of implicit output file groups in METS-XML via the derived images referenced in PAGE-XML, no workspace engine will be able to know what side effects a processing step can have. For example, if one processor writes to OCR-D-IMG-BIN, and another does too, there can be conflicts – especially with parallel execution, but even with sequential execution.

Thus, these file groups must be made explicit so they can be validated (in the usual way).
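To illustrate what such a validation could look like once output file groups are declared explicitly, here is a minimal sketch. It assumes each workflow step is described as a pair of a processor name and the list of file groups it writes to; the function name `find_output_conflicts` and the step names are hypothetical and not part of any OCR-D API.

```python
from collections import defaultdict


def find_output_conflicts(steps):
    """Return every file group that more than one step writes to.

    `steps` is a list of (processor_name, [output_file_group, ...]) pairs.
    This representation is an assumption made for the sketch.
    """
    writers = defaultdict(list)
    for name, groups in steps:
        for group in groups:
            writers[group].append(name)
    # Only groups with two or more writers are conflicts.
    return {group: names for group, names in writers.items() if len(names) > 1}


if __name__ == "__main__":
    workflow = [
        ("binarize-a", ["OCR-D-BIN", "OCR-D-IMG-BIN"]),
        ("binarize-b", ["OCR-D-BIN2", "OCR-D-IMG-BIN"]),
    ]
    print(find_output_conflicts(workflow))
```

With implicit file groups, the second entry of each list would be invisible to such a check, which is the problem described above.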

@bertsky
Collaborator Author

bertsky commented Jun 23, 2020

> Meanwhile, another related issue came up: Now that we have the possibility of implicit output file groups in METS-XML via the derived images referenced in PAGE-XML, no workspace engine will be able to know what side effects a processing step can have. For example, if one processor writes to OCR-D-IMG-BIN, and another does too, there can be conflicts – especially with parallel execution, but even with sequential execution.

This off-topic point has since resurfaced in another issue, where it was decided to change the specification so that all new derived images are placed in the output fileGrp (alongside the PAGE-XML).

However, the above original issue remains pressing: we have to reflect the problem of AlternativeImage coordinate consistency and its solution in core within the spec.

This would entail:

  • stating the general problem briefly (to clearly motivate these strict and elaborate requirements), but detailing aspects like reshaping during rotation, center of rotation, multi-level rotation, splitting @orientation into reflection vs rotation
  • upgrading the AlternativeImage/@comments classes ("image features") to mandatory
  • extending them appropriately to all needed features
  • explaining their interpretation in detail (including the difference between level-local features like deskewed and inherited features like binarized)
  • adding the principle of appending to AlternativeImage (not replacing/inserting), and appending its @comments (not starting empty but keeping all features of the image data it was derived from and then adding the new features)
  • addressing/discussing the open problems of down/up-scaling and dewarping
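The appending principle from the list above could be sketched roughly as follows. This is only an illustration, not the implementation in core: it assumes PAGE-XML without namespaces, assumes that @comments holds a comma-separated list of features, and assumes the new image is derived from the last existing AlternativeImage. The function name `append_alternative_image` is hypothetical.

```python
import xml.etree.ElementTree as ET


def append_alternative_image(segment, filename, new_features):
    """Append (never replace) an AlternativeImage that keeps inherited features."""
    existing = segment.findall("AlternativeImage")
    inherited = []
    if existing:
        # Assumption: the new image was derived from the most recent one.
        inherited = [f for f in existing[-1].get("comments", "").split(",") if f]
    # Keep all inherited features, then add the new ones without duplicates.
    features = inherited + [f for f in new_features if f not in inherited]
    image = ET.Element("AlternativeImage", filename=filename,
                       comments=",".join(features))
    # Insert directly after the last existing AlternativeImage sibling.
    children = list(segment)
    index = max((i for i, child in enumerate(children)
                 if child.tag == "AlternativeImage"), default=-1) + 1
    segment.insert(index, image)
    return image


if __name__ == "__main__":
    page = ET.fromstring(
        '<Page><AlternativeImage filename="a.png" comments="cropped"/><Border/></Page>')
    append_alternative_image(page, "b.png", ["binarized"])
    print(ET.tostring(page, encoding="unicode"))
```

The earlier image stays in place, so consumers can still pick the less processed version if that is what they need.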

@bertsky
Collaborator Author

bertsky commented Oct 31, 2020

Yet another open problem has surfaced: When a processor changes coordinates of some existing segment, it must also remove all existing derived images for that segment, because they will be invalid. This includes the following cases:

  • whenever overwriting a Page's Border, or a Border's Coords or Coords/@points,
    remove all the Page's derived images with cropped,
  • whenever overwriting Region's or TextLine's or Word's or Glyph's Coords or Coords/@points,
    remove all its derived images,
  • whenever overwriting Page's or Region's @orientation,
    remove all its derived images with deskewed.
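The three cases above can be expressed as a small helper. The sketch below is hypothetical: it again assumes namespace-free PAGE-XML and comma-separated features in @comments, and the `change` labels ("border", "coords", "orientation") are names invented for this example.

```python
import xml.etree.ElementTree as ET


def invalidate_derived_images(segment, change):
    """Remove the AlternativeImage children made invalid by `change`.

    change == "border":      Page's Border changed, drop images with `cropped`
    change == "coords":      segment's Coords changed, drop all images
    change == "orientation": @orientation changed, drop images with `deskewed`
    """
    def has(image, feature):
        return feature in image.get("comments", "").split(",")

    if change == "coords":
        def is_stale(image):
            return True
    elif change == "border":
        def is_stale(image):
            return has(image, "cropped")
    elif change == "orientation":
        def is_stale(image):
            return has(image, "deskewed")
    else:
        raise ValueError("unknown kind of change: %r" % change)

    removed = [img for img in segment.findall("AlternativeImage") if is_stale(img)]
    for image in removed:
        segment.remove(image)
    return removed
```

Returning the removed elements lets the caller log which earlier results (for example a binarized image) were lost and have to be recomputed.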

But this may also remove the only result of some previous workflow step (like binarization or denoising). So workflows would need to redo those steps afterwards, and workflow writers must be aware of this. To that end, consuming processors should ideally fail immediately when they cannot find the derived images they expected. But that only happens when the implementors choose to use image feature selection/filtering (which is still optional).
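As a rough illustration of failing fast through feature selection/filtering, a consumer could look for the most recently appended derived image that has all required features and none of the unwanted ones, and raise an error otherwise. This is a standalone sketch under the same assumptions as before (no namespaces, comma-separated @comments); `select_derived_image` and its parameters are hypothetical and not the API of core.

```python
import xml.etree.ElementTree as ET


def select_derived_image(segment, required=(), forbidden=()):
    """Return the filename of the last AlternativeImage matching the features.

    Raises LookupError instead of silently falling back to another image.
    """
    required, forbidden = set(required), set(forbidden)
    # Search from the most recently appended image backwards.
    for image in reversed(segment.findall("AlternativeImage")):
        features = {f for f in image.get("comments", "").split(",") if f}
        if required <= features and not (features & forbidden):
            return image.get("filename")
    raise LookupError(
        "no derived image with features %s (excluding %s)"
        % (sorted(required), sorted(forbidden)))
```

A processor that calls this with `required=["binarized"]` would then stop right after a re-cropping step that removed the binarized image, rather than quietly continuing on the raw image.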

For modules like ocrd_tesserocr it's not easy to decide which image features should be present. Some workflows might want to include a binarization step, while others might want to use Tesseract's internal binarization. Also, at the processor level, it cannot be decided whether denoised is a required feature. (But still, if the workflow included a denoising step earlier it would be quite surprising if no denoised images are available after re-cropping, for example.)
