Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File Group USE syntax too strict? #123

Closed
mikegerber opened this issue Aug 6, 2019 · 14 comments
Closed

File Group USE syntax too strict? #123

mikegerber opened this issue Aug 6, 2019 · 14 comments
Assignees

Comments

@mikegerber
Copy link

The spec mandates ("MUST") the use of one of IMG/GT/OCR/COR as a WORKFLOW_STEP in the USE attribute of a mets:fileGrp. This seems overly strict in these use cases:

  • [seuretm/ocrd_typegroups_classifier] produces a font identification estimate for e.g. pages or in the future lines
  • My own tool (WIP) produces OCR evaluation results for GT vs OCR

While I could put both into the OCR "WORKFLOW_STEP" they are not really "OCR produced from image" and so I'd like to have more freedom with the USE attribute.

Ideas:

  • Allow some kind of OTHER WORKFLOW_STEP
  • Downgrade the naming convention to a SHOULD instead of a MUST
  • workspace validate should at most issue a warning, not an error

(I've only done some superficial research on whether this stuff above belongs into <mets:structMap TYPE="PHYSICAL"> in the first place. I think it does, as they are metadata that pertain to physical pages.)

@bertsky
Copy link
Collaborator

bertsky commented Aug 6, 2019

(I've only done some superficial research on whether this stuff above belongs into <mets:structMap TYPE="PHYSICAL"> in the first place. I think it does, as they are metadata that pertain to physical pages.)

IIUC, every (image or PAGE) file that represents one complete page does belong there. (Thus, PAGE output will always have to be added there as well, while images only on the page level, not if they merely belong to some lower hierarchy level's AlternativeImage.

Regarding the relaxation of OCR-D mets:fileGrp/@USE conventions from MUST to SHOULD: I concur!

Additional examples: ocrd_keraslm produces differently rated text (TextEquiv/@conf) or even removes TextEquiv alternatives that are not on the best path – what USE would that be in the current scheme? And, more generally, as argued earlier, the more complicated a workflow configuration gets, the more one will be forced to differentiate USE descriptions, anyway.

@cneud cneud self-assigned this Aug 8, 2019
@mikegerber
Copy link
Author

mikegerber commented Nov 6, 2019

Another issue with the naming convention: I like to use USE names similiar to what @bertsky uses in his workflow: e.g. OCR-D-IMG-BINPAGE-sauvola-SEG-LINE-sbb for a line segmentation using the method developed at SBB on a Sauvola binarization or OCR-D-OCR-TESS-frk+deu-BINPAGE-sauvola-SEG-LINE-sbb for an OCR using Tesseract's frk+deu models on the line segmentation using the method developed at SBB on a Sauvola binarization.

Spec seems overly – and needlessy – strict with its naming convention and does not cover these "advanced" use cases with a lot of configurations to disambiguate.

Workaround is to use ocrd workspace validate -s mets_file_group_names mets.xml. (I use validate a lot so I encounter this stuff every day.)

@kba
Copy link
Member

kba commented Nov 6, 2019

I agree, the naming convention is way too elaborate. We should soften the language and validator. @cneud @wrznr

@cneud
Copy link
Member

cneud commented Nov 6, 2019

I too can see the need to relax the file group USE naming convention, as long as we ensure/are aware that:

a) there are no two OCR-D processors writing to same file group USE, potentially overwriting results from another OCR-D processor

b) OCR-D processors further down the workflow chain know which file group USE to take as input (this one is basically fine since we anyway demand explicit invocation)

@mikegerber
Copy link
Author

a) there are no two OCR-D processors writing to same file group USE, potentially overwriting results from another OCR-D processor

I personally only used a default output file group by accident. I'm not even sure if any processor uses something else than OUTPUT by default.

In other words: Is there anyone who doesn't use explicit -I and -O parameters?

@kba
Copy link
Member

kba commented Nov 6, 2019

In other words: Is there anyone who doesn't use explicit -I and -O parameters?

Not deliberately I think. With multiple output groups that can happen

@bertsky
Copy link
Collaborator

bertsky commented Nov 6, 2019

With multiple output groups that can happen

Yes, and it happens all the time, even between steps competing for the same (fallback) image file groups.

For example, ocrd-cis-ocropy-dewarp will usually create images under OCR-D-IMG-DEWARP, regardless of wether it received its input from deskewed or undeskewed images, from resegmented or rectangular line coords etc. It uses workspace.save_image_file() for its secondary output, so the existing image files will be silently overwritten. Other processors might use workspace.add_file() directly (without the force option), so they might raise an exception.

What we have here are side effects causing irreproducable workflows. I have already brought this up here, but at that time image file groups were completely implicit. Now that we mostly have the opt-in for a second output file group, the burden is only shifted to the user. (Also, you cannot directly see which file groups belong together anymore, and even processor validation would have a hard time validating image filegroups in the input and output.)

Since we are already talking about relaxing the strict naming scheme here, how about the following (radical) solution: Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page). This would effectively render all OCR-D-IMG-* conventions obsolete (except for the original OCR-D-IMG itself).

@bertsky
Copy link
Collaborator

bertsky commented Nov 6, 2019

(BTW, the new explicit image file groups have not been documented or described in the tool.json anywhere yet. See OCR-D/ocrd_tesserocr#61.)

@EEngl52
Copy link

EEngl52 commented Apr 6, 2020

@cneud can this be closed?

@cneud
Copy link
Member

cneud commented Apr 6, 2020

@EEngl52 Not yet, this will require a decision and according revision of spec.

@kba
Copy link
Member

kba commented Apr 6, 2020

The restriction on file group syntax has been relaxed since 3.4.2 (MUST -> SHOULD). So the initial problem has been adressed.

However,

Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page).

That would be a reasonable approach. @cneud @VolkerHartmann @tboenig What do you think?

About the multiple ouput file groups: Would it make se thense to make input/output file group parameters mandatory and validate that processors consuming/producing multiple groups should also be passed multiple groups? Users tend to get confused about the multi-value-with-comma-syntax, having explicit help messages when the syntax or cardinality is wrong, might be more helpful than having the information only in the CLI spec.

@VolkerHartmann
Copy link
Contributor

I agree that static default values for input and output file groups may be confusing.
Multiple output file groups are always hard to handle, as it is not mandatory that the first output file group contain the PAGE xml. In the description, there is (usually?) no indication which group contains which data.
Writing all in one file group may be easier to handle. Therefore, we have to change the specification because now all files of a file group must have the same mimetype.
A better solution may be:
We should make the input/output file groups mandatory with defaults for the postfix of additional output file groups.
E.g. OCR-D-DEWARP --> OCR-D-DEWARP (first group must contain PAGE), OCR-D-DEWARP**-IMG** (images)
Doing so the ocrd-tool.json file have to contain the postfixes for the additional output file groups.
If we solve it this way, there is at least no confusion with multiple output groups.
For the input file groups number/range of possible file groups would be awesome. E.g.: 1, 2 or 2-5
With a short description for each file group.

Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page)

I wouldn't do so.

  1. Contradicts our specification (one mimetype per group)
  2. In case you want to know how the result for a page was generated, you simply have to copy all files listed in the structMap without parsing all PAGE files.

@bertsky
Copy link
Collaborator

bertsky commented Jun 17, 2020

Although this proposal has since been agreed upon, for the sake of completeness, allow me to address this response:

I agree that static default values for input and output file groups may be confusing.

This is not about implicit defaults for secondary output file groups. This has already been working fine.

Multiple output file groups are always hard to handle, as it is not mandatory that the first output file group contain the PAGE xml. In the description, there is (usually?) no indication which group contains which data.

This is also not about which output fileGrp position has what meaning. Implementors and users have been coping well with that, too.

Writing all in one file group may be easier to handle.

Not all files referenced by the PAGE will be in one fileGrp, only the newly added derived images. (AlternativeImages from previous steps may still be relevant further down, and it does not make sense to copy them into the new fileGrp each time.)

Therefore, we have to change the specification because now all files of a file group must have the same mimetype.

Exactly, at this point, a provision of the specification has to be abandoned – this is a breaking change. (However, judging from the processors I have seen, not much code has ever relied on the assumption that a fileGrp can have only one MIME type.)

A better solution may be:
We should make the input/output file groups mandatory with defaults for the postfix of additional output file groups.
E.g. OCR-D-DEWARP --> OCR-D-DEWARP (first group must contain PAGE), OCR-D-DEWARP**-IMG** (images)
Doing so the ocrd-tool.json file have to contain the postfixes for the additional output file groups.
If we solve it this way, there is at least no confusion with multiple output groups.

Using suffixes on the PAGE fileGrp instead of the OCR-D-IMG-* convention could reduce some of the conflicts, but as soon as you try to support --overwrite or selective --page-id it becomes difficult. Besides, my proposal simplifies both processor implementations and workflow configurations (because output fileGrps would usually become single-valued again). However, the suffix solution does not – except if you make this completely implicit again, which bears other dangers. (What if a user does not expect/know that convention, and re-uses the same suffix for her own purposes?)

Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page)

  • Contradicts our specification (one mimetype per group)

Changes the specification, yes (see above). There are no actual contradictions though (i.e. nothing becomes inconsistent).

  • In case you want to know how the result for a page was generated, you simply have to copy all files listed in the structMap without parsing all PAGE files.

IIUC you are referring to the question of whether or not derived images should be in the physical structMap. This is actually an independent decision: I thought that only allowing the PAGE in the structMap (except for the input / OCR-D-IMG) from now on would be more fitting to the "derived-image-in-output-filegrp" scheme. But one could also argue, as you do, that referencing all derived images along with the original image and PAGE under the physical page does help you find all parts that belong to a page (under any fileGrp).
also argue that

@kba
Copy link
Member

kba commented Jul 13, 2020

Was closed in a commit, if there is still discussion to be done, feel free to reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants