File Group USE syntax too strict? #123
Comments
IIUC, every (image or PAGE) file that represents one complete page does belong there. (Thus, PAGE output will always have to be added there as well, while images only on the page level, not if they merely belong to some lower hierarchy level's Regarding the relaxation of OCR-D Additional examples: |
Another issue with the naming convention: I like to use USE names similar to what @bertsky uses in his workflow: e.g. Spec seems overly – and needlessly – strict with its naming convention and does not cover these "advanced" use cases with a lot of configurations to disambiguate. Workaround is to use |
I too can see the need to relax the file group USE naming convention, as long as we ensure/are aware that: a) there are no two OCR-D processors writing to the same file group USE, potentially overwriting results from another OCR-D processor b) OCR-D processors further down the workflow chain know which file group USE to take as input (this one is basically fine since we anyway demand explicit invocation) |
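Condition a) above can be checked mechanically before a workflow runs. The following is only a minimal sketch: the `(processor, output_groups)` workflow representation and all processor and file group names are hypothetical, not part of any OCR-D API.

```python
from collections import defaultdict


def find_output_collisions(steps):
    """Return {file_group: [processors]} for groups written by more than one step."""
    writers = defaultdict(list)
    for processor, output_groups in steps:
        for group in output_groups:
            writers[group].append(processor)
    # Keep only the groups that have several writers.
    return {group: procs for group, procs in writers.items() if len(procs) > 1}


# Hypothetical workflow; names are illustrative only.
workflow = [
    ("binarize", ["OCR-D-BIN"]),
    ("segment", ["OCR-D-SEG"]),
    ("recognize", ["OCR-D-SEG"]),  # writes to the same group as "segment"
]
collisions = find_output_collisions(workflow)
```

Here `collisions` maps `OCR-D-SEG` to both steps that write to it, which is the overwrite risk described in a).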
I personally only used a default output file group by accident. I'm not even sure if any processor uses something else than In other words: Is there anyone who doesn't use explicit |
Not deliberately I think. With multiple output groups that can happen |
Yes, and it happens all the time, even between steps competing for the same (fallback) image file groups. For example, What we have here are side effects causing irreproducible workflows. I have already brought this up here, but at that time image file groups were completely implicit. Now that we mostly have the opt-in for a second output file group, the burden is only shifted to the user. (Also, you cannot directly see which file groups belong together anymore, and even processor validation would have a hard time validating image filegroups in the input and output.) Since we are already talking about relaxing the strict naming scheme here, how about the following (radical) solution: Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page). This would effectively render all |
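To make the proposal above concrete, here is a rough sketch of what such a METS could look like, built with Python's standard `xml.etree.ElementTree`. The file group name, file IDs and page ID are made up for illustration, the structMap is simplified to a single page div, and `mets:FLocat` children are omitted for brevity.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS}}}{tag}"


mets = ET.Element(q("mets"))
file_sec = ET.SubElement(mets, q("fileSec"))

# One file group holding both the PAGE output and its derived image
# (group name and IDs are hypothetical).
grp = ET.SubElement(file_sec, q("fileGrp"), USE="OCR-D-BIN")
ET.SubElement(grp, q("file"), ID="OCR-D-BIN_0001",
              MIMETYPE="application/vnd.prima.page+xml")
ET.SubElement(grp, q("file"), ID="OCR-D-BIN_0001.IMG-BIN",
              MIMETYPE="image/png")

# Only the PAGE file is linked to the page in the physical structMap.
struct_map = ET.SubElement(mets, q("structMap"), TYPE="PHYSICAL")
page = ET.SubElement(struct_map, q("div"), TYPE="page", ID="PHYS_0001")
ET.SubElement(page, q("fptr"), FILEID="OCR-D-BIN_0001")

# Files in the group that no fptr points to, i.e. the derived images.
linked = {fptr.get("FILEID") for fptr in mets.iter(q("fptr"))}
unlinked = [f.get("ID") for f in grp.iter(q("file")) if f.get("ID") not in linked]
```

In this sketch `unlinked` contains only the derived image, so a consumer could tell whole-page files from derived images by structMap membership rather than by file group.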
(BTW, the new explicit image file groups have not been documented or described in the tool.json anywhere yet. See OCR-D/ocrd_tesserocr#61.) |
@cneud can this be closed? |
The restriction on file group syntax has been relaxed since 3.4.2 ( However,
That would be a reasonable approach. @cneud @VolkerHartmann @tboenig What do you think? About the multiple output file groups: Would it make sense to make input/output file group parameters mandatory and validate that processors consuming/producing multiple groups are also passed multiple groups? Users tend to get confused about the multi-value-with-comma syntax; explicit help messages when the syntax or cardinality is wrong might be more helpful than having the information only in the CLI spec. |
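A cardinality check with explicit messages could look roughly like this. It is only a sketch under assumptions: `parse_file_groups`, its `expected` count and the `option` label are hypothetical helpers, not existing OCR-D functions, and how a processor would declare its expected number of groups is left open.

```python
def parse_file_groups(value, expected, option):
    """Split a comma-separated file group argument and check its cardinality."""
    groups = [g.strip() for g in value.split(",")] if value else []
    if any(not g for g in groups):
        raise ValueError(
            f"{option}: empty file group name in {value!r}; "
            "separate groups with single commas, e.g. A,B"
        )
    if len(groups) != expected:
        raise ValueError(
            f"{option}: expected {expected} file group(s) but got "
            f"{len(groups)} ({value!r}); pass multiple groups as one "
            "comma-separated value, e.g. A,B"
        )
    return groups


# Hypothetical call for a processor that writes a PAGE group and an image group.
groups = parse_file_groups("OCR-D-SEG,OCR-D-IMG-CROP", expected=2, option="output file group")
```

Passing a single group where two are expected raises a `ValueError` whose message names the option and shows the comma syntax, instead of failing later in the processor.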
I agree that static default values for input and output file groups may be confusing.
I wouldn't do so.
|
Although this proposal has since been agreed upon, for the sake of completeness, allow me to address this response:
This is not about implicit defaults for secondary output file groups. This has already been working fine.
This is also not about which output fileGrp position has what meaning. Implementors and users have been coping well with that, too.
Not all files referenced by the PAGE will be in one fileGrp, only the newly added derived images. (AlternativeImages from previous steps may still be relevant further down, and it does not make sense to copy them into the new fileGrp each time.)
Exactly, at this point, a provision of the specification has to be abandoned – this is a breaking change. (However, judging from the processors I have seen, not much code has ever relied on the assumption that a fileGrp can have only one MIME type.)
Using suffixes on the PAGE fileGrp instead of the
Changes the specification, yes (see above). There are no actual contradictions though (i.e. nothing becomes inconsistent).
IIUC you are referring to the question of whether or not derived images should be in the physical structMap. This is actually an independent decision: I thought that only allowing the PAGE in the structMap (except for the input / |
Was closed in a commit, if there is still discussion to be done, feel free to reopen. |
The spec mandates ("MUST") the use of one of IMG/GT/OCR/COR as a WORKFLOW_STEP in the USE attribute of a mets:fileGrp. This seems overly strict in these use cases:
While I could put both into the OCR "WORKFLOW_STEP" they are not really "OCR produced from image" and so I'd like to have more freedom with the USE attribute.
Ideas:
workspace validate should at most issue a warning, not an error
(I've only done some superficial research on whether this stuff above belongs into <mets:structMap TYPE="PHYSICAL"> in the first place. I think it does, as they are metadata that pertain to physical pages.)
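The "warning, not an error" idea can be sketched as follows. The regular expression is an assumed simplification of the convention described in this issue (an `OCR-D-` prefix, one of IMG/GT/OCR/COR, and an optional suffix), and `check_file_group_use` is a hypothetical helper, not the actual validator.

```python
import re
import warnings

# Assumed simplification of the naming convention; not the normative grammar.
STRICT_USE = re.compile(r"^OCR-D-(IMG|GT|OCR|COR)(-.+)?$")


def check_file_group_use(use):
    """Return True if `use` matches the strict pattern; warn instead of raising otherwise."""
    if STRICT_USE.match(use):
        return True
    warnings.warn(
        f"fileGrp USE {use!r} does not follow the recommended naming convention",
        UserWarning,
    )
    return False
```

Validation then still reports non-conforming names, but a workspace using something like `OCR-D-SEG-LINE` remains usable.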