File Group USE syntax too strict? #123

mikegerber · 2019-08-06T10:30:08Z

The spec mandates ("MUST") the use of one of IMG/GT/OCR/COR as a WORKFLOW_STEP in the USE attribute of a mets:fileGrp. This seems overly strict in these use cases:

[seuretm/ocrd_typegroups_classifier] produces a font identification estimate for e.g. pages or in the future lines
My own tool (WIP) produces OCR evaluation results for GT vs OCR

While I could put both into the OCR "WORKFLOW_STEP" they are not really "OCR produced from image" and so I'd like to have more freedom with the USE attribute.

Ideas:

Allow some kind of OTHER WORKFLOW_STEP
Downgrade the naming convention to a SHOULD instead of a MUST
workspace validate should at most issue a warning, not an error

(I've only done some superficial research on whether this stuff above belongs into <mets:structMap TYPE="PHYSICAL"> in the first place. I think it does, as they are metadata that pertain to physical pages.)

The text was updated successfully, but these errors were encountered:

bertsky · 2019-08-06T14:05:15Z

(I've only done some superficial research on whether this stuff above belongs into <mets:structMap TYPE="PHYSICAL"> in the first place. I think it does, as they are metadata that pertain to physical pages.)

IIUC, every (image or PAGE) file that represents one complete page does belong there. (Thus, PAGE output will always have to be added there as well, while images only on the page level, not if they merely belong to some lower hierarchy level's AlternativeImage.

Regarding the relaxation of OCR-D mets:fileGrp/@USE conventions from MUST to SHOULD: I concur!

Additional examples: ocrd_keraslm produces differently rated text (TextEquiv/@conf) or even removes TextEquiv alternatives that are not on the best path – what USE would that be in the current scheme? And, more generally, as argued earlier, the more complicated a workflow configuration gets, the more one will be forced to differentiate USE descriptions, anyway.

mikegerber · 2019-11-06T16:00:52Z

Another issue with the naming convention: I like to use USE names similiar to what @bertsky uses in his workflow: e.g. OCR-D-IMG-BINPAGE-sauvola-SEG-LINE-sbb for a line segmentation using the method developed at SBB on a Sauvola binarization or OCR-D-OCR-TESS-frk+deu-BINPAGE-sauvola-SEG-LINE-sbb for an OCR using Tesseract's frk+deu models on the line segmentation using the method developed at SBB on a Sauvola binarization.

Spec seems overly – and needlessy – strict with its naming convention and does not cover these "advanced" use cases with a lot of configurations to disambiguate.

Workaround is to use ocrd workspace validate -s mets_file_group_names mets.xml. (I use validate a lot so I encounter this stuff every day.)

kba · 2019-11-06T16:18:45Z

I agree, the naming convention is way too elaborate. We should soften the language and validator. @cneud @wrznr

cneud · 2019-11-06T16:26:13Z

I too can see the need to relax the file group USE naming convention, as long as we ensure/are aware that:

a) there are no two OCR-D processors writing to same file group USE, potentially overwriting results from another OCR-D processor

b) OCR-D processors further down the workflow chain know which file group USE to take as input (this one is basically fine since we anyway demand explicit invocation)

mikegerber · 2019-11-06T16:37:56Z

a) there are no two OCR-D processors writing to same file group USE, potentially overwriting results from another OCR-D processor

I personally only used a default output file group by accident. I'm not even sure if any processor uses something else than OUTPUT by default.

In other words: Is there anyone who doesn't use explicit -I and -O parameters?

kba · 2019-11-06T17:23:31Z

In other words: Is there anyone who doesn't use explicit -I and -O parameters?

Not deliberately I think. With multiple output groups that can happen

bertsky · 2019-11-06T18:17:43Z

With multiple output groups that can happen

Yes, and it happens all the time, even between steps competing for the same (fallback) image file groups.

For example, ocrd-cis-ocropy-dewarp will usually create images under OCR-D-IMG-DEWARP, regardless of wether it received its input from deskewed or undeskewed images, from resegmented or rectangular line coords etc. It uses workspace.save_image_file() for its secondary output, so the existing image files will be silently overwritten. Other processors might use workspace.add_file() directly (without the force option), so they might raise an exception.

What we have here are side effects causing irreproducable workflows. I have already brought this up here, but at that time image file groups were completely implicit. Now that we mostly have the opt-in for a second output file group, the burden is only shifted to the user. (Also, you cannot directly see which file groups belong together anymore, and even processor validation would have a hard time validating image filegroups in the input and output.)

Since we are already talking about relaxing the strict naming scheme here, how about the following (radical) solution: Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page). This would effectively render all OCR-D-IMG-* conventions obsolete (except for the original OCR-D-IMG itself).

bertsky · 2019-11-06T18:20:53Z

(BTW, the new explicit image file groups have not been documented or described in the tool.json anywhere yet. See OCR-D/ocrd_tesserocr#61.)

EEngl52 · 2020-04-06T15:28:18Z

@cneud can this be closed?

cneud · 2020-04-06T15:32:13Z

@EEngl52 Not yet, this will require a decision and according revision of spec.

kba · 2020-04-06T16:51:12Z

The restriction on file group syntax has been relaxed since 3.4.2 (MUST -> SHOULD). So the initial problem has been adressed.

However,

Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page).

That would be a reasonable approach. @cneud @VolkerHartmann @tboenig What do you think?

About the multiple ouput file groups: Would it make se thense to make input/output file group parameters mandatory and validate that processors consuming/producing multiple groups should also be passed multiple groups? Users tend to get confused about the multi-value-with-comma-syntax, having explicit help messages when the syntax or cardinality is wrong, might be more helpful than having the information only in the CLI spec.

VolkerHartmann · 2020-04-07T06:30:52Z

I agree that static default values for input and output file groups may be confusing.
Multiple output file groups are always hard to handle, as it is not mandatory that the first output file group contain the PAGE xml. In the description, there is (usually?) no indication which group contains which data.
Writing all in one file group may be easier to handle. Therefore, we have to change the specification because now all files of a file group must have the same mimetype.
A better solution may be:
We should make the input/output file groups mandatory with defaults for the postfix of additional output file groups.
E.g. OCR-D-DEWARP --> OCR-D-DEWARP (first group must contain PAGE), OCR-D-DEWARP**-IMG** (images)
Doing so the ocrd-tool.json file have to contain the postfixes for the additional output file groups.
If we solve it this way, there is at least no confusion with multiple output groups.
For the input file groups number/range of possible file groups would be awesome. E.g.: 1, 2 or 2-5
With a short description for each file group.

Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page)

I wouldn't do so.

Contradicts our specification (one mimetype per group)
In case you want to know how the result for a page was generated, you simply have to copy all files listed in the structMap without parsing all PAGE files.

bertsky · 2020-06-17T21:48:04Z

Although this proposal has since been agreed upon, for the sake of completeness, allow me to address this response:

I agree that static default values for input and output file groups may be confusing.

This is not about implicit defaults for secondary output file groups. This has already been working fine.

Multiple output file groups are always hard to handle, as it is not mandatory that the first output file group contain the PAGE xml. In the description, there is (usually?) no indication which group contains which data.

This is also not about which output fileGrp position has what meaning. Implementors and users have been coping well with that, too.

Writing all in one file group may be easier to handle.

Not all files referenced by the PAGE will be in one fileGrp, only the newly added derived images. (AlternativeImages from previous steps may still be relevant further down, and it does not make sense to copy them into the new fileGrp each time.)

Therefore, we have to change the specification because now all files of a file group must have the same mimetype.

Exactly, at this point, a provision of the specification has to be abandoned – this is a breaking change. (However, judging from the processors I have seen, not much code has ever relied on the assumption that a fileGrp can have only one MIME type.)

A better solution may be:
We should make the input/output file groups mandatory with defaults for the postfix of additional output file groups.
E.g. OCR-D-DEWARP --> OCR-D-DEWARP (first group must contain PAGE), OCR-D-DEWARP**-IMG** (images)
Doing so the ocrd-tool.json file have to contain the postfixes for the additional output file groups.
If we solve it this way, there is at least no confusion with multiple output groups.

Using suffixes on the PAGE fileGrp instead of the OCR-D-IMG-* convention could reduce some of the conflicts, but as soon as you try to support --overwrite or selective --page-id it becomes difficult. Besides, my proposal simplifies both processor implementations and workflow configurations (because output fileGrps would usually become single-valued again). However, the suffix solution does not – except if you make this completely implicit again, which bears other dangers. (What if a user does not expect/know that convention, and re-uses the same suffix for her own purposes?)

Image files are always placed under the same file group as the PAGE output, only without adding them to the pageId in the physical structMap (because they don't represent the whole page)

Contradicts our specification (one mimetype per group)

Changes the specification, yes (see above). There are no actual contradictions though (i.e. nothing becomes inconsistent).

In case you want to know how the result for a page was generated, you simply have to copy all files listed in the structMap without parsing all PAGE files.

IIUC you are referring to the question of whether or not derived images should be in the physical structMap. This is actually an independent decision: I thought that only allowing the PAGE in the structMap (except for the input / OCR-D-IMG) from now on would be more fitting to the "derived-image-in-output-filegrp" scheme. But one could also argue, as you do, that referencing all derived images along with the original image and PAGE under the physical page does help you find all parts that belong to a page (under any fileGrp).
also argue that

kba · 2020-07-13T15:37:55Z

Was closed in a commit, if there is still discussion to be done, feel free to reopen.

cneud self-assigned this Aug 8, 2019

bertsky mentioned this issue Jun 23, 2020

More specifics on AlternativeImage processing #116

Open

bertsky mentioned this issue Jul 9, 2020

Ocropy processors: put derived images under output fileGrp cisocrgroup/ocrd_cis#57

Merged

kba closed this as completed in f723bf3 Jul 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File Group USE syntax too strict? #123

File Group USE syntax too strict? #123

mikegerber commented Aug 6, 2019

bertsky commented Aug 6, 2019

mikegerber commented Nov 6, 2019 •

edited

Loading

kba commented Nov 6, 2019

cneud commented Nov 6, 2019

mikegerber commented Nov 6, 2019

kba commented Nov 6, 2019

bertsky commented Nov 6, 2019

bertsky commented Nov 6, 2019

EEngl52 commented Apr 6, 2020

cneud commented Apr 6, 2020

kba commented Apr 6, 2020

VolkerHartmann commented Apr 7, 2020

bertsky commented Jun 17, 2020

kba commented Jul 13, 2020

File Group USE syntax too strict? #123

File Group USE syntax too strict? #123

Comments

mikegerber commented Aug 6, 2019

bertsky commented Aug 6, 2019

mikegerber commented Nov 6, 2019 • edited Loading

kba commented Nov 6, 2019

cneud commented Nov 6, 2019

mikegerber commented Nov 6, 2019

kba commented Nov 6, 2019

bertsky commented Nov 6, 2019

bertsky commented Nov 6, 2019

EEngl52 commented Apr 6, 2020

cneud commented Apr 6, 2020

kba commented Apr 6, 2020

VolkerHartmann commented Apr 7, 2020

bertsky commented Jun 17, 2020

kba commented Jul 13, 2020

mikegerber commented Nov 6, 2019 •

edited

Loading