[feature request] Processor interface #58

stweil · 2024-09-11T18:07:31Z

While using ocrd-fileformat-transform, I was thinking about some modifications which I'd like to discuss:

Wouldn't ocrd-transform be sufficient as name instead of the current lengthy one?
`from-to` is a strange parameter name. I see no technical need to inherit the specific needs of ocr-fileformat here and would prefer two parameters, `from` and `to`. Are there use cases where `from` cannot be omitted? Technically it should be possible to derive it for all entries in the input file group, and typically it will simply be PAGE. So most users will only have to provide a `to` parameter or, if it defaults to ALTO, no parameter at all.


bertsky · 2025-02-11T12:09:39Z

> Wouldn't ocrd-transform be sufficient as name instead of the current lengthy one?

That would be too generic IMO. Might even be something that core itself will provide in the future.

We follow the pattern that processor names are indicative of their respective repository.

> `from-to` is a strange parameter name. I see no technical need to inherit the specific needs of ocr-fileformat here and would prefer two parameters, `from` and `to`. Are there use cases where `from` cannot be omitted? Technically it should be possible to derive it for all entries in the input file group, and typically it will simply be PAGE. So most users will only have to provide a `to` parameter or, if it defaults to ALTO, no parameter at all.

The reason we merged ocr-transform's two arguments into one was simply that we can enforce selection of a supported combination during parameter validation.
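To illustrate the validation argument, here is a minimal sketch, not code from ocrd_fileformat. The list of supported pairs and both function names are hypothetical and chosen only for the example. It shows that a single enumerated `from-to` value can be checked with a plain membership test, whereas two independent parameters can each be valid while their combination is not.

```python
# Hypothetical subset of supported conversions, for illustration only.
SUPPORTED_FROM_TO = ["page alto", "page hocr", "alto page", "hocr page"]


def validate_combined(params):
    """Single enum parameter: an unsupported pair fails a simple membership check."""
    value = params.get("from-to")
    if value not in SUPPORTED_FROM_TO:
        raise ValueError(f"unsupported conversion: {value!r}")
    source, target = value.split(" ")
    return source, target


def validate_separate(params):
    """Two independent enums: each value is checked on its own, not as a pair."""
    sources = {pair.split(" ")[0] for pair in SUPPORTED_FROM_TO}
    targets = {pair.split(" ")[1] for pair in SUPPORTED_FROM_TO}
    return params["from"] in sources and params["to"] in targets


# "hocr" is a valid source and "alto" a valid target in this sketch,
# so per-parameter validation passes although the pair is not supported.
print(validate_separate({"from": "hocr", "to": "alto"}))
print(validate_combined({"from-to": "page alto"}))
```

With separate parameters, rejecting the unsupported pair would need an extra cross-parameter check beyond per-value enumeration.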

Guessing the input type from the fileGrp is not going to be robust:

- MIME type is insufficient, because
  - ALTO can be text/xml or application/xml+alto
  - ALTO namespace versions would need to be retrieved from the actual files, anyway
  - AFAIK there is no standard MIME type for abbyy, so it would co-occur with ALTO (text/xml)
  - same for hocr, which would just be text/html I guess
  - gcv and textract would both be application/json
- Looking at the files themselves, how to deal with non-homogeneity (multiple types/versions in the same group)?
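The MIME type ambiguity can be sketched as a lookup table. This is an illustration only: the mapping simply restates the MIME types listed above (including the guessed ones), and the function names are hypothetical.

```python
# MIME type -> formats that may be stored under it, as listed above.
MIME_TO_FORMATS = {
    "text/xml": ["alto", "abbyy"],
    "application/xml+alto": ["alto"],
    "text/html": ["hocr"],
    "application/json": ["gcv", "textract"],
}


def candidate_formats(mimetype):
    """Return every format that could plausibly use this MIME type."""
    return MIME_TO_FORMATS.get(mimetype, [])


def is_ambiguous(mimetype):
    """True if the MIME type alone cannot identify the input format."""
    return len(candidate_formats(mimetype)) > 1


for mimetype in MIME_TO_FORMATS:
    print(mimetype, candidate_formats(mimetype), is_ambiguous(mimetype))
```

Even where the lookup yields a single candidate, it says nothing about the format version (for example the ALTO namespace), which would still require opening the files.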

We also have the important question of backwards compatibility.

Thus I advise against any such changes.

kba · 2025-02-12T12:23:05Z

I understand that the design decisions of naming processors and parameters might not be ideal, but they are at least consistent. Changing those now would break existing workflows without offering a functional improvement.

@stweil I am curious: What are you using ocrd_fileformat for? IIUC you're mostly using tesseract and kraken directly @UB-Mannheim, both of which support different output formats. I only use ocr(d)_fileformat for plain text output and, very rarely, to try out something with hOCR. For the bulk of conversions (PAGE to ALTO), we're using the processor of https://github.com/OCR-D/page-to-alto directly, which is much faster.
