[feature request] Processor interface #58

Open
stweil opened this issue Sep 11, 2024 · 2 comments

Comments

@stweil
Contributor

stweil commented Sep 11, 2024

While using ocrd-fileformat-transform, I was thinking about some modifications which I'd like to discuss:

  • Wouldn't ocrd-transform be sufficient as a name instead of the current lengthy one?
  • from-to is a strange parameter name. I see no technical need to inherit the specific needs of ocr-fileformat here and would prefer two parameters, from and to. Are there use cases where from cannot be omitted? Technically it should be possible to derive it for all entries in the input file group, and typically it will simply be PAGE. So most users will only have to provide a to parameter or – if it defaults to ALTO – no parameter at all.
@bertsky
Collaborator

bertsky commented Feb 11, 2025

  • Wouldn't ocrd-transform be sufficient as a name instead of the current lengthy one?

That would be too generic IMO. Might even be something that core itself will provide in the future.

We follow the pattern that processor names are indicative of their respective repository.

  • from-to is a strange parameter name. I see no technical need to inherit the specific needs of ocr-fileformat here and would prefer two parameters, from and to. Are there use cases where from cannot be omitted? Technically it should be possible to derive it for all entries in the input file group, and typically it will simply be PAGE. So most users will only have to provide a to parameter or – if it defaults to ALTO – no parameter at all.

The reason we merged ocr-transform's two arguments into one was simply that we can enforce selection of a supported combination during parameter validation.
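
To illustrate the validation argument, here is a minimal Python sketch. The format names and the list of supported pairs are a hypothetical subset chosen for illustration, not the processor's actual list, and both helper functions are invented for this example. It only shows why a single enumerated value can reject an unsupported pair, while two independently validated values cannot.

```python
# Hypothetical subset of supported conversions (illustration only,
# not the processor's real list).
SUPPORTED_FROM_TO = ("page alto", "alto page", "page hocr", "abbyy hocr")

# With two separate parameters, each one can only be checked on its own.
KNOWN_FORMATS = ("page", "alto", "hocr", "abbyy")


def is_valid_merged(from_to):
    """A single enumerated parameter: the pair itself is validated."""
    return from_to in SUPPORTED_FROM_TO


def is_valid_separate(from_format, to_format):
    """Two enumerated parameters: each value is validated independently."""
    return from_format in KNOWN_FORMATS and to_format in KNOWN_FORMATS


if __name__ == "__main__":
    # "hocr abbyy" is not in the hypothetical list of supported pairs.
    print(is_valid_merged("hocr abbyy"))
    # Both values are known formats, so per-parameter checks let it through.
    print(is_valid_separate("hocr", "abbyy"))
```

With separate parameters, the unsupported combination would only be caught later, at run time, unless additional cross-parameter checks were added.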

Guessing the input type from the fileGrp is not going to be robust:

  • MIME type is insufficient, because
    • ALTO can be text/xml or application/alto+xml
    • ALTO namespace versions would need to be retrieved from the actual files, anyway
    • AFAIK there is no standard MIME type for abbyy, so it would co-occur with ALTO (text/xml)
    • same for hocr, which would just be text/html I guess
    • gcv and textract would both be application/json
  • looking at the files themselves, how to deal with non-homogeneity (multiple types/versions in the same group)?
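
The ambiguity listed above can be sketched in a few lines of Python. The mapping below only encodes the MIME type overlaps mentioned in this comment; the function name and the exact table are assumptions made for illustration, and the sketch ignores ALTO namespace versions entirely, which would need the file contents.

```python
# MIME types and the formats that could hide behind them, as described
# in the list above (illustrative table, not an official registry).
CANDIDATES_BY_MIME = {
    "text/xml": {"alto", "abbyy"},
    "text/html": {"hocr"},
    "application/json": {"gcv", "textract"},
}


def guess_input_format(mimetypes):
    """Return the one format all files agree on, or None if undecidable.

    `mimetypes` is the list of MIME types of the files in one file group.
    """
    common = None
    for mimetype in mimetypes:
        candidates = CANDIDATES_BY_MIME.get(mimetype, set())
        # Keep only the formats that every file seen so far could be.
        common = candidates if common is None else common & candidates
    if common is not None and len(common) == 1:
        return next(iter(common))
    return None


if __name__ == "__main__":
    print(guess_input_format(["text/html", "text/html"]))  # unambiguous
    print(guess_input_format(["text/xml"]))                # alto or abbyy
    print(guess_input_format(["text/html", "text/xml"]))   # mixed group
```

In this sketch only a pure text/html group resolves to a single format; text/xml, application/json and mixed groups all come back undecided, which is the robustness problem described above.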

We also have the important question of backwards compatibility.

Thus I advise against any such changes.

@kba
Member

kba commented Feb 12, 2025

I understand that the design decisions of naming processors and parameters might not be ideal, but they are at least consistent. Changing those now would break existing workflows without offering a functional improvement.

@stweil I am curious: What are you using ocrd_fileformat for? IIUC you're mostly using tesseract and kraken directly @UB-Mannheim, which have support for different output formats. I only use ocr(d)_fileformat for plain text output and, very rarely, to try out something with hOCR. For the bulk of conversions (PAGE to ALTO), we're using the processor of https://github.com/OCR-D/page-to-alto directly, which is much faster.
