ocrd-fileformat-transform does not add an ALTO Processing tag #35
Comments
(Alternatively, page-to-alto could add it, of course.) |
Can you provide an example of PAGE input and how you'd like to see it converted? page-to-alto should convert processing metadata, cf. https://github.com/kba/page-to-alto/blob/master/ocrd_page_to_alto/convert.py#L248-L265 |
Yes, it does convert the processing metadata correctly, but it does not add itself as a processing step. That would have been helpful, as I was investigating whether page-to-alto was used for the conversion by ocrd-fileformat-transform. Here is an example, converted using ocrd-fileformat-transform:
<Processing ID="ocrd-eynollah-segment-0">
<processingStepDescription>layout/segmentation/region</processingStepDescription>
<processingStepSettings>{"models": "/data/default", "dpi": "0", "full_layout": "True", "curved_line": "False", "allow_scaling": "False", "headers_off": "False"}</processingStepSettings>
<processingSoftware>
<softwareName>ocrd-eynollah-segment</softwareName>
</processingSoftware>
</Processing>
<Processing ID="ocrd-sbb-binarize-1">
<processingStepDescription>preprocessing/optimization/binarization</processingStepDescription>
<processingStepSettings>{"model": "/data/sbb_binarization/models", "operation_level": "page"}</processingStepSettings>
<processingSoftware>
<softwareName>ocrd-sbb-binarize</softwareName>
</processingSoftware>
</Processing>
<Processing ID="ocrd-tesserocr-recognize-2">
<processingStepDescription>layout/segmentation/region</processingStepDescription>
<processingStepSettings>{"model": "deu", "dpi": "0", "padding": "0", "segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": "False", "overwrite_text": "True", "shrink_polygons": "False", "block_polygons": "False", "find_tables": "True", "sparse_text": "False", "raw_lines": "False", "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": "{}", "xpath_parameters": "{}", "xpath_model": "{}", "auto_model": "False", "oem": "DEFAULT"}</processingStepSettings>
<processingSoftware>
<softwareName>ocrd-tesserocr-recognize</softwareName>
</processingSoftware>
</Processing>
Full PAGE + ALTO: |
What I would expect is an additional processing step like this (entirely made up):
<Processing ID="ocrd-fileformat-transform-3">
<processingStepDescription>conversion</processingStepDescription>
<processingStepSettings>{"backend": "page-to-alto"}</processingStepSettings>
<processingSoftware>
<softwareName>ocrd-fileformat-transform</softwareName>
</processingSoftware>
</Processing>
I know this is extra work, but it is very useful for answering the question of exactly how a file was created. |
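A minimal sketch of how such a step could be appended to an ALTO file after conversion, using only Python's standard library. This is not how ocrd-fileformat-transform or page-to-alto currently work; the function name `add_processing_step` and the ID scheme (processor name plus the count of existing `<Processing>` elements) are assumptions made for illustration, and the sketch assumes an ALTO version in which `<Processing>` is a direct child of `<Description>`.

```python
import json
import xml.etree.ElementTree as ET


def add_processing_step(root, software_name, description, settings):
    """Append a <Processing> element to the ALTO <Description> and return it.

    Hypothetical helper: the name and the ID scheme are illustrative only.
    """
    # Reuse whatever namespace the document's root element declares.
    ns = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""

    def q(name):
        return f"{{{ns}}}{name}" if ns else name

    desc = root.find(q("Description"))
    if desc is None:
        raise ValueError("ALTO document has no <Description> element")

    # Number the new step after the steps that are already recorded.
    index = len(desc.findall(q("Processing")))
    step = ET.SubElement(desc, q("Processing"), {"ID": f"{software_name}-{index}"})
    ET.SubElement(step, q("processingStepDescription")).text = description
    ET.SubElement(step, q("processingStepSettings")).text = json.dumps(settings)
    software = ET.SubElement(step, q("processingSoftware"))
    ET.SubElement(software, q("softwareName")).text = software_name
    return step
```

Called as `add_processing_step(root, "ocrd-fileformat-transform", "conversion", {"backend": "page-to-alto"})` on a file that already records the three steps above, this would yield a fourth step with the ID `ocrd-fileformat-transform-3`.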
Gotcha, yes this makes sense, at least for the OCR-D processor interface. |
I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step. |
I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way. Then in ocrd-fileformat-transform, we can fully inform about the processor and its options. (While in other use cases, we might want to hide the conversion.) |
But then again, doing this from page-to-alto or
These editing commands should be done by a true XML editor, like xmlstarlet. That would have to be added to the system dependencies. Perhaps one should even offer a parameter to make this postprocessing/annotation optional. |
It does processing, so why should it not add processing info? I think it's not correct to omit it. |
It could be implemented as an option: no processing info by default (= current behaviour) or add processing info if option was given. |
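The opt-in behaviour proposed here could look roughly like the following sketch. The flag name `--add-processing` is made up for illustration and is not an existing option of page-to-alto or ocrd-fileformat-transform; an OCR-D processor would more likely expose this as a processor parameter.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical converter command line."""
    parser = argparse.ArgumentParser(description="Hypothetical PAGE to ALTO converter wrapper")
    # Off by default, so the current behaviour (no extra <Processing> step) is kept.
    parser.add_argument(
        "--add-processing",
        action="store_true",
        help="record the conversion itself as a <Processing> step in the ALTO output",
    )
    return parser


def should_annotate(argv):
    """Return True only if the caller explicitly asked for the annotation."""
    return build_parser().parse_args(argv).add_processing
```

With this default, existing callers see no change, while users who care about provenance can request the extra step explicitly.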
I believe it would be helpful if the ocrd-fileformat-transform PAGE → ALTO transformation added a <Processing> tag. I looked into the file to figure out whether https://github.com/kba/page-to-alto was used for the conversion and did not find a processing tag for the conversion, just for segmentation/binarization/OCR.