No transformation from alto3.0 (from Tesseract 4.1.0) to hocr #95
Comments
The usage information printed after the error message lists the available transformations. Currently there are only alto2.0/alto2.1 to hocr transformations. Try changing the namespace of your file:
-<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
+<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
and then use
I will create an issue upstream to also support other ALTO versions in the transformations.
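For anyone who wants to script the namespace change suggested above, here is a minimal Python sketch. The function name `downgrade_alto_namespace` is hypothetical and not part of ocr-fileformat. It rewrites only the default `xmlns` declaration and leaves `xsi:schemaLocation` untouched, as in the diff. Whether the ALTO 2.x stylesheet then handles every ALTO 3.0 element is an assumption this thread does not confirm.

```python
import re

# Namespace URIs taken from the <alto> root elements quoted in this thread.
ALTO_V3_NS = "http://www.loc.gov/standards/alto/ns-v3#"
ALTO_V2_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def downgrade_alto_namespace(xml_text: str) -> str:
    """Replace the default ALTO v3 namespace declaration with the v2 one.

    Only the first `xmlns="..."` declaration is rewritten; prefixed
    declarations (xmlns:xlink, xmlns:xsi) and xsi:schemaLocation do not
    match the pattern and are left as they are.
    """
    pattern = 'xmlns="' + re.escape(ALTO_V3_NS) + '"'
    replacement = 'xmlns="' + ALTO_V2_NS + '"'
    return re.sub(pattern, replacement, xml_text, count=1)


# Small in-memory example; for a real file, read its text, pass it
# through the function, and write the result to a new file.
sample = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# '
    'http://www.loc.gov/alto/v3/alto-3-0.xsd">'
)
converted = downgrade_alto_namespace(sample)
```

This is a plain text substitution rather than an XML-aware rewrite, so the rest of the file stays byte-for-byte identical. It is a workaround only; later comments indicate that ALTO v3 and v4 are supported directly upstream.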
ALTO v3 & v4 should be supported now.
I saw the changes because they broke our test case in one PR. We therefore switched to a fixed commit instead of always using the newest version. But this is still on my todo list. Maybe I can do this now... The different alto*__hocr files look almost identical. I would therefore try to write a more general alto__hocr transformation that is applicable to ALTO files of different versions. Should I try to do that as a PR? We would then need to change some things here to integrate the new file names, but that can be done afterwards.
Copied from UB-Mannheim/ocr-fileformat#95.
I can confirm that the transformation of @jtlz2's file now works with the latest version: https://github.com/filak/hOCR-to-ALTO/blob/master/alto__hocr.xsl
cf #89
The ALTO file below was generated by Tesseract 4.1.0 and is described in its header as ALTO v3.0.
ocr-validate alto-3-0 alto.xml
is successful:
I would like to convert this file to hocr for use with hocr-tools, but ocr-transform throws an error when I try*:
ocr-transform alto hocr < alto.xml
(I have also tried various combinations of alto3.0/alto-3.0/alto-3-0...)
What am I doing wrong?
Here is the alto file:
(Image from here)