ALTO output: Missing <SP> tags between <String> tags #78
Can you provide sample data and how you ran the tool? |
I guess you output the ALTO files directly from ABBYY, because we don't yet provide a transformation from ABBYY to ALTO. Then this should be an example: https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/alto/417576986_0031.xml . The |
Yes, I'll try to find out if <SP> (=space) is really necessary between <String>s in ALTO. |
I guess that it still validates without the SP tags. Moreover, most of the information (HPOS, WIDTH) can be calculated from the line above and below, but if the width of a space is important for some application, then it might be easier to have this data directly. I don't know what the VPOS information for a space says or whether it is also determined by some other values. |
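The calculation mentioned in this comment could look roughly like the following Python sketch. It is not code from ocr-fileformat or any other tool in this thread; `add_sp_elements` and `local_name` are hypothetical helper names. It assumes that every String carries HPOS and WIDTH, that the gap runs from the right edge of one String to the left edge of the next, and, since the thread leaves the meaning of VPOS for a space open, it simply copies VPOS from the preceding String.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}String' down to 'String'."""
    return tag.rsplit("}", 1)[-1]


def add_sp_elements(text_line: ET.Element) -> ET.Element:
    """Return a copy of a TextLine with an SP between directly adjacent Strings."""
    result = ET.Element(text_line.tag, dict(text_line.attrib))
    children = list(text_line)
    for index, child in enumerate(children):
        result.append(child)
        following = children[index + 1] if index + 1 < len(children) else None
        if following is None:
            continue
        if local_name(child.tag) != "String" or local_name(following.tag) != "String":
            continue  # an SP or HYP is already there, nothing to add
        # The gap starts at the right edge of the preceding word ...
        gap_start = float(child.get("HPOS")) + float(child.get("WIDTH"))
        # ... and ends at the left edge of the next one (clamped to 0 for overlapping boxes).
        gap_width = max(float(following.get("HPOS")) - gap_start, 0.0)
        sp_tag = child.tag[: -len("String")] + "SP"  # keep the namespace, if any
        ET.SubElement(result, sp_tag, {
            "HPOS": str(gap_start),
            "WIDTH": str(gap_width),
            # Assumption: reuse the VPOS of the preceding String.
            "VPOS": child.get("VPOS", "0"),
        })
    return result


example = ET.fromstring(
    '<TextLine>'
    '<String HPOS="100" VPOS="50" WIDTH="40" HEIGHT="20" CONTENT="Hello"/>'
    '<String HPOS="150" VPOS="50" WIDTH="45" HEIGHT="20" CONTENT="world"/>'
    '</TextLine>'
)
print(ET.tostring(add_sp_elements(example), encoding="unicode"))
```

Whether values derived this way are good enough depends on the consumer; an OCR engine that knows the real gap can of course write more accurate numbers.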
In the ALTO 2.1 .xsd it looks like this:
<xsd:sequence maxOccurs="unbounded">
  <xsd:element name="String" type="StringType"/>
  <xsd:element name="SP" minOccurs="0"> ...
  </xsd:element>
</xsd:sequence>
So strictly speaking it seems that <SP> is not necessary, but the <sequence> seems to imply it. |
Not sure. I only see here, that, if |
Here is an ALTO file generated with Tesseract (see tesseract-ocr/tesseract#2067). Another page was processed by ABBYY Finereader. While ABBYY adds the I am not sure whether this is a bug of the DFG viewer (and Kitodo Presentation) or whether ALTO requires explicit tags for the whitespace between words. Perhaps @sebastian-meyer or @cneud know the answer? |
The ALTO documentation says "A TextBlock is divided into lines and those are divided into strings, spaces and hyphens". I don't interpret that as a strict requirement that spaces are required, and nor does the .xsd. It's clear that spaces are required if the strings are given without |
The ALTO spec itself needs to clarify this issue. |
Clemens has created an issue for that: altoxml/schema#54 (thank you). |
Thanks for flagging this, I will put it on the agenda for our next ALTO board call which will be held November 29th. |
To chip in, I've interpreted the standard that the |
If a |
Why? Whitespace is a character like any other and personally I would've taken the decision to encode it explicitly using
ALTO luckily allows overlapping elements, in contrast to PageXML. |
Then how would you encode two overlapping words if you are forced to put a |
Just have overlapping bounding boxes? Presumably there is still a reading order that determines the ordering of the |
Just to follow up - I'm afraid a quick resolution is not really around the corner... The issue was discussed in the last ALTO board call, with the core elements of the discussion summarized here. While the general feeling was that the use of An expansion of the If one really wants to be on the safe side, the quick solution right now would be to indeed include |
As a note, most of the character-based classification systems common at the time ALTO was originally specified didn't treat whitespace as a proper glyph, i.e. whitespace is just something bordered by other glyphs and is never seen by the classifier as such. This at least explains the existence of a separate |
Thank you, @cneud, @mittagessen and the ALTO board. As the current DFG viewer expects the |
The addition of the |
According to the ALTO XSD the SP tag is optional (minOccurs="0"), and I do not see a way to reliably calculate the HEIGHT/WIDTH/VPOS/HPOS attributes for the SP tag from the hOCR data. IMHO, proper handling of the optional SP tag should be fixed in the DFG viewer. |
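For illustration, tolerant handling on the consumer side might look like the Python sketch below. This is not the DFG viewer's actual code; `line_text` and `local_name` are hypothetical names, and the rule "two directly adjacent Strings are separated by one space" is an assumption about how a reader could treat a missing SP.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}String' down to 'String'."""
    return tag.rsplit("}", 1)[-1]


def line_text(text_line: ET.Element) -> str:
    """Join the words of a TextLine, whether or not SP elements are present."""
    parts = []
    previous = None
    for child in text_line:
        name = local_name(child.tag)
        if name == "String":
            # Assumption: no SP between two Strings still means a word boundary.
            if previous == "String":
                parts.append(" ")
            parts.append(child.get("CONTENT", ""))
        elif name == "SP":
            parts.append(" ")
        elif name == "HYP":
            parts.append(child.get("CONTENT", ""))
        else:
            continue  # ignore anything else in this sketch
        previous = name
    return "".join(parts)


with_sp = ET.fromstring(
    '<TextLine><String CONTENT="Hello"/><SP/><String CONTENT="world"/></TextLine>'
)
without_sp = ET.fromstring(
    '<TextLine><String CONTENT="Hello"/><String CONTENT="world"/></TextLine>'
)
print(line_text(with_sp))
print(line_text(without_sp))
```

With this rule, files written with and without SP yield the same text, so the producer does not have to invent coordinates for the spaces.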
If the This is what I've done in the DFG-Viewer styles now. Please have a look at the current master of the DFG-Viewer at test.dfg-viewer.de. Please compare the example from above in current master and in version 5.0 of DFG-Viewer and report change requests. |
@albig IMHO the second one seems better from user perspective - it is more readable/compact. |
@albig IMHO the spacing looks better now (in master), but the linebreaks seem a bit random... |
Perhaps this is not an error.
Kind regards,
J. Barth