Allow a String to contain alternative Glyph segmentation hypotheses #57
Comments
Is the intent that the glyphs, or partial glyphs, in this case are still able to be represented in Unicode? Font definitions sometimes use a Unicode replacement number for symbols that can't be represented. Could something similar be done for the CONTENT attribute? |
Thanks for the clarification. There has been interest in supporting multiple hypotheses in ALTO so this might be a path towards that goal. |
One approach that has come up in ALTO Board discussions, and was discussed at length at a recent face-to-face meeting, is to encode multiple hypotheses within a standard and interoperable lattice structure. There seems to be reluctance to add intricate XML to support a lattice implementation, and some feeling that the ALTO file should reference an external source for the lattice structure (see the very useful and relevant OCR-D issue on word segmentation ambiguity). For simplicity, here is an attempt to work through this example with a single, optional attribute called lattice. There are two functionally equivalent, and somewhat shorthand, formats for lattices that might fit into an attribute. One is called Python Lattice Format (PLT) and the other is JSON Lattice Format (JLT). This example might look something like this in JLT:
<String VPOS='3977' HPOS='2795' HEIGHT='118' WIDTH='157' WC='0.87' CONTENT='ever' LATTICE='[[["e",{"conf": 0.94},1]], [["v",{"conf": 0.85},1]],[["e",{"conf": 0.92},1], ["w",{"conf": 0.24}, 2]],[["r",{"conf": 0.78},1]]]'>
<!-- see above, I reversed the quoting for simplicity (which seems to be valid in XML, see https://stackoverflow.com/questions/6800467/quotes-in-xml-single-or-double) -->
</String>
With apologies to more heavy duty lattice formats (see, for example, Lattices in Kaldi), this syntax can be used with a tool like cicada to analyse and compare lattices, and, in this case, to produce a DOT output file, which can then be used with gravizo to get a graphic rendering.
This does not, in any way, eliminate the need for a construct like StringVariantType, which carries the WC value and allows easy access to variant forms of the word through XML proper, which I think is an important provision. But this type of approach might open the door to handing off lattice formats to lattice friendly software.
An attribute might not be the way to go on this, and it would make sense to allow the attribute to reference an external source even if a shorthand syntax was viable, but I like the idea that the same attribute could also conceivably be added to the SP element to support multiple spacing hypotheses, which has come up in the past and fits into the segmentation challenges of OCR. |
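As a rough sketch of what lattice friendly software could do with such an attribute, the following Python reads the LATTICE value from the String example and lists every path through it. Two things here are assumptions rather than anything settled in this thread: that the third item of each arc is the distance to the target node, and that multiplying the per-arc conf values is an acceptable path score. The function name enumerate_paths is made up for illustration.

```python
import json

# LATTICE attribute value copied from the String example in this comment.
LATTICE = ('[[["e",{"conf": 0.94},1]], [["v",{"conf": 0.85},1]],'
           '[["e",{"conf": 0.92},1], ["w",{"conf": 0.24}, 2]],'
           '[["r",{"conf": 0.78},1]]]')


def enumerate_paths(lattice, node=0):
    """Yield (text, score) for every path from `node` to the final node.

    Assumption: each arc is [label, features, distance], where `distance`
    is the number of nodes to skip ahead, and the final node is implicit.
    """
    if node == len(lattice):
        yield "", 1.0
        return
    for label, features, distance in lattice[node]:
        for rest, score in enumerate_paths(lattice, node + distance):
            # Assumption: path score is the product of arc confidences.
            yield label + rest, features["conf"] * score


paths = sorted(enumerate_paths(json.loads(LATTICE)),
               key=lambda p: p[1], reverse=True)
for text, score in paths:
    print(f"{text}\t{score:.4f}")
```

Under these assumptions the lattice holds two readings, "ever" and "evw", with the first scoring higher.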
Thanks @artunit for this excellent response. The lattice indeed contains some of the information required to indicate alternative glyph hypotheses, and could be used to express alternatives at other levels as well (spacing). However, I see several disadvantages to using lattices:
Of the advantages, the ones I retain are:
So, I remain far from convinced that lattices are the way to go. If they are added, I would see them as an addition, not a replacement to a more native XML structure for alternatives such as the |
And thanks @urieli for the equally excellent and thoughtful "response to the response". I agree totally with the continuing need for a construct like StringVariantType and to support complete encoding of OCR within a single XML file as much as possible. I guess the big question is how extensive the lattice structure should be. There's an interesting discussion in this paper about character-lattices but I am vague on how complex the modeling requirements are. From my very limited experience, it is easy to compare multiple JLT lattices in cicada and I could see a workflow that pulls lattices from ALTO files to look for commonalities, but I am hopeful we can surface what would be needed to accommodate real-world use cases. |
Please forgive my intrusion, but I think I can help with some outsider's perspective. Let me go back a little:
It seems to me the task here is not XML support for a lattice implementation (i.e. graph processing or FST library, which generally could not accommodate anything beyond strings and confidences/weights – no coordinates, no styles etc), but rather the opposite, a lattice extension for ALTO XML. This can be done with very little extra syntax (basically, representing both nodes and edges as elements, with I do not see any reason or benefit in referencing an external source for that, or using a binary representation via an If there is reluctance (as there is for PAGE), then it should be substantiated and discussed here IMHO.
I fail to see how. Each TextLine contains a sequence of Strings each possibly followed by SP. An alternative word segmentation constitutes a different sequence – that's one level higher. And we do not want (N) sequences here, but rather (more efficiently) a lattice. The same goes for a construct like So here is my proposal: In the <!-- still a sequence: -->
<xsd:choice maxOccurs="unbounded">
<!-- old representation: -->
<xsd:sequence>
<xsd:element name="String" type="StringType"/>
<xsd:element name="SP" type="SPType" minOccurs="0"/>
</xsd:sequence>
<!-- new representation: -->
<xsd:element name="Lattice" type="LatticeType"/>
</xsd:choice> with <xsd:complexType name="LatticeType">
<xsd:sequence>
<!-- re-use GlyphType for nodes (but allow it to be used for white space as well): -->
<xsd:element name="Glyph" type="GlyphType" maxOccurs="unbounded"/>
<!-- introduce a simple edge type: -->
<xsd:element name="Span" type="SpanType" minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="ID" type="xsd:ID" use="optional"/>
<!-- for convenience, summarize initial nodes (yes, plural here): -->
<xsd:attribute name="begin" type="xsd:IDREFS" use="required"/>
<!-- for convenience, summarize terminal nodes (yes, plural here): -->
<xsd:attribute name="end" type="xsd:IDREFS" use="required"/>
</xsd:complexType>
<xsd:complexType name="SpanType">
<!-- ID of incoming Glyph: -->
<xsd:attribute name="begin" type="xsd:IDREF" use="required"/>
<!-- ID of outgoing Glyph: -->
<xsd:attribute name="end" type="xsd:IDREF" use="required"/>
</xsd:complexType> So in contrast to my PAGE proposal, where text (and whitespace) segments are edges and nodes are merely positions, here text (and whitespace) segments are nodes and edges are merely connectors. The reason for the difference is that in PAGE we originally have a strict hierarchy with implicit whitespace. |
Thanks @bertsky, no need to apologize for weighing in on this. I totally agree that concerns need to be substantiated and discussed in an open forum. The use case we were presented with at the last Board meeting (two days ago) was a lattice structure where nodes correspond to locations in the line, and edges correspond to character hypotheses which, in turn, are annotated with an optical character match score and a language model score. I am under no illusions that I have very much lattice expertise, but I am hoping that @acpopat can provide some more details on the use case, especially if I have mangled my description of it, and how this might fit. My only experience with lattices is very simple work with cicada so I welcome more heavy duty perspectives. |
That's precisely my use case, too! (I am doing post-correction.) The OCR and LM scores can be added/multiplied with each other (with a given weight) and annotated under But maybe I should put in a full example (not just a schema sketch), like you did? |
Thanks @bertsky - a full example would be awesome! |
Sorry @artunit, I am afraid I ran into my own trap here. It's always better to start off with an example! I had to rewrite parts of the above: We do not want to re-use Okay, here's what schema instances could look like. Starting with your example above, this could become the following: <TextLine ID='...' HPOS='...' VPOS='...' WIDTH='...' HEIGHT='...'>
<String ID='s1' VPOS='...' HPOS='...' HEIGHT='...' WIDTH='...' WC='0.99' CONTENT='Did'/>
<SP ID='s2' VPOS='...' HPOS='...' HEIGHT='...' WIDTH='...'/>
<String ID='s3' VPOS='...' HPOS='...' HEIGHT='...' WIDTH='...' WC='0.95' CONTENT='you'/>
<SP ID='s4' VPOS='...' HPOS='...' HEIGHT='...' WIDTH='...'/>
<Lattice ID='s5' begin="g1" end="g4 g5">
<Glyph ID='g1' VPOS='3977' HPOS='2795' HEIGHT='118' WIDTH='40' GC='0.94' CONTENT='e'/>
<Glyph ID='g2' VPOS='3977' HPOS='2835' HEIGHT='118' WIDTH='40' GC='0.85' CONTENT='v'/>
<Glyph ID='g3' VPOS='3977' HPOS='2875' HEIGHT='118' WIDTH='40' GC='0.92' CONTENT='e'/>
<Glyph ID='g4' VPOS='3977' HPOS='2915' HEIGHT='118' WIDTH='37' GC='0.78' CONTENT='r'/>
<Glyph ID='g5' VPOS='3977' HPOS='2875' HEIGHT='118' WIDTH='77' GC='0.24' CONTENT='w'/>
<Span begin='g1' end='g2'/>
<Span begin='g2' end='g3'/>
<Span begin='g3' end='g4'/>
<Span begin='g2' end='g5'/>
</Lattice>
...
</TextLine> Now, let's do an example which also includes word segmentation ambiguity (the same I did for PAGE): <TextLine ID='...' HPOS='...' VPOS='...' WIDTH='...' HEIGHT='...'>
<Lattice ID='s1' begin="g1 g2" end="g10">
<Glyph ID='g1' VPOS='3977' HPOS='2795' HEIGHT='118' WIDTH='40' GC='0.9' CONTENT='m'/>
<Glyph ID='g2' VPOS='3977' HPOS='2795' HEIGHT='118' WIDTH='30' GC='0.75' CONTENT='n'>
<Variant CONTENT='r' VC='0.65'/>
</Glyph>
<Glyph ID='g3' VPOS='3977' HPOS='2825' HEIGHT='118' WIDTH='10' GC='0.9' CONTENT='i'>
<Variant CONTENT='r' VC='0.6'/>
</Glyph>
<Glyph ID='g4' VPOS='3977' HPOS='2835' HEIGHT='118' WIDTH='30' GC='0.9' CONTENT='y'>
<Variant CONTENT='v' VC='0.8'/>
</Glyph>
<!-- whitespace glyph: -->
<Glyph ID='g5' VPOS='3977' HPOS='2865' HEIGHT='118' WIDTH='15' GC='0.9' CONTENT=' '/>
<!-- whitespace plus comma glyph: -->
<Glyph ID='g6' VPOS='3977' HPOS='2865' HEIGHT='118' WIDTH='20' GC='0.8' CONTENT=' ,'/>
<Glyph ID='g7' VPOS='3977' HPOS='2880' HEIGHT='118' WIDTH='30' GC='0.9' CONTENT='p'/>
<Glyph ID='g8' VPOS='3977' HPOS='2885' HEIGHT='118' WIDTH='25' GC='0.9' CONTENT='o'/>
<Glyph ID='g9' VPOS='3977' HPOS='2910' HEIGHT='118' WIDTH='40' GC='0.9' CONTENT='a'>
<Variant CONTENT='e' VC='0.7'/>
</Glyph>
<Glyph ID='g10' VPOS='3977' HPOS='2950' HEIGHT='118' WIDTH='35' GC='0.9' CONTENT='y'/>
<Span begin='g1' end='g4'/>
<Span begin='g2' end='g3'/>
<Span begin='g3' end='g4'/>
<Span begin='g4' end='g5'/>
<Span begin='g4' end='g6'/>
<Span begin='g5' end='g7'/>
<Span begin='g6' end='g8'/>
<Span begin='g7' end='g9'/>
<Span begin='g8' end='g9'/>
<Span begin='g9' end='g10'/>
</Lattice>
</TextLine> |
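To make the node/edge reading of this proposal concrete, here is a minimal Python sketch that walks a Lattice element from its begin nodes to its end nodes and collects the text of each path. It is only an illustration under stated assumptions: the XML below is a cut-down copy of the second example (coordinates, confidences and Variant children omitted), the function name lattice_paths is made up, and the begin/end lists are split on whitespace or commas so that either spelling is tolerated.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down copy of the second example: only IDs, CONTENT and Spans are kept.
LATTICE_XML = """
<Lattice ID="s1" begin="g1 g2" end="g10">
  <Glyph ID="g1" CONTENT="m"/>
  <Glyph ID="g2" CONTENT="n"/>
  <Glyph ID="g3" CONTENT="i"/>
  <Glyph ID="g4" CONTENT="y"/>
  <Glyph ID="g5" CONTENT=" "/>
  <Glyph ID="g6" CONTENT=" ,"/>
  <Glyph ID="g7" CONTENT="p"/>
  <Glyph ID="g8" CONTENT="o"/>
  <Glyph ID="g9" CONTENT="a"/>
  <Glyph ID="g10" CONTENT="y"/>
  <Span begin="g1" end="g4"/>
  <Span begin="g2" end="g3"/>
  <Span begin="g3" end="g4"/>
  <Span begin="g4" end="g5"/>
  <Span begin="g4" end="g6"/>
  <Span begin="g5" end="g7"/>
  <Span begin="g6" end="g8"/>
  <Span begin="g7" end="g9"/>
  <Span begin="g8" end="g9"/>
  <Span begin="g9" end="g10"/>
</Lattice>
"""


def lattice_paths(lattice):
    """Yield the text of every path from a begin node to an end node."""
    content = {g.get("ID"): g.get("CONTENT") for g in lattice.findall("Glyph")}
    successors = {}
    for span in lattice.findall("Span"):
        successors.setdefault(span.get("begin"), []).append(span.get("end"))
    # Tolerate both whitespace- and comma-separated ID lists.
    starts = re.split(r"[,\s]+", lattice.get("begin").strip())
    ends = set(re.split(r"[,\s]+", lattice.get("end").strip()))

    def walk(node, text):
        text += content[node]
        if node in ends:
            yield text
        for nxt in successors.get(node, []):
            yield from walk(nxt, text)

    for start in starts:
        yield from walk(start, "")


readings = list(lattice_paths(ET.fromstring(LATTICE_XML.strip())))
for reading in readings:
    print(repr(reading))
```

The sketch ignores Variant children and confidences, so it only enumerates segmentation alternatives, not the content alternatives attached to individual glyphs.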
Excellent work @bertsky, an example makes all the difference in the world. |
Thx @artunit – I just hope we will get a vivid discussion this time... |
Hi @bertsky, this is some feedback from Reeve Ingle at Google, who has done way more heavy lifting with lattice models than I have.
Sorry that it's been a little quiet in here, the summer doesn't seem to be a busy time for github activity. |
Hi @artunit, thanks for relaying!
IMHO having only one kind of confidence score in the representation will always be enough, because you cannot separate the weight combination and the LM score calculation anyway: language models in general need a history of more than one lattice element as input (usually a sequence of varying length of previous tokens for n-gram models, or a fixed window of characters/words for RNN models), therefore you cannot represent an LM score at some lattice element independent of its partial path. Moreover, it is generally infeasible to enumerate all possible paths as LM input, so rescoring needs to prune away some partial paths at each node, which again cannot be done separately from weight combination (without losing information/accuracy). |
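A toy illustration of the history argument, as a sketch only: the bigram probabilities and the interpolation weight below are made-up numbers, and the OCR confidences are the GC values of g1 to g4 in the word segmentation example above. The glyph 'y' can be reached after 'm' or after 'i', and its language model contribution differs between the two, so no single LM score could be stored on that glyph.

```python
import math

# OCR confidences: GC values of glyphs g1..g4 in the example above.
OCR = {"m": 0.9, "n": 0.75, "i": 0.9, "y": 0.9}
# Toy character-bigram probabilities: made-up numbers, for illustration only.
# "^" marks the start of the sequence.
BIGRAM = {("^", "m"): 0.05, ("^", "n"): 0.04, ("n", "i"): 0.10,
          ("m", "y"): 0.08, ("i", "y"): 0.01}
LM_WEIGHT = 0.5  # hypothetical interpolation weight


def step_score(prev, char):
    """Weighted log-linear combination of OCR and LM score for one step."""
    return ((1 - LM_WEIGHT) * math.log(OCR[char])
            + LM_WEIGHT * math.log(BIGRAM[(prev, char)]))


def path_score(chars):
    """Sum the combined step scores along one partial path."""
    return sum(step_score(p, c) for p, c in zip("^" + chars, chars))


# The same lattice node 'y' scores differently depending on its predecessor.
y_after_m = step_score("m", "y")
y_after_i = step_score("i", "y")
scores = {text: path_score(text) for text in ("my", "niy")}
print(y_after_m, y_after_i, scores)
```

With a longer history (higher-order n-grams or an RNN) the dependence on the partial path only grows, which is the reason given above for combining the weights during rescoring rather than storing separate scores.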
Hi @bertsky - sorry for the radio silence. The summer is over, and there is a Board meeting at the end of the month, I am hoping we can get this thread moving again. |
Okay – let me know if you need any more input from my side. |
@urieli, @bertsky - At the 2019-09-27 ALTO Board meeting, there was general agreement that encoding OCR uncertainty and alternative hypotheses via a lattice, or similar model, would be a good topic for the ALTO Fall F2F gathering. The meeting will be held right before the 2019 IIIF Working meeting. The IIIF event runs November 4-7 at the University of Michigan campus in Ann Arbor, Michigan, and the F2F meeting will be held Sunday, Nov. 3 from 3 to 6 pm at the aadlfreespace room of the downtown branch of the Ann Arbor District Public Library. This location is about a 10 minute walk from the U. of Michigan campus. You are welcome to attend if this might be viable for you, or I can try to set up a virtual option. Please let me know if you are interested and we can figure out the next steps. |
@artunit Thanks for the offer! Interested yes, that's if there will be a virtual option. But I am not sure whether I can make it at that time: Living in Germany (UTC+1), which is 6h ahead of Michigan (EST / UTC-5), this will be from 9pm to midnight on a Sunday for me. Perhaps if there was a slot dedicated to lattice extension in the first hour? |
@urieli, @bertsky - I have set up a zoom meeting. If you use this link on Nov. 3, I will have a boom microphone set up and hopefully the technology will fall into place. I tried this for a meeting in Brussels, and the virtual pieces had a few glitches, but I can take some better equipment to Michigan than was possible for Belgium. Thank you both for considering this, those time zone differences can play havoc with ALTO events. |
@urieli, @bertsky - Just a reminder about the upcoming meeting on Sunday, Nov. 3 from 3 to 6 pm EST - available via zoom with this link. |
As per the 2019-11-03 meeting, the lattice discussion will be moved to issue 63 - ALTO support for encoding OCR uncertainty. StringVariantType will be brought forward to the Board for consideration. Thanks to @urieli and @bertsky for all of the work on these important issues for ALTO's evolution. |
Circling back to the original proposal from @urieli, now that the lattice proposal is part of a separate issue, it is worth restating that this is to represent different guesses at how a String should be split into entire glyphs. The <xsd:complexType name="StringVariantType" mixed="false">
<xsd:sequence minOccurs="0">
<xsd:element name="Glyph" type="GlyphType" minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="CONTENT" type="CONTENTType" use="required"/>
<xsd:attribute name="WC" type="WCType" use="optional"/>
</xsd:complexType> See the example in the comment above for more detail. |
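As a hedged sketch of how a consumer might read such variants, the Python below parses a hypothetical instance and picks the split with the highest WC. The element name Variant, the function name read_variants and the WC values of the variants are assumptions made for illustration; only CONTENT, WC and the Glyph children come from the type definition above, and the glyph confidences are borrowed from the earlier "ever" example.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: the element name "Variant" and the variant WC
# values are assumptions; the schema above only defines the type.
SAMPLE = """
<String CONTENT="ever" WC="0.87">
  <Variant CONTENT="ever" WC="0.87">
    <Glyph CONTENT="e" GC="0.94"/>
    <Glyph CONTENT="v" GC="0.85"/>
    <Glyph CONTENT="e" GC="0.92"/>
    <Glyph CONTENT="r" GC="0.78"/>
  </Variant>
  <Variant CONTENT="evw" WC="0.24">
    <Glyph CONTENT="e" GC="0.94"/>
    <Glyph CONTENT="v" GC="0.85"/>
    <Glyph CONTENT="w" GC="0.24"/>
  </Variant>
</String>
"""


def read_variants(string_elem):
    """Return (content, confidence, glyph split) for each variant."""
    variants = []
    for variant in string_elem.findall("Variant"):
        split = [g.get("CONTENT") for g in variant.findall("Glyph")]
        # Sanity check: the glyphs of a split should spell the variant.
        if split and "".join(split) != variant.get("CONTENT"):
            raise ValueError("glyph split does not match CONTENT")
        # WC is optional in the type definition; treat a missing WC as 0.0.
        variants.append((variant.get("CONTENT"),
                         float(variant.get("WC", "0")), split))
    return variants


variants = read_variants(ET.fromstring(SAMPLE.strip()))
best = max(variants, key=lambda v: v[1])
print(best)
```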
@artunit Thanks for the info, and sorry I couldn't manage to attend any of the virtual meetings. I would love to see a solution allowing the encoding of multiple hypotheses with their respective confidences in a near-future version of Alto. The solution outlined here suits my immediate needs, but an XML-encoded lattice with confidence would do so as well. |
One of the most inherently difficult OCR tasks is segmenting a String into Glyphs. Because of ink or wearing problems, two glyphs can be merged on the page without any separating white space, or a single glyph can be split by white space.
As a developer of OCR software, I would like to be able to output alternative splits for a single String, with confidence attached to each split.
Alto currently provides no way of outputting these alternatives. The existing ALTERNATIVEType and VariantType are not sufficient, because they only allow alternative content to be expressed, not alternative splits.
One way to attain this would be:
This, however, would make it possible to define a different HPOS, VPOS, HEIGHT and WIDTH for the String, which is not desired.
Another approach would be:
Yet a third way would be to extend the existing ALTERNATIVEType to include confidence and glyphs:
However, this implies a redefinition of ALTERNATIVEType, which is currently expressed as "a variant of writing by new typing / spelling rules".