ALTO support for encoding OCR segmentation ambiguity #63
Comments
Note: the above link was merely the example; the actual proposal was a few comments before that. In my view, the most significant progress over that was the aspect of backwards compatibility ("lattice opt-in"), which had been brought up by @artunit and for which @cipriandinu delivered a viable solution: keep the old representation and add the Lattice merely as an optional sibling, i.e.

<xsd:element name="TextLine" maxOccurs="unbounded">
  ... <!-- documentation -->
  <xsd:complexType>
    <xsd:sequence>
      ... <!-- Shape sequence -->
      <!-- old representation: -->
      <xsd:sequence maxOccurs="unbounded">
        <xsd:sequence>
          <xsd:element name="String" type="StringType"/>
          <xsd:element name="SP" type="SPType" minOccurs="0"/>
        </xsd:sequence>
        <xsd:element name="HYP" minOccurs="0">
          ... <!-- HYP type -->
        </xsd:element>
      </xsd:sequence>
      <!-- new representation: -->
      <xsd:element name="Lattice" type="LatticeType" minOccurs="0"/>
    </xsd:sequence>
    ... <!-- attributes etc -->
  </xsd:complexType>
  ...
</xsd:element>

for global (word segmentation and glyph segmentation) ambiguity, and

<xsd:complexType name="StringType" mixed="false">
  ... <!-- documentation -->
  <xsd:sequence minOccurs="0">
    <!-- old representation: -->
    <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
    <xsd:element name="ALTERNATIVE" type="ALTERNATIVEType" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="Glyph" type="GlyphType" minOccurs="0" maxOccurs="unbounded"/>
    <!-- new representation: -->
    <xsd:element name="Lattice" type="LatticeType" minOccurs="0"/>
  </xsd:sequence>
  ... <!-- attributes etc -->
</xsd:complexType>

for local (glyph segmentation) ambiguity. The actual lattice structure would be defined as follows:

<xsd:complexType name="LatticeType">
  <xsd:sequence>
    <!-- re-use GlyphType for nodes (but allow it to be used for white space as well): -->
    <xsd:element name="Glyph" type="GlyphType" maxOccurs="unbounded"/>
    <!-- introduce a simple edge type: -->
    <xsd:element name="Span" type="SpanType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
  <xsd:attribute name="ID" type="xsd:ID" use="optional"/>
  <!-- for convenience, summarize initial nodes (yes, plural here): -->
  <xsd:attribute name="begin" type="xsd:IDREFS" use="required"/>
  <!-- for convenience, summarize terminal nodes (yes, plural here): -->
  <xsd:attribute name="end" type="xsd:IDREFS" use="required"/>
</xsd:complexType>
<xsd:complexType name="SpanType">
  <!-- ID of incoming Glyph: -->
  <xsd:attribute name="begin" type="xsd:IDREF" use="required"/>
  <!-- ID of outgoing Glyph: -->
  <xsd:attribute name="end" type="xsd:IDREF" use="required"/>
</xsd:complexType>

Practically, since IDREFs have document-wide scope, the …

Regarding the fear of "exploding" file size, lattices can always be pruned to reduce the size (and precision). This is a trade-off which users will have to make based on their use case. (It is a trade-off that n-best lists offer, too, but extremely inefficiently, because they do not avoid the combinatorial explosion.) However, all this is also related to #54, because: …
So we should wait for that issue to be resolved on its own, and then continue here. BTW, I liked the old title better. Now that the …
Looks good! I have changed the title.
An additional idea might be to give some probabilities for each Span:

<xsd:complexType name="SpanType">
  <xsd:attribute name="begin" type="xsd:IDREF" use="required"/>
  <xsd:attribute name="end" type="xsd:IDREF" use="required"/>
  <!-- reference a globally declared probability attribute: -->
  <xsd:attribute ref="probability_attr" use="optional"/>
</xsd:complexType>
<xsd:attribute name="probability_attr">
  ... <!-- e.g. restrict xsd:float to the range 0..1 -->
</xsd:attribute>

The highest product of probabilities along a path should lead to the best option, encoded as String/TextLine. On the other hand, these probabilities represent the initial assessment made by the recognition engine; an ALTO processor may ignore them and apply its own probabilities, which may lead to a different best option. (The real benefit is that, in time, better models can be applied to the same Lattice in order to rebuild the probabilities and generate a different, and maybe more accurate, best result.)
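To make this concrete, here is a minimal, hypothetical instance fragment under the proposed LatticeType/SpanType (all IDs, contents and confidence values are invented, and coordinate attributes are omitted for brevity). It shows a TextLine that keeps the conventional String while opting in to a Lattice whose Spans carry the optional probabilities suggested above, for the classic "rn" vs "m" confusion:

<TextLine ID="line_1">
  <!-- old representation, kept for backwards compatibility ("lattice opt-in"): -->
  <String ID="str_1" CONTENT="corn" WC="0.58"/>
  <!-- new, optional representation of the full segmentation ambiguity: -->
  <Lattice begin="g_c" end="g_m g_n">
    <!-- nodes (GlyphType re-used): -->
    <Glyph ID="g_c" CONTENT="c" GC="0.98"/>
    <Glyph ID="g_o" CONTENT="o" GC="0.95"/>
    <Glyph ID="g_m" CONTENT="m" GC="0.40"/>
    <Glyph ID="g_r" CONTENT="r" GC="0.55"/>
    <Glyph ID="g_n" CONTENT="n" GC="0.57"/>
    <!-- edges (SpanType), optionally weighted as suggested above: -->
    <Span begin="g_c" end="g_o" probability="1.0"/>
    <Span begin="g_o" end="g_m" probability="0.42"/> <!-- reading "com" -->
    <Span begin="g_o" end="g_r" probability="0.58"/> <!-- reading "corn" -->
    <Span begin="g_r" end="g_n" probability="1.0"/>
  </Lattice>
</TextLine>

In this sketch the probability product along g_c→g_o→g_r→g_n (0.58) beats g_c→g_o→g_m (0.42), so the conventional String still records "corn" as the single best reading, while the lattice preserves the alternative "com".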
Thanks @artunit! To complement the discussion on a possible lattice extension to fully represent OCR ambiguity with a perspective more inclined towards an extension based on the confidence/confusion matrix (henceforth, confmat), here are some arguments provided by @gundramleifert, grouped by different aspects:

Technological dependency
Still unclear:
Generality

A lattice can be used to represent a confmat, too: by adding or prolonging a character edge for each output channel at each position. The null/not-a-character channel must be a special or gap symbol. So it is possible, but extremely inefficient; yet again, one could apply pruning to the desired degree. (A sketch of such an encoding follows at the end of this comment.)

Interdependence

With RNN+CTC, lattices are generated from the confmat via CTC approximation. Lattices often implicitly rely on a language model (LM) for pruning, whereas confmats don't. Once you have pruned a lattice with some LM, you lose information (which even a larger/better LM could not recover). With confmats, on the other hand, decoding and LM scoring are always deferred to runtime (so you can still decide on the LM later without losing information). Decoding the confmat might require knowledge of the specific encoding of the original model at runtime. (Channels are not necessarily identical to the character set, especially for large scripts.) And the runtime cost of CTC approximation for confmats will usually be larger than the cost of Viterbi search for lattices.

Memory efficiency

Without information loss
With information loss
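For the Generality point above, here is a rough, hypothetical sketch of how two confmat time steps with three output channels ("e", "c", and the null/not-a-character channel, written here as "·") could be packed into the proposed LatticeType, one Glyph node per channel and position. All IDs, the gap symbol and the numbers are invented, and only three of the nine Spans between the two steps are spelled out:

<Lattice begin="t1_e t1_c t1_gap" end="t2_e t2_c t2_gap">
  <!-- time step 1: one Glyph node per output channel, GC = channel probability -->
  <Glyph ID="t1_e"   CONTENT="e" GC="0.70"/>
  <Glyph ID="t1_c"   CONTENT="c" GC="0.25"/>
  <Glyph ID="t1_gap" CONTENT="·" GC="0.05"/> <!-- null/not-a-character channel as gap symbol -->
  <!-- time step 2 -->
  <Glyph ID="t2_e"   CONTENT="e" GC="0.10"/>
  <Glyph ID="t2_c"   CONTENT="c" GC="0.60"/>
  <Glyph ID="t2_gap" CONTENT="·" GC="0.30"/>
  <!-- edges: every channel at step 1 connects to every channel at step 2 (3 x 3 = 9 Spans) -->
  <Span begin="t1_e" end="t2_e"/>
  <Span begin="t1_e" end="t2_c"/>
  <Span begin="t1_e" end="t2_gap"/>
  <!-- ... remaining 6 Spans analogous ... -->
</Lattice>

The quadratic number of Spans per pair of adjacent time steps, plus the need for a dedicated gap symbol, is exactly why this encoding is possible but extremely inefficient unless pruned.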
Thanks so much for this analysis! At the 2019-12-13 ALTO Board Meeting, the idea was put forward that we hold a single-topic meeting on this discussion in January 2020, since it is high priority and also connects to the important confidence-value issue. I will follow up with you and @gundramleifert to see if there is opportune timing in the last two weeks of January (Jan. 20 onward) for us to put together a virtual gathering. If we can make the arrangements, I will post the Zoom link here.
@bertsky Can you give a link to the confusion matrix discussion? And/or show how a confusion matrix would be encoded in an XSD/sample XML?
At the 2019-05-07 ALTO face-to-face Board Meeting there was support for identifying a standard and interoperable lattice structure for encoding OCR uncertainty and alternative hypotheses, and the discussion since then has largely been attached to issue #57 (Allow a String to contain alternative Glyph segmentation hypotheses). Significant progress has been made, including an extensive proposal from @bertsky, and with momentum from the 2019-11-03 f2f meeting, it makes sense to move the discussion to a new issue. Further updates will be tracked here.