PAGE without words #5
Comments
cd0bb8d provides a CLI flag 5a32ea3 provides a CLI flag |
Ah, the devil is in the details. Provided that ALTO allows pages without lines (I hope so), I would say: If there are |
I've implemented the proposed behavior in 4c6b3bf:

    def check_words(self):
        # Walk every text region and each of its lines
        for reg_page in self.page_page.get_AllRegions(classes=['Text']):
            for line_page in reg_page.get_TextLine():
                textequiv = line_page.get_TextEquiv()
                # A line with text but no words cannot be represented in ALTO without loss
                if any(x.Unicode for x in textequiv) and not line_page.get_Word():
                    raise ValueError(
                        "Line %s has TextEquiv but not words, so cannot be converted to ALTO "
                        "without losing information. Use --no-skip-words to override" % line_page.id)
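For readers without the converter at hand, the same check can be sketched against raw PAGE XML using only the standard library. This is a minimal illustration, not the converter's actual code: the function names `lines_without_words` and `check_words`, the `skip_check` parameter, and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of the PAGE 2019-07-15 schema, as used in the example in this thread
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: line "l0" has a Word, line "l1" has only line-level text
SAMPLE_PAGE = """
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page>
    <pc:TextRegion id="r0">
      <pc:TextLine id="l0">
        <pc:Word id="w0"><pc:TextEquiv><pc:Unicode>Hello</pc:Unicode></pc:TextEquiv></pc:Word>
        <pc:TextEquiv><pc:Unicode>Hello</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
      <pc:TextLine id="l1">
        <pc:TextEquiv><pc:Unicode>No words here</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
    </pc:TextRegion>
  </pc:Page>
</pc:PcGts>
""".strip()


def lines_without_words(page_xml: str) -> List[str]:
    """Return the ids of TextLines that carry text but contain no Word."""
    root = ET.fromstring(page_xml)
    offending = []
    for line in root.iterfind(".//pc:TextLine", PAGE_NS):
        # Only TextEquiv elements that are direct children count as line-level text
        has_text = any(
            (unicode_el.text or "").strip()
            for unicode_el in line.iterfind("pc:TextEquiv/pc:Unicode", PAGE_NS)
        )
        has_words = line.find("pc:Word", PAGE_NS) is not None
        if has_text and not has_words:
            offending.append(line.get("id"))
    return offending


def check_words(page_xml: str, skip_check: bool = False) -> None:
    """Raise ValueError if any line would lose text; skip_check is a hypothetical override."""
    if skip_check:
        return
    offending = lines_without_words(page_xml)
    if offending:
        raise ValueError(
            "Lines %s have TextEquiv but no words" % ", ".join(offending)
        )
```

Collecting all offending line ids before raising, rather than stopping at the first one, is a design choice of this sketch; it makes the error message more useful when many lines are affected.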
In addition, the converter now also handles propagation of With both those flags:
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd" pcGtsId="OCR-D-OCR-CALAMARI_00001">
<pc:Page imageFilename="OCR-D-IMG/044417.jpg" imageWidth="3195" imageHeight="4370" type="content">
<pc:TextRegion id="r0">
<pc:Coords points="0,0 1,1"/>
<pc:TextEquiv>
<pc:Unicode>CONTENT BUT NO LINES</pc:Unicode>
</pc:TextEquiv>
</pc:TextRegion>
</pc:Page>
</pc:PcGts>
becomes
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>OCR-D-IMG/044417.jpg</fileName>
</sourceImageInformation>
</Description>
<Styles/>
<Tags/>
<Layout>
<Page ID="OCR-D-OCR-CALAMARI_00001" PHYSICAL_IMG_NR="0" WIDTH="4370" HEIGHT="4370">
<TopMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
<LeftMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
<RightMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
<BottomMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
<PrintSpace>
<TextBlock ID="r0" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0">
<Shape>
<Polygon POINTS="0,0 1,1"/>
</Shape>
<TextLine ID="r0-dummy-TextLine" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0">
<Shape>
<Polygon POINTS="0,0 1,1"/>
</Shape>
<String ID="r0-dummy-TextLine-dummy-Word" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0" CONTENT="CONTENT BUT NO LINES">
<Shape>
<Polygon POINTS="0,0 1,1"/>
</Shape>
</String>
</TextLine>
</TextBlock>
</PrintSpace>
</Page>
</Layout>
</alto>
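The propagation shown above can be sketched roughly as follows: a region that has text but no lines gets one dummy TextLine containing one dummy String, both reusing the region's polygon and bounding box. This is an illustration only, not the converter's implementation. The helper names `bbox_from_points` and `region_to_textblock` are hypothetical, ALTO namespaces are omitted for brevity, and the ID suffixes are copied from the output above.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def bbox_from_points(points: str) -> Dict[str, str]:
    """Derive ALTO bounding-box attributes from a PAGE 'x,y x,y ...' points string."""
    coords = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {
        "HEIGHT": str(max(ys) - min(ys)),
        "WIDTH": str(max(xs) - min(xs)),
        "HPOS": str(min(xs)),
        "VPOS": str(min(ys)),
    }


def add_shape(parent: ET.Element, points: str) -> None:
    """Attach a Shape/Polygon child carrying the original polygon."""
    shape = ET.SubElement(parent, "Shape")
    ET.SubElement(shape, "Polygon", {"POINTS": points})


def region_to_textblock(region_id: str, points: str, text: str) -> ET.Element:
    """Build a TextBlock whose only line and word are dummies holding the region text."""
    bbox = bbox_from_points(points)

    block = ET.Element("TextBlock", {"ID": region_id, **bbox})
    add_shape(block, points)

    # Dummy line: same geometry as the region
    line_id = region_id + "-dummy-TextLine"
    line = ET.SubElement(block, "TextLine", {"ID": line_id, **bbox})
    add_shape(line, points)

    # Dummy word: the whole region text becomes the CONTENT of a single String
    word_id = line_id + "-dummy-Word"
    string = ET.SubElement(line, "String", {"ID": word_id, **bbox, "CONTENT": text})
    add_shape(string, points)
    return block


block = region_to_textblock("r0", "0,0 1,1", "CONTENT BUT NO LINES")
print(ET.tostring(block, encoding="unicode"))
```

Because the dummy elements copy the region's geometry unchanged, a consumer cannot tell word positions from this output; only the text content is preserved.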
I haven't seen any ALTO (from Tesseract) like this in 1.5 years. It is using the spatium-element. Also, I wonder in which contexts PAGE like this might originate. At least not from regular OCR-D workflows? In the context of transforming OCR-D pipeline output to proper client-viewer ALTO, I guess this makes no sense, so I'd prefer raising an exception or something similar.
I understand the reasoning that it doesn't make sense to have a word with spaces in it. But I don't see this restriction in the PAGE XSD. We do use that for the pseudo-words, i.e. a String with the line-level text as its CONTENT and the line's coordinates. I hope that ALTO consumers are robust enough to handle this.
Agreed, trying to implement heuristics to derive words and their coordinates from a line-level TextEquiv is too error-prone to be worth the effort.
Calamari also has this issue, cf. Calamari-OCR/calamari#172. @mikegerber mitigates this in ocrd_calamari though, so unless you explicitly parameterize a processor with something like
OK, thanks for the feedback. This should be the default behavior - if there is any line-level TextEquiv with no word-level TextEquiv, a ValueError is raised (unless overridden with |
Side note: |
@kba asked me to put this comment from a private Gitter conversation into an issue: