PAGE without words #5

Open
mikegerber opened this issue Apr 1, 2021 · 7 comments

@mikegerber
Contributor

mikegerber commented Apr 1, 2021

@kba asked me to put this comment from a private Gitter conversation into an issue:

Regarding "input PAGE-XML not having words": my input would be that I can live with PAGE without Word elements simply not being convertible. In my opinion, word segmentation is not even appropriate at this stage and should be done either by the layout segmentation or by the OCR. (And by the OCR only because the CTC positions yield, as a by-product, a glyph segmentation that is usable for some purposes and can be transferred to words relatively easily, as in ocrd_calamari.)

@kba
Member

kba commented Apr 7, 2021

cd0bb8d provides a CLI flag --(no-)skip-empty-lines which allows either skipping empty lines or creating a dummy full-width empty word (the default).

5a32ea3 provides a CLI flag --(no-)check-words which, if enabled (the default), aborts before conversion if there aren't any pc:Word in the PAGE-XML. This will however fail on empty pages - should I also check for any pc:TextLine present to catch that special case?

@mikegerber
Contributor Author

> 5a32ea3 provides a CLI flag --(no-)check-words which, if enabled (the default), aborts before conversion if there aren't any pc:Word in the PAGE-XML. This will however fail on empty pages - should I also check for any pc:TextLine present to catch that special case?

Ah, the devil is in the details. Provided that ALTO allows pages without lines (I hope so), I would say: If there are pc:TextLines with non-empty and non-whitespace TextEquiv, check that there are pc:Words in them. (The extra pc:TextLines without text and pc:Words are then to be handled by --(no-)skip-empty-lines behavior.) The warning should be clear enough for users to discover that they need to provide input with pc:Words for the ALTO transformation to work as intended.

@kba
Member

kba commented Apr 7, 2021

I've implemented the proposed behavior in 4c6b3bf:

def check_words(self):
    # Abort if any text line carries text but has no pc:Word children.
    for reg_page in self.page_page.get_AllRegions(classes=['Text']):
        for line_page in reg_page.get_TextLine():
            textequiv = line_page.get_TextEquiv()
            if any(x.Unicode for x in textequiv) and not line_page.get_Word():
                raise ValueError("Line %s has TextEquiv but no words, so cannot be converted to ALTO without losing information. Use --no-check-words to override" % line_page.id)
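
For readers without the OCR-D libraries at hand, the rule proposed above (only lines with non-empty, non-whitespace text must contain words) can be sketched with the standard library alone. This is an illustrative approximation, not the converter's code: the function name `lines_missing_words` is hypothetical, the namespace is the PAGE 2019-07-15 one used elsewhere in this thread, and unlike the snippet above it also ignores whitespace-only text.

```python
import xml.etree.ElementTree as ET

# Assumption: input uses the PAGE 2019-07-15 namespace.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def lines_missing_words(page_xml: str) -> list:
    """Return the IDs of pc:TextLine elements that have text but no pc:Word."""
    root = ET.fromstring(page_xml)
    missing = []
    for line in root.iterfind(".//pc:TextLine", PAGE_NS):
        # Only the line's own TextEquiv counts, not those of nested words.
        texts = [u.text or "" for u in line.iterfind("pc:TextEquiv/pc:Unicode", PAGE_NS)]
        has_text = any(t.strip() for t in texts)
        has_words = line.find("pc:Word", PAGE_NS) is not None
        if has_text and not has_words:
            missing.append(line.get("id"))
    return missing
```

A page without any lines yields an empty list, so empty pages pass this check; lines without text are left to whatever the empty-line handling decides.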

@kba
Member

kba commented Apr 13, 2021

In addition, the converter now also handles propagation of pc:TextEquiv of a pc:TextRegion down to a dummy pc:TextLine and to a dummy pc:Word with the --dummy-textline and --dummy-word flags:

With both those flags:

<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd" pcGtsId="OCR-D-OCR-CALAMARI_00001">
    <pc:Page imageFilename="OCR-D-IMG/044417.jpg" imageWidth="3195" imageHeight="4370" type="content">
        <pc:TextRegion id="r0">
            <pc:Coords points="0,0 1,1"/>
            <pc:TextEquiv>
                <pc:Unicode>CONTENT BUT NO LINES</pc:Unicode>
            </pc:TextEquiv>
        </pc:TextRegion>
    </pc:Page>
</pc:PcGts>

becomes

<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
    <sourceImageInformation>
      <fileName>OCR-D-IMG/044417.jpg</fileName>
    </sourceImageInformation>
  </Description>
  <Styles/>
  <Tags/>
  <Layout>
    <Page ID="OCR-D-OCR-CALAMARI_00001" PHYSICAL_IMG_NR="0" WIDTH="4370" HEIGHT="4370">
      <TopMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
      <LeftMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
      <RightMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
      <BottomMargin VPOS="0" HPOS="0" HEIGHT="0" WIDTH="0"/>
      <PrintSpace>
        <TextBlock ID="r0" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0">
          <Shape>
            <Polygon POINTS="0,0 1,1"/>
          </Shape>
          <TextLine ID="r0-dummy-TextLine" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0">
            <Shape>
              <Polygon POINTS="0,0 1,1"/>
            </Shape>
            <String ID="r0-dummy-TextLine-dummy-Word" HEIGHT="1" WIDTH="1" HPOS="0" VPOS="0" CONTENT="CONTENT BUT NO LINES">
              <Shape>
                <Polygon POINTS="0,0 1,1"/>
              </Shape>
            </String>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
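
The propagation shown above can be approximated in a few lines. The following is a minimal sketch, not the converter's implementation: `bounding_box` and `dummy_textblock` are hypothetical helper names, the ID suffixes are copied from the output above, and the Shape/Polygon children are omitted for brevity. The region's polygon is reduced to its bounding box and reused unchanged for the dummy line and the dummy word.

```python
import xml.etree.ElementTree as ET


def bounding_box(points: str) -> dict:
    """Convert a PAGE points string ("x,y x,y ...") to ALTO box attributes."""
    coords = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {
        "HPOS": str(min(xs)),
        "VPOS": str(min(ys)),
        "WIDTH": str(max(xs) - min(xs)),
        "HEIGHT": str(max(ys) - min(ys)),
    }


def dummy_textblock(region_id: str, points: str, text: str) -> ET.Element:
    """Build TextBlock > TextLine > String, all sharing the region's box."""
    box = bounding_box(points)
    block = ET.Element("TextBlock", ID=region_id, **box)
    line_id = f"{region_id}-dummy-TextLine"
    line = ET.SubElement(block, "TextLine", ID=line_id, **box)
    # The whole region text ends up in a single String, spaces included.
    ET.SubElement(line, "String", ID=f"{line_id}-dummy-Word", CONTENT=text, **box)
    return block


block = dummy_textblock("r0", "0,0 1,1", "CONTENT BUT NO LINES")
print(ET.tostring(block, encoding="unicode"))
```

Because the dummy word inherits the region's coordinates, a viewer that highlights words would highlight the entire region.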

@M3ssman

M3ssman commented Apr 28, 2021

I haven't seen any ALTO (from Tesseract) like this in 1.5 years. It uses the SP (space) element to signal in-between whitespace. There shall be no whitespace within the CONTENT. It would be very hard not just to split the line into words, but also to provide each word token with proper coordinates and dimensions without knowing the font type. Although this information is optional, it wouldn't make much sense to present output like this to a client viewer.

Also I wonder in which contexts PAGE like this might originate. At least not from regular OCR-D-Workflows?
Of course, in the wild I've seen weird PAGE files produced by Transkribus, but that is a Transkribus issue. It comes from the data export, where one can choose whether only lines or also words are included.

In the context of transforming OCR-D-Pipeline-output to proper client-viewer ALTO I guess this makes no sense, therefore I'd prefer raising an exception or something alike.

@kba
Member

kba commented Apr 28, 2021

> I haven't seen any ALTO (from Tesseract) like this in 1.5 years. It uses the SP (space) element to signal in-between whitespace. There shall be no whitespace within the CONTENT.

I understand the reasoning that it doesn't make sense to have a word with spaces in it. But from the PAGE XSD, I don't see this restriction. We do use that for the pseudo-words, i.e. a String whose CONTENT is the line-level text and whose coordinates are those of the line. I hope that ALTO consumers are robust enough to handle this.
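
For consumers that want to detect such pseudo-words rather than rely on robustness, one possible check is to look for String elements whose CONTENT contains whitespace. This is only a sketch under assumptions: the function name `pseudo_words` is hypothetical and the namespace is the ALTO v4 one from the example output in this thread.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v4 namespace, as in the example output in this thread.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}


def pseudo_words(alto_xml: str) -> list:
    """Return the IDs of String elements whose CONTENT contains whitespace."""
    root = ET.fromstring(alto_xml)
    return [
        string.get("ID")
        for string in root.iterfind(".//alto:String", ALTO_NS)
        if any(ch.isspace() for ch in string.get("CONTENT", ""))
    ]
```

A viewer could use the result to fall back to line-level display for the affected lines.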

> It would be very hard not just to split the line into words, but also to provide each word token with proper coordinates and dimensions without knowing the font type. Although this information is optional, it wouldn't make much sense to present output like this to a client viewer.

Agreed, trying to implement heuristics to derive words and their coordinates from a line-level TextEquiv is too error-prone to be worth the effort.

> Also I wonder in which contexts PAGE like this might originate. At least not from regular OCR-D-Workflows?
> Of course, in the wild I've seen weird PAGE files produced by Transkribus, but that is a Transkribus issue. It comes from the data export, where one can choose whether only lines or also words are included.

Calamari also has this issue, cf. Calamari-OCR/calamari#172. @mikegerber mitigates this in ocrd_calamari though, so unless you explicitly parameterize a processor with something like -P textequiv_level line, OCR-D output should include Words (except for empty pages obviously).

> In the context of transforming OCR-D-Pipeline-output to proper client-viewer ALTO I guess this makes no sense, therefore I'd prefer raising an exception or something alike.

OK, thanks for the feedback. This should be the default behavior - if there is any line-level TextEquiv with no word-level TextEquiv, a ValueError is raised (unless overridden with --no-check-words).

@mikegerber
Contributor Author

> Calamari also has this issue, cf. Calamari-OCR/calamari#172. @mikegerber mitigates this in ocrd_calamari though, so unless you explicitly parameterize a processor with something like -P textequiv_level line, OCR-D output should include Words (except for empty pages obviously).

Side note: -P textequiv_level line is the default, so it is the other way around: You have to explicitly ask for words (e.g. word or even glyph), otherwise the output will not contain them.
