Add textract2page #160

bertsky · 2023-05-04T17:23:58Z

In contrast to all existing transformations, https://github.com/slub/textract2page MUST know the image file, so I also tried to make it easier for the user to know what script-args are possible/expected:

Example calls for `--help-args`:

> ocr-transform hocr page --help-args
        Usage: see http://www.saxonica.com/documentation/index.html#!using-xsl/commandline
        Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -jit -l -lib -license -m -nogo -now -ns -o -opt -or -outval -p -quit -r -relocate -repeat -s -sa -scmin -strip -t -T -target -TB -threads -TJ -Tlevel -Tout -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -y --?
        Use -XYZ:? for details of option XYZ
        Params: 
          param=value           Set stylesheet string parameter
          +param=filename       Set stylesheet document parameter
          ?param=expression     Set stylesheet parameter using XPath
          !param=value          Set serialization parameter

> ocr-transform gcv hocr --help-args
    Extra arguments: <width> <height>

> ocr-transform page alto --help-args
    page-to-alto options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if selected TextEquiv @index is
                                  not available: 'raise' will lead to a
                                  runtime error, 'first' will use the first
                                  TextEquiv, 'last' will use the last
                                  TextEquiv on the element
  --region-order [document|reading-order|reading-order-only]
                                  Order in which to iterate over the regions
  --textline-order [document|index|textline-order]
                                  Order in which to iterate over the textlines

> ocr-transform textract page --help-args
    textract2page arguments: <image-file>
    textract2page options:

kba

Great addition, thanks @bertsky and @rue-a.

Also the more detailed help for the script-args is an improvement.

You need the image as an argument because the AWS Textract JSON does not contain the image (dimensions)?

bertsky · 2023-05-05T12:47:56Z

> You need the image as an argument because the AWS Textract JSON does not contain the image (dimensions)?

Exactly. Textract uses floating point ratios (0..1) for all coordinates. So even if we could live with empty or bogus @imageFilename, we need width and height to calculate the absolute coordinates everywhere.
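To illustrate why width and height are needed, here is a minimal sketch, assuming a Textract-style polygon given as a list of `{"X": ..., "Y": ...}` points with values in 0..1. The helper name `to_absolute_points` is hypothetical and is not taken from textract2page. The pixel dimensions would normally be read from the image file, and here they are simply passed in.

```python
def to_absolute_points(polygon, width, height):
    """Scale ratio coordinates (0..1) to integer pixel coordinates and
    format them like a PAGE-XML Coords/@points string ("x1,y1 x2,y2 ...")."""
    points = []
    for point in polygon:
        # Ratios are relative to the full image, so multiply by its size.
        x = round(point["X"] * width)
        y = round(point["Y"] * height)
        points.append(f"{x},{y}")
    return " ".join(points)


# Illustrative values only (not real Textract output).
polygon = [
    {"X": 0.1, "Y": 0.2},
    {"X": 0.5, "Y": 0.2},
    {"X": 0.5, "Y": 0.25},
    {"X": 0.1, "Y": 0.25},
]
print(to_absolute_points(polygon, width=2000, height=3000))
# prints: 200,600 1000,600 1000,750 200,750
```

Without `width` and `height` the ratios cannot be turned into pixel positions at all, regardless of what ends up in @imageFilename.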

(BTW, gcv__hocr is another case which needs width and height, but apparently it cannot derive these from the image file, so I just added width and height as script-args there.)

stweil · 2023-05-05T13:17:10Z

Thank you!

stweil · 2023-05-05T13:35:56Z

I just noticed that this PR and also a previous commit ff11c35 require a virtual environment because of pip3.
That's currently neither documented nor handled automatically in the Makefile.

bertsky · 2023-05-05T14:34:45Z

> I just noticed that this PR and also a previous commit ff11c35 require a virtual environment because of pip3. That's currently neither documented nor handled automatically in the Makefile.

Indeed. I did not notice either. I would leave it to the user to set up a venv or virtualenv or conda environment though. So we would only need a few remarks in the readme IMO.

bertsky · 2023-05-05T14:40:13Z

On the other hand, we already make users set up a $HOME/.local/bin installation. It would be nice if that would suffice even for Python. For example, we could detect whether VIRTUAL_ENV is already defined, and if not, then create one under the same PREFIX at install-time, and activate it within ocr-transform at run-time.
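The idea could be sketched roughly as follows. This is only an illustration of the detection logic, not what the Makefile or ocr-transform actually does: the function names and the `share/ocr-transform/venv` location under PREFIX are assumptions made for the example.

```python
import os
import venv
from pathlib import Path


def resolve_venv(environ, prefix):
    """Return (path, needs_create) for the Python environment to use."""
    active = environ.get("VIRTUAL_ENV")
    if active:
        # The user already activated an environment: reuse it as-is.
        return Path(active), False
    # Hypothetical location under the same PREFIX as the installation.
    path = Path(prefix) / "share" / "ocr-transform" / "venv"
    # pyvenv.cfg is written by the venv module when an environment is created.
    return path, not (path / "pyvenv.cfg").exists()


def ensure_venv(environ=os.environ, prefix=Path.home() / ".local"):
    """Install-time step: create the environment only if none is available."""
    path, needs_create = resolve_venv(environ, prefix)
    if needs_create:
        venv.create(path, with_pip=True)
    return path
```

At run-time, the wrapper script would then only need to put the `bin` directory of the returned path first on `PATH` before calling the Python-based transformations.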

bertsky · 2023-05-05T15:48:59Z

#162

bertsky added 5 commits May 4, 2023 16:08

add vendor/textract2page submodule

fcd412f

ocr-transform: allow passing --help-args to individual transforms

7d99faa

add textract__page (textract2page)

f5cb129

gcv__hocr: allow passing width and height

421818a

update Readme

ff4dc87

kba approved these changes May 5, 2023


stweil merged commit dd38c29 into UB-Mannheim:master May 5, 2023

stweil mentioned this pull request May 5, 2023

Support conversion from and to Textract JSON #122

Open
