Add textract2page #160

Merged
merged 5 commits into from
May 5, 2023

Contributor

bertsky commented May 4, 2023

In contrast to all existing transformations, https://github.com/slub/textract2page MUST know the image file, so I also tried to make it easier for the user to know what script-args are possible/expected:

example calls for `--help-args`

> ocr-transform hocr page --help-args
        Usage: see http://www.saxonica.com/documentation/index.html#!using-xsl/commandline
        Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -jit -l -lib -license -m -nogo -now -ns -o -opt -or -outval -p -quit -r -relocate -repeat -s -sa -scmin -strip -t -T -target -TB -threads -TJ -Tlevel -Tout -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -y --?
        Use -XYZ:? for details of option XYZ
        Params: 
          param=value           Set stylesheet string parameter
          +param=filename       Set stylesheet document parameter
          ?param=expression     Set stylesheet parameter using XPath
          !param=value          Set serialization parameter

> ocr-transform gcv hocr --help-args
    Extra arguments: <width> <height>

> ocr-transform page alto --help-args
    page-to-alto options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if selected TextEquiv @index is
                                  not available: 'raise' will lead to a
                                  runtime error, 'first' will use the first
                                  TextEquiv, 'last' will use the last
                                  TextEquiv on the element
  --region-order [document|reading-order|reading-order-only]
                                  Order in which to iterate over the regions
  --textline-order [document|index|textline-order]
                                  Order in which to iterate over the textlines

> ocr-transform textract page --help-args
    textract2page arguments: <image-file>
    textract2page options:

Collaborator

@kba kba left a comment

Great addition, thanks @bertsky and @rue-a.

Also the more detailed help for the script-args is an improvement.

You need the image as an argument because the AWS Textract JSON does not contain the image (dimensions)?

Contributor Author

bertsky commented May 5, 2023

> You need the image as an argument because the AWS Textract JSON does not contain the image (dimensions)?

Exactly. Textract uses floating-point ratios (0..1) for all coordinates. So even if we could live with an empty or bogus @imageFilename, we would still need the width and height to calculate the absolute coordinates everywhere.
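
To illustrate why the image dimensions are required, here is a minimal sketch of the ratio-to-pixel conversion. It assumes the usual Textract `Geometry` layout (`BoundingBox` with `Left`/`Top`/`Width`/`Height`, `Polygon` as a list of `X`/`Y` pairs). The function names and the rounding are assumptions for illustration, not the code textract2page actually uses.

```python
def bbox_to_pixels(bbox, width, height):
    """Scale a ratio-based BoundingBox to absolute pixel corners (x0, y0, x1, y1)."""
    x0 = round(bbox["Left"] * width)
    y0 = round(bbox["Top"] * height)
    x1 = round((bbox["Left"] + bbox["Width"]) * width)
    y1 = round((bbox["Top"] + bbox["Height"]) * height)
    return x0, y0, x1, y1


def polygon_to_points(polygon, width, height):
    """Scale a ratio-based Polygon to a PAGE-style 'x,y x,y ...' points string."""
    return " ".join(
        f"{round(p['X'] * width)},{round(p['Y'] * height)}" for p in polygon
    )


# Hypothetical block geometry on an image of 1000 x 2000 pixels.
geometry = {
    "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25},
    "Polygon": [
        {"X": 0.25, "Y": 0.5},
        {"X": 0.75, "Y": 0.5},
        {"X": 0.75, "Y": 0.75},
        {"X": 0.25, "Y": 0.75},
    ],
}
box = bbox_to_pixels(geometry["BoundingBox"], 1000, 2000)
points = polygon_to_points(geometry["Polygon"], 1000, 2000)
```

Without `width` and `height`, neither function can produce anything usable for PAGE `Coords/@points`, which is why the image file has to be passed along.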

(BTW, gcv__hocr is another case which needs width and height, but apparently it cannot derive these from the image file, so I just added width and height as script-args there.)

@stweil stweil merged commit dd38c29 into UB-Mannheim:master May 5, 2023
Member

stweil commented May 5, 2023

Thank you!

Member

stweil commented May 5, 2023

I just noticed that this PR and also a previous commit ff11c35 require a virtual environment because of pip3.
That's currently neither documented nor handled automatically in the Makefile.

Contributor Author

bertsky commented May 5, 2023

> I just noticed that this PR and also a previous commit ff11c35 require a virtual environment because of pip3. That's currently neither documented nor handled automatically in the Makefile.

Indeed. I did not notice either. I would leave it to the user to set up a venv or virtualenv or conda environment though. So we would only need a few remarks in the readme IMO.

Contributor Author

bertsky commented May 5, 2023

On the other hand, we already make users set up a $HOME/.local/bin installation. It would be nice if that sufficed even for Python. For example, we could detect whether VIRTUAL_ENV is already defined and, if not, create a venv under the same PREFIX at install time and activate it within ocr-transform at run time.
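
A rough sketch of that detection logic, written in Python for illustration only (the real change would live in the Makefile and the ocr-transform shell script). The function name `resolve_venv` and the `PREFIX/venv` location are assumptions, not anything decided in this PR.

```python
import os
import venv
from pathlib import Path


def resolve_venv(prefix, environ=None, create=False):
    """Return the venv to use: the active one if any, else one under prefix."""
    environ = os.environ if environ is None else environ
    active = environ.get("VIRTUAL_ENV")
    if active:
        # The user already activated an environment; respect it.
        return Path(active)
    target = Path(prefix) / "venv"  # hypothetical location under PREFIX
    if create and not (target / "pyvenv.cfg").exists():
        # Install-time step: create the environment with pip available.
        venv.create(target, with_pip=True)
    return target
```

At install time this would be called with `create=True`; at run time the wrapper would only need the returned path to put its `bin` directory first on `PATH`.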

Contributor Author

bertsky commented May 5, 2023

#162
