Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)
Convert between Tesseract hOCR and ALTO XML 2.0/2.1 using XSL stylesheets
This project provides an installation path and command line interface for the stylesheets developed by @filak.
To install system-wide to /usr/local:
sudo make install
To install without sudo to your home directory:
make install PREFIX=$HOME/.local
If $HOME/.local/bin is not in your PATH, add this to your shell startup file (e.g. ~/.bashrc or ~/.zshrc):
export PATH="$HOME/.local/bin:$PATH"
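Startup files are often sourced more than once, so a guarded variant can be handy. The following is a minimal sketch, assuming a POSIX-compatible shell; path_prepend is a hypothetical helper name and not part of this project. It prepends the directory only when PATH does not already contain it.

```shell
# Prepend a directory to PATH only if it is not already listed.
# path_prepend is a hypothetical helper, not something this project installs.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;     # PATH entries are separated by colons
    esac
}

path_prepend "$HOME/.local/bin"
export PATH
```

Calling path_prepend a second time with the same directory leaves PATH untouched.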
The web application has a PHP backend. You can deploy it on any PHP-capable server by copying the web folder somewhere below the document root of your server, e.g. /var/www/html for Apache on Debian/Ubuntu:
sudo -u www-data cp -r web /var/www/html/ocr-schema
In this example the GUI would be available under http://localhost/ocr-schema/.
The project offers two functionalities, which can be accessed via a command line script (CLI), via a web interface (GUI), or from your own tools (API):
- ocr-transform: Transformation of OCR output between OCR formats
- ocr-validate: Validation of OCR output against OCR format schemas

- $PREFIX/share/ocr-schemas/xslt - XSLT stylesheets
- $PREFIX/share/ocr-schemas/xsd - XSD schemas
Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]
For example, you can transform an ALTO XML file to an hOCR file with:
ocr-transform alto hocr sample.xml sample.hocr
Or convert from hOCR to ALTO XML (version 2.1) with:
ocr-transform hocr alto2.1 sample.hocr sample.alto
You can also pass arguments directly to the Saxon CLI by adding them after a double dash (--). For example, to set the foo parameter to bar:
ocr-transform alto hocr sample.xml sample.hocr -- foo=bar
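Single-file calls like the ones above can be wrapped in a loop to convert a whole directory. This is only a sketch under stated assumptions: transform_dir and the OCR_TRANSFORM override variable are hypothetical names introduced for illustration, input files are assumed to end in .xml, and ocr-transform is assumed to return a non-zero exit status on failure.

```shell
# Convert every ALTO file in a directory to hOCR, one ocr-transform call per file.
# transform_dir and OCR_TRANSFORM are hypothetical, not part of this project.
transform_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for src in "$in_dir"/*.xml; do
        [ -e "$src" ] || continue            # the glob matched nothing
        name=$(basename "$src" .xml)
        # Assumes a non-zero exit status signals a failed conversion.
        "${OCR_TRANSFORM:-ocr-transform}" alto hocr "$src" "$out_dir/$name.hocr" \
            || echo "failed: $src" >&2
    done
}
```

For example, transform_dir scans/alto scans/hocr would write one .hocr file per input file, where both directory names are placeholders.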
Try ocr-transform -h to get an overview:
Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]
Input formats:
- 'alto'
- 'hocr'
Output formats:
- 'alto2.0'
- 'alto2.1'
- 'hocr'
Saxon-HE 9.7.0.4J from Saxonica
Java version 1.7.0_95
Usage: see http://www.saxonica.com/html/documentation/using-xsl/commandline.html
Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -l -license -m -nogo -now -o -opt -or -outval -p -pack -quit -r -repeat -s -sa -scmin -strip -t -T -threads -TJ -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -xsltversion -y
Use -XYZ:? for details of option XYZ
Params:
param=value Set stylesheet string parameter
+param=filename Set stylesheet document parameter
?param=expression Set stylesheet parameter using XPath
!param=value Set serialization parameter
Select the Transform menu option. Choose a URL, an input format and an output format. Click Transform.
The stylesheets are installed in $PREFIX/share/ocr-schemas/xslt and can be used directly in your scripts and software. You will need an XSLT 2.0 capable processor, such as Saxon.
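As a rough illustration of direct use, the sketch below applies one of the installed stylesheets with Saxon's command line options -s (source), -xsl (stylesheet) and -o (output). The function name run_stylesheet, the JAVA override variable, the /usr/local fallback for PREFIX and the jar and stylesheet names passed in are all assumptions; list the xslt directory to see which stylesheet files your installation actually provides.

```shell
# Directory from the text above; falls back to /usr/local if PREFIX is unset.
XSLT_DIR="${PREFIX:-/usr/local}/share/ocr-schemas/xslt"

# Apply one installed stylesheet to a file using the Saxon CLI.
# run_stylesheet and the JAVA override are hypothetical helpers.
run_stylesheet() {
    saxon_jar=$1    # path to a Saxon-HE jar; depends on your setup
    stylesheet=$2   # file name inside $XSLT_DIR
    src=$3          # input document
    dst=$4          # output document
    "${JAVA:-java}" -jar "$saxon_jar" \
        "-s:$src" "-xsl:$XSLT_DIR/$stylesheet" "-o:$dst"
}
```

Stylesheet parameters can be appended in the param=value form shown in the Saxon help output above.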
From ╲ To | hOCR | ALTO | PAGEXML | FineReader |
---|---|---|---|---|
hOCR | - | ✅ | ✖️ | ✖️ |
ALTO | ✅ | - | ✖️ | ✖️ |
PAGEXML | ✖️ | ✖️ | - | ✖️ |
FineReader | ✖️ | ✖️ | ✖️ | - |
Usage: ocr-validate [-dh] <schema> <file>
For example, to validate an XML file against the ALTO 3.1 schema:
ocr-validate alto-3-1 myFile.alto
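To check several files in one go, the call can be wrapped in a small loop. This sketch assumes that ocr-validate exits with a non-zero status when a file does not validate, which the text above does not confirm; validate_all and the OCR_VALIDATE override variable are hypothetical names used only for illustration.

```shell
# Validate several files against one schema and report a result per file.
# validate_all and OCR_VALIDATE are hypothetical, not part of this project.
validate_all() {
    schema=$1
    shift
    failures=0
    for f in "$@"; do
        # Assumption: a non-zero exit status means the file is invalid.
        if "${OCR_VALIDATE:-ocr-validate}" "$schema" "$f"; then
            echo "ok      $f"
        else
            echo "invalid $f"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]    # succeed only if every file validated
}
```

A call such as validate_all alto-3-1 *.alto then returns success only when every file passed, which makes it usable in scripts and CI jobs.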
Select the Validate menu option. Choose a URL and a schema. Click Validate.
The XSD files are installed under $PREFIX/share/ocr-schemas/xsd.
| | hOCR | ALTO | PAGEXML | FineReader |
|---|---|---|---|---|
| Validation | ✖️ | ✅ | ✅ | ✅ |
The XSL stylesheets for hOCR-ALTO and ALTO-hOCR transformation are licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Projects included during the installation process (in ./vendor):
- Saxon HE 9.7, MPL
- ALTOXML schema, Open Source
- PAGE schemas, ?
- xsd-validator, Apache 2.0