GitHub - TransparencyToolkit/OCRServer: OCR server for hosted archiving service

This is the software for the Transparency Toolkit OCR server. It receives data from the document upload form, OCRs the documents, and saves the results.

Install the following packages:

graphicsmagick
poppler-data
poppler-utils
ghostscript
tesseract-ocr
pdftk
libreoffice
openjdk-8-jdk
openjdk-8-jre
libcurl3
libcurl3-gnutls
libcurl4-openssl-dev
libmagic-dev
unoconv
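
Once the packages are installed, a quick way to confirm that the command-line tools are visible is to look for them on PATH. The following is a minimal sketch and not part of the OCR server: the helper names (`which`, `missing_commands`) are hypothetical, and the command names are the ones these packages usually install (`gm` for graphicsmagick, `pdftotext` for poppler-utils, `gs` for ghostscript, `soffice` for libreoffice), which may differ on some distributions.

```ruby
# Hypothetical helper: return the full path of `cmd` if it is an executable
# file in one of the PATH directories, otherwise nil.
def which(cmd, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, cmd)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Assumed command names provided by the packages listed above.
REQUIRED_COMMANDS = %w[gm pdftotext gs tesseract pdftk soffice java unoconv].freeze

# Return the commands from `commands` that cannot be found on `path`.
def missing_commands(commands = REQUIRED_COMMANDS, path = ENV.fetch("PATH", ""))
  commands.reject { |cmd| which(cmd, path) }
end

if __FILE__ == $PROGRAM_NAME
  missing = missing_commands
  if missing.empty?
    puts "All expected commands were found on PATH."
  else
    puts "Not found on PATH: #{missing.join(', ')}"
  end
end
```

Library packages such as libmagic-dev and libcurl4-openssl-dev do not install a command, so this check does not cover them.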

Install the following gems:

mimemagic
docsplit
curb
ruby-filemagic
pry
mail
listen
rubyzip
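
To see which of these gems still need to be installed, something like the following sketch can be run with the same Ruby that will run the OCR server. It is not part of the repository; `missing_gems` is a hypothetical helper, and the list assumes the filemagic binding is the gem published as `ruby-filemagic`.

```ruby
require "rubygems"

# Gem names taken from the list above, as published on RubyGems.
REQUIRED_GEMS = %w[mimemagic docsplit curb ruby-filemagic pry mail listen rubyzip].freeze

# Return the names from `names` that have no installed version.
def missing_gems(names = REQUIRED_GEMS)
  names.select { |name| Gem::Specification.find_all_by_name(name).empty? }
end

if __FILE__ == $PROGRAM_NAME
  missing = missing_gems
  if missing.empty?
    puts "All listed gems are installed."
  else
    puts "Install with: gem install #{missing.join(' ')}"
  end
end
```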

Install Apache Tika server by downloading the .jar from https://tika.apache.org/download.html
Start Tika by running: java -jar tika-server-1.18.jar
Optional: install ABBYY. It is not free software, but it gives higher-quality OCR for some file types. Images, image-style PDFs, and office documents that fail OCR with Tika are processed with ABBYY when it is installed. A license for the command-line version can be purchased at https://www.ocr4linux.com/en:pricing:start. The OCR server falls back to Tesseract if ABBYY is not installed.
Set up and start DocUpload (https://github.com/TransparencyToolkit/DocUpload). Documents need to be uploaded for the rest to work.
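
Before uploading documents, it can be useful to confirm that the Tika server started earlier is actually answering requests. The sketch below assumes Tika is listening on its default address, localhost:9998, and queries its `/version` endpoint; the helper names (`tika_version_uri`, `tika_version`) are hypothetical and not part of the OCR server.

```ruby
require "net/http"
require "uri"

# Assumption: Tika server runs on its default host and port.
def tika_version_uri(host = "localhost", port = 9998)
  URI::HTTP.build(host: host, port: port, path: "/version")
end

# Return the version string reported by Tika, or nil if the server
# cannot be reached or does not answer with a success status.
def tika_version(uri = tika_version_uri, timeout: 2)
  http = Net::HTTP.new(uri.host, uri.port)
  http.open_timeout = timeout
  http.read_timeout = timeout
  response = http.get(uri.path)
  response.is_a?(Net::HTTPSuccess) ? response.body.strip : nil
rescue SystemCallError, SocketError, IOError, Timeout::Error
  nil
end

if __FILE__ == $PROGRAM_NAME
  version = tika_version
  puts(version ? "Tika is running: #{version}" : "Tika did not respond on #{tika_version_uri}")
end
```

If Tika was started with a different host or port, pass them to `tika_version_uri` accordingly.
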
Set the following environment variables:

OCR_IN_PATH: The path for documents and metadata to input
OCR_OUT_PATH: The path for documents to output
PROJECT_INDEX: The index name in Elasticsearch

In this directory (for the OCRServer), run: ruby run_ocr.rb
It will then listen for new documents in OCR_IN_PATH. This must be started BEFORE documents are uploaded.

Note: To run on an existing directory, set inotify_works = false in input_output/load_files.rb
