Tesseract OCR for PHP

A wrapper to work with Tesseract OCR inside PHP.

Installation

First, make sure you have Tesseract OCR (v3.03 or greater) installed.

As a Composer dependency

{
    "require": {
        "thiagoalessio/tesseract_ocr": "1.2.0"
    }
}

Usage

Basic usage

Given the following image:

And the following code:

echo (new TesseractOCR('text.png'))
    ->run();

Produces:

The quick brown fox
jumps over the lazy
dog.

Other languages

Given the following image:

And the following code:

echo (new TesseractOCR('german.png'))
    ->run();

Produces griiﬁen.

Which is not good, but defining a language:

echo (new TesseractOCR('german.png'))
    ->lang('deu')
    ->run();

Produces grüßen.

Multiple languages

Given the following image:

And the following code:

echo (new TesseractOCR('multi-languages.png'))
    ->lang('eng', 'jpn', 'por')
    ->run();

Produces I eat 寿司 de maçã.

Inducing recognition

Given the following image:

And the following code:

echo (new TesseractOCR('8055.png'))
    ->whitelist(range('A', 'Z'))
    ->run();

Produces BOSS.

Breaking CAPTCHAs

Yes, I know some of you might want to use this library for the noble purpose of breaking CAPTCHAs, so please take a look at this comment:

thiagoalessio#91 (comment)

API

executable

Define a custom location for the tesseract executable, if for any reason it is not present in the $PATH.

echo (new TesseractOCR('img.png'))
    ->executable('/path/to/tesseract')
    ->run();
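Before hard-coding a location, it can help to check whether tesseract is already reachable through the $PATH. The following is only a sketch: findExecutable is a hypothetical helper that is not part of this library, it assumes PHP 7.1 or newer, and on Windows the file name would need its .exe suffix.

```php
<?php
// Hypothetical helper (not part of the library): look for an executable
// named $name in each directory of a PATH-style string.
function findExecutable(string $name, string $pathEnv): ?string
{
    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null; // nothing found in any PATH directory
}

$found = findExecutable('tesseract', (string) getenv('PATH'));
echo $found ?? 'tesseract not found in $PATH; pass its location to ->executable()', PHP_EOL;
```

If the helper returns null, the location has to be supplied explicitly through ->executable().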

tessdataDir

Specify a custom location for the tessdata directory.

echo (new TesseractOCR('img.png'))
    ->tessdataDir('/path')
    ->run();

userWords

Specify the location of a user words file.

This is a plain text file containing a list of words that you want tesseract to treat as normal dictionary words.

Useful when dealing with contents that contain technical terminology, jargon, etc.

$ cat /path/to/user-words.txt
foo
bar

echo (new TesseractOCR('img.png'))
    ->userWords('/path/to/user-words.txt')
    ->run();
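If the word list lives in your application rather than in a file, it can be written out first. This is a minimal sketch, assuming one word per line as in the cat output above; the $words array and the temporary file name prefix are illustrative only.

```php
<?php
// Hypothetical word list; replace with your own domain terms.
$words = ['foo', 'bar'];

// Write one word per line to a temporary file.
$file = tempnam(sys_get_temp_dir(), 'user-words-');
file_put_contents($file, implode("\n", $words) . "\n");

// Show what was written.
echo file_get_contents($file);

// The path could then be handed to the wrapper:
// echo (new TesseractOCR('img.png'))->userWords($file)->run();
```

Remember to remove the temporary file (for example with unlink) once recognition has finished.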

userPatterns

Specify the location of a user patterns file.

If the contents you are dealing with have known patterns, this option can considerably improve tesseract's recognition accuracy.

$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com

echo (new TesseractOCR('img.png'))
    ->userPatterns('/path/to/user-patterns.txt')
    ->run();

lang

Define one or more languages to be used during the recognition. A complete list of available languages can be found here.

Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.

echo (new TesseractOCR('img.png'))
    ->lang('lang1', 'lang2', 'lang3')
    ->run();
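On the command line, tesseract takes several languages as a single -l argument joined with +, for example -l eng+jpn+por. How this wrapper assembles that argument internally is an assumption here; buildLangFlag below is a hypothetical helper that only illustrates the idea, and it assumes PHP 7.0 or newer.

```php
<?php
// Hypothetical helper, not the library's code: join language codes the way
// the tesseract command line expects them (-l lang1+lang2+...).
function buildLangFlag(string ...$langs): string
{
    // No languages given: emit nothing and let tesseract use its default.
    return $langs === [] ? '' : '-l ' . implode('+', $langs);
}

echo buildLangFlag('eng', 'jpn', 'por'), PHP_EOL; // prints "-l eng+jpn+por"
```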

psm

Specify the Page Segmentation Mode, which instructs tesseract how to interpret the given image.

Possible psm values are:

Value	Description
0	Orientation and script detection (OSD) only.
1	Automatic page segmentation with OSD.
2	Automatic page segmentation, but no OSD, or OCR.
3	Fully automatic page segmentation, but no OSD. (Default)
4	Assume a single column of text of variable sizes.
5	Assume a single uniform block of vertically aligned text.
6	Assume a single uniform block of text.
7	Treat the image as a single text line.
8	Treat the image as a single word.
9	Treat the image as a single word in a circle.
10	Treat the image as a single character.

echo (new TesseractOCR('img.png'))
    ->psm(6)
    ->run();
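Bare numbers like 6 are easy to mistype, so some projects give them names. The constants and the assertValidPsm function below are hypothetical and do not exist in this library. They are a sketch of one way to keep psm values readable and within the 0 to 10 range listed above.

```php
<?php
// Hypothetical names for a few of the values in the table above.
const PSM_AUTO = 3;
const PSM_SINGLE_BLOCK = 6;
const PSM_SINGLE_LINE = 7;
const PSM_SINGLE_WORD = 8;
const PSM_SINGLE_CHAR = 10;

// Hypothetical guard: reject values outside the documented 0..10 range.
function assertValidPsm(int $psm): int
{
    if ($psm < 0 || $psm > 10) {
        throw new InvalidArgumentException("psm must be between 0 and 10, got $psm");
    }
    return $psm;
}

echo assertValidPsm(PSM_SINGLE_LINE), PHP_EOL; // prints 7

// With the wrapper this might read:
// echo (new TesseractOCR('img.png'))->psm(assertValidPsm(PSM_SINGLE_BLOCK))->run();
```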

whitelist

This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').

echo (new TesseractOCR('img.png'))
    ->whitelist(range('a', 'z'), range(0, 9), '-_@')
    ->run();
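The call above mixes arrays (from range) with a plain string. Presumably the arguments end up as one flat string of allowed characters; the exact implementation inside the library is not shown here, so flattenWhitelist below is a hypothetical stand-in that sketches that flattening.

```php
<?php
// Hypothetical helper, not the library's code: flatten a mix of arrays and
// strings into a single string of allowed characters.
function flattenWhitelist(...$parts): string
{
    $out = '';
    foreach ($parts as $part) {
        // Arrays (e.g. from range()) are joined; scalars are appended as-is.
        $out .= is_array($part) ? implode('', $part) : (string) $part;
    }
    return $out;
}

echo flattenWhitelist(range('a', 'z'), range(0, 9), '-_@'), PHP_EOL;
// prints "abcdefghijklmnopqrstuvwxyz0123456789-_@"
```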

Other options

Tesseract offers incredible control to the user through its 600+ configuration options. You can see the complete list by running the following command:

$ tesseract --print-parameters
Tesseract parameters:
... long list with all parameters ...

echo (new TesseractOCR('img.png'))
    ->config('config_var', 'value')
    ->config('other_config_var', 'other value')
    ->run();

// or better yet, just camel case any of the options:

echo (new TesseractOCR('img.png'))
    ->configVar('value')
    ->otherConfigVar('other value')
    ->run();
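The camel case form implies a mapping back to tesseract's snake_case parameter names, so that configVar corresponds to config_var. The sketch below shows one way such a mapping could work; camelToSnake is a hypothetical function, and the library's actual conversion may differ.

```php
<?php
// Hypothetical helper, not the library's code: turn a camelCase method name
// into the snake_case parameter name tesseract expects.
function camelToSnake(string $name): string
{
    // Put an underscore before every uppercase letter except a leading one,
    // then lowercase the whole string.
    return strtolower(preg_replace('/(?<!^)[A-Z]/', '_$0', $name));
}

echo camelToSnake('otherConfigVar'), PHP_EOL;        // prints "other_config_var"
echo camelToSnake('tesseditCharWhitelist'), PHP_EOL; // prints "tessedit_char_whitelist"
```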

Where to get help

#tesseract-ocr-for-php on freenode IRC.

License

Apache License 2.0.
