A wrapper to work with Tesseract OCR inside PHP.
First, make sure Tesseract OCR (v3.03 or greater) is installed.
As a Composer dependency
{
    "require": {
        "thiagoalessio/tesseract_ocr": "1.2.0"
    }
}
Given the following image:
And the following code:
echo (new TesseractOCR('text.png'))
->run();
Produces:
The quick brown fox
jumps over the lazy
dog.
Given the following image:
And the following code:
echo (new TesseractOCR('german.png'))
->run();
Produces griifien.
Which is not good, but defining a language:
echo (new TesseractOCR('german.png'))
->lang('deu')
->run();
Produces grüßen.
Given the following image:
And the following code:
echo (new TesseractOCR('multi-languages.png'))
->lang('eng', 'jpn', 'por')
->run();
Produces I eat 寿司 de maçã.
Given the following image:
And the following code:
echo (new TesseractOCR('8055.png'))
->whitelist(range('A', 'Z'))
->run();
Produces BOSS.
Yes, I know some of you might want to use this library for the noble purpose of breaking CAPTCHAs, so please take a look at this comment:
Define a custom location for the tesseract executable, if for any reason it is not present in the $PATH.
echo (new TesseractOCR('img.png'))
->executable('/path/to/tesseract')
->run();
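Before falling back to a custom location, it can help to check whether tesseract is reachable through $PATH at all. The following is a minimal sketch of such a check; `findExecutable` is a hypothetical helper written for this example and is not part of the library.

```php
<?php
// Hypothetical helper (not part of the library): searches the directories of a
// PATH-style string for an executable file and returns its full path, or null.
function findExecutable(string $name, ?string $pathEnv = null): ?string
{
    // Default to the current process's PATH; getenv() returns false when unset.
    $pathEnv = $pathEnv ?? (string) getenv('PATH');
    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null;
}

$found = findExecutable('tesseract');
echo $found === null
    ? "tesseract not found in PATH; pass its location to ->executable()\n"
    : "found: $found\n";
```

When the helper returns null, the full path has to be supplied explicitly through ->executable().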
Specify a custom location for the tessdata directory.
echo (new TesseractOCR('img.png'))
->tessdataDir('/path')
->run();
Specify the location of the user words file.
This is a plain text file containing a list of words that you want tesseract to treat as normal dictionary words.
Useful when dealing with content that contains technical terminology, jargon, etc.
$ cat /path/to/user-words.txt
foo
bar
echo (new TesseractOCR('img.png'))
->userWords('/path/to/user-words.txt')
->run();
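If the word list lives in your application rather than in a file, it can be written out before running the recognition. This is only a sketch: `writeUserWords` is a hypothetical helper, and it assumes the one-word-per-line layout shown in the listing above.

```php
<?php
// Hypothetical helper (not part of the library): writes one word per line,
// matching the plain-text layout of the user-words.txt listing above.
function writeUserWords(array $words, string $path): string
{
    // Trim entries, then drop blanks and duplicates to keep the list clean.
    $clean = array_unique(array_filter(array_map('trim', $words), 'strlen'));
    file_put_contents($path, implode("\n", $clean) . "\n");
    return $path;
}

$path = writeUserWords(['foo', 'bar', ' foo ', ''], sys_get_temp_dir() . '/user-words.txt');
echo file_get_contents($path);
```

The returned path can then be handed to ->userWords($path).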
Specify the location of the user patterns file.
If the content you are dealing with follows known patterns, this option can greatly improve tesseract's recognition accuracy.
$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com
echo (new TesseractOCR('img.png'))
->userPatterns('/path/to/user-patterns.txt')
->run();
Define one or more languages to be used during the recognition. A complete list of available languages can be found here.
Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.
echo (new TesseractOCR('img.png'))
->lang('lang1', 'lang2', 'lang3')
->run();
Specify the Page Segmentation Mode, which instructs tesseract how to interpret the given image.
Possible psm values are:
Value | Description |
---|---|
0 | Orientation and script detection (OSD) only. |
1 | Automatic page segmentation with OSD. |
2 | Automatic page segmentation, but no OSD, or OCR. |
3 | Fully automatic page segmentation, but no OSD. (Default) |
4 | Assume a single column of text of variable sizes. |
5 | Assume a single uniform block of vertically aligned text. |
6 | Assume a single uniform block of text. |
7 | Treat the image as a single text line. |
8 | Treat the image as a single word. |
9 | Treat the image as a single word in a circle. |
10 | Treat the image as a single character. |
echo (new TesseractOCR('img.png'))
->psm(6)
->run();
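Bare numbers such as 6 are easy to mistype, so one option is to name the values you use and reject anything outside the table. The constants and the `checkedPsm` function below are hypothetical and not defined by the library; the 0 to 10 range is taken from the table above.

```php
<?php
// Hypothetical constants (not defined by the library) naming some of the
// psm values from the table above.
const PSM_AUTO = 3;          // Fully automatic page segmentation, but no OSD (default)
const PSM_SINGLE_BLOCK = 6;  // Single uniform block of text
const PSM_SINGLE_LINE = 7;   // Single text line
const PSM_SINGLE_WORD = 8;   // Single word

// Rejects values outside the 0-10 range listed in the table.
function checkedPsm(int $psm): int
{
    if ($psm < 0 || $psm > 10) {
        throw new InvalidArgumentException("Unsupported psm value: $psm");
    }
    return $psm;
}

echo checkedPsm(PSM_SINGLE_LINE), PHP_EOL; // 7
```

The checked value can be passed straight through, for example ->psm(checkedPsm(PSM_SINGLE_BLOCK)).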
whitelist() is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').
echo (new TesseractOCR('img.png'))
->whitelist(range('a', 'z'), range(0, 9), '-_@')
->run();
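To picture the string that ends up as the value of tessedit_char_whitelist, the arguments can be flattened by hand. The sketch below assumes the arrays and strings are simply concatenated in order; `flattenWhitelist` is a hypothetical helper, and the library's own implementation may differ.

```php
<?php
// Hypothetical helper, for illustration only: flattens a mix of arrays and
// strings into one character string, mirroring the arguments given to whitelist().
function flattenWhitelist(...$parts): string
{
    $chars = '';
    foreach ($parts as $part) {
        // range() returns an array; plain strings are appended unchanged.
        $chars .= is_array($part) ? implode('', $part) : (string) $part;
    }
    return $chars;
}

echo flattenWhitelist(range('a', 'z'), range(0, 9), '-_@'), PHP_EOL;
// prints abcdefghijklmnopqrstuvwxyz0123456789-_@
```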
Tesseract offers incredible control to the user through its 600+ configuration options. You can see the complete list by running the following command:
$ tesseract --print-parameters
Tesseract parameters:
... long list with all parameters ...
echo (new TesseractOCR('img.png'))
->config('config_var', 'value')
->config('other_config_var', 'other value')
->run();
// or better yet, just camel case any of the options:
echo (new TesseractOCR('img.png'))
->configVar('value')
->otherConfigVar('other value')
->run();
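The camel case form corresponds to tesseract's underscore-separated parameter names. A rough sketch of that name conversion follows; `camelToSnake` is a hypothetical function written for this example, not the library's internal code.

```php
<?php
// Hypothetical helper: converts a camelCase method name into the
// snake_case form used by tesseract's parameter names.
function camelToSnake(string $name): string
{
    // Insert an underscore before each uppercase letter (except a leading one),
    // then lowercase the whole string.
    return strtolower(preg_replace('/(?<!^)[A-Z]/', '_$0', $name));
}

echo camelToSnake('tesseditCharWhitelist'), PHP_EOL; // tessedit_char_whitelist
echo camelToSnake('otherConfigVar'), PHP_EOL;        // other_config_var
```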
#tesseract-ocr-for-php on freenode IRC.