Skip to content

A comprehensive tutorial for OCR in python using Tesseract-OCR and OpenCV

Notifications You must be signed in to change notification settings

NanoNets/ocr-with-tesseract

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

OCR in Python with OpenCV, Tesseract and Pytesseract

Companion tutorial blog post can be found here

Preprocessing of images using OpenCV

Basic functions for different preprocessing methods

  • grayscaling
  • thresholding
  • dilating
  • eroding
  • opening
  • canny edge detection
  • noise removal
  • deskwing
  • template matching.

Different methods can come in handy with different kinds of images.

Bounding box information using Pytesseract

While running and image through the tesseract OCR engine, pytesseract allows you to get bounding box imformation

  • on a character level
  • on a word level
  • based on a regex template

We will see how to obtain all of them.

Page Segmentation Modes

There are several ways a page of text can be analysed. The tesseract api provides several page segmentation modes if you want to run OCR on only a small region or in different orientations, etc.

Here's a list of the supported page segmentation modes by tesseract. Check it out here

0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

To change your page segmentation mode, change the --psm argument in your custom config string to any of the above mentioned mode codes.

Playing around with the config

By making minor changes in the config file you can

  • specify language
  • detect only digits
  • whitelist characters
  • blacklist characters
  • work with multiple language

About

A comprehensive tutorial for OCR in python using Tesseract-OCR and OpenCV

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published