This repository contains scripts for installing and training Tesseract 4+.
The project has the following structure:
.
|-- README.md
|-- fonts // This is where you place the font
|-- install_tesseract.sh // Script for installing Tesseract
|-- ouput // Checkpoints and model saved here
|-- train // Training Data is placed here
`-- training.sh // Script for Training Tesseract
- Instance type: t2.large
- Storage: 40GB
- Operating System: Ubuntu 18
- Run ./install_tesseract.sh
- Place the font in the fonts folder
- Run ./training.sh
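Since training depends on a font being present, a small pre-flight check can catch an empty fonts folder before a long run starts. The sketch below is not part of the repository: `has_fonts` is a hypothetical helper, and it assumes the font is a .ttf or .otf file placed directly in the fonts folder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository).
# Succeeds only if the given directory holds at least one .ttf or .otf file.
# Assumption: fonts sit directly in the folder, not in subfolders.
has_fonts() {
    local dir="$1"
    [ -d "$dir" ] || return 1
    # List matching files and keep the first one; an empty result means no font.
    [ -n "$(find "$dir" -maxdepth 1 -type f \( -iname '*.ttf' -o -iname '*.otf' \) | head -n 1)" ]
}

# Example use from the repository root (commented out so that sourcing this
# file does not start training):
#   has_fonts fonts && ./training.sh
```

If the check fails, place the font in the fonts folder and run it again.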
Two parameters can be tuned to improve the performance of the model:
- MAX_PAGES: the number of pages generated for training the model.
- NUM_ITERATIONS: the number of iterations the fine-tuning process runs for.
Both parameters are set in the training.sh file.
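To make the role of the two parameters more concrete, the sketch below shows where values like these usually end up in a Tesseract 4 training pipeline. It is an illustration under assumptions, not a copy of training.sh: the default values are made up, `training_flags` is a hypothetical helper, and whether training.sh passes the variables to `text2image --max_pages` and `lstmtraining --max_iterations` in this way is assumed rather than confirmed.

```shell
#!/usr/bin/env bash
# Illustrative defaults only; the real values are whatever training.sh sets.
MAX_PAGES="${MAX_PAGES:-10}"
NUM_ITERATIONS="${NUM_ITERATIONS:-400}"

# Hypothetical helper: print the flags these variables would typically map to.
# text2image accepts --max_pages and lstmtraining accepts --max_iterations;
# the mapping from training.sh to these flags is an assumption.
training_flags() {
    echo "text2image --max_pages=${MAX_PAGES}"
    echo "lstmtraining --max_iterations=${NUM_ITERATIONS}"
}
```

Calling `training_flags` prints the two flags without running anything. Raising either value generally means longer runs, so it may be worth changing one parameter at a time.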