Description
expected
write the value of the languages parameter (-l) to the hocr output file
<html ...>
<head>
<!-- ... -->
<meta name='ocr-languages' content='deu+eng+rus'/>
</head>
why
this is useful for OCR proofreading with hocr-editors
to run tesseract again on selected regions with the original tesseract arguments
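a rough sketch of how an hocr editor could use this, assuming the proposed `ocr-languages` meta tag existed (it does not today, that is what this issue asks for). `read_ocr_languages` and `rerun_command` are made-up helper names, and the command is only built, not run:

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collect <meta name=... content=...> pairs from an hocr document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags like <meta ... />.
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name")
            if name:
                self.meta[name] = attrs.get("content") or ""


def read_ocr_languages(hocr_text):
    """Return the languages from the (proposed) ocr-languages meta tag."""
    reader = MetaReader()
    reader.feed(hocr_text)
    reader.close()
    value = reader.meta.get("ocr-languages")
    return value.split("+") if value else []


def rerun_command(image_path, languages):
    """Build the argv for running tesseract again on a cropped region."""
    return ["tesseract", image_path, "-", "-l", "+".join(languages), "hocr"]


hocr = "<html><head><meta name='ocr-languages' content='deu+eng+rus'/></head></html>"
languages = read_ocr_languages(hocr)
print(rerun_command("region.png", languages))
```

the language order is kept exactly as written in the meta tag, so the editor never has to guess it.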
alternative
write all tesseract arguments to hocr output file
but this is harder to parse
<html ...>
<head>
<!-- ... -->
<meta name='ocr-arguments' content='tesseract src.jpg - -l deu+eng+rus hocr'/>
</head>
this would also preserve other CLI arguments like --oem 1 --psm 6 --tessdata-dir tessdata_best (assuming tesseract is run again in the same workdir)
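to show what "harder to parse" means in practice, here is a minimal sketch that splits the content of the proposed `ocr-arguments` meta tag with Python's `shlex` and looks up single options. `option_value` is a hypothetical helper, and it assumes every option is written as a separate flag followed by its value:

```python
import shlex


def option_value(argv, flag):
    """Return the token that follows flag, or None if flag is absent."""
    for i, arg in enumerate(argv[:-1]):
        if arg == flag:
            return argv[i + 1]
    return None


# Content of the proposed ocr-arguments meta tag (example value).
content = "tesseract src.jpg - -l deu+eng+rus --oem 1 --psm 6 --tessdata-dir tessdata_best hocr"
argv = shlex.split(content)

print(option_value(argv, "-l"))
print(option_value(argv, "--psm"))
print(option_value(argv, "--tessdata-dir"))
```

this only works if the arguments were written with shell-style quoting, which is one more thing the hocr writer would have to get right.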
workaround
guess the languages parameter value from the lang='...' attributes in the hocr output file
parse the main language from
<p class='ocr_par' id='[^']+' lang='([^']+)'
and parse extra languages from
<span class='ocrx_word' id='[^']+' title='[^']+' lang='([^']+)'
but this workaround can fail and return a wrong order of languages
(my impression is that the order of languages does matter for tesseract)
parse
yeah i know, "parsing" xml with regex is bad
this is just an example, in a real app i would use a proper xml parser
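for reference, the same workaround without regex could look roughly like this. it is a sketch that uses Python's `html.parser` in place of a full xml parser, and `LangCollector` and `guess_languages` are made-up names. languages are returned in first-seen order (paragraphs first, then words), which is not guaranteed to match the order given to `-l`:

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect lang attributes from ocr_par and ocrx_word elements."""

    def __init__(self):
        super().__init__()
        self.par_langs = []   # languages seen on ocr_par, first-seen order
        self.word_langs = []  # languages seen on ocrx_word, first-seen order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        lang = attrs.get("lang")
        if not lang:
            return
        classes = (attrs.get("class") or "").split()
        if "ocr_par" in classes and lang not in self.par_langs:
            self.par_langs.append(lang)
        elif "ocrx_word" in classes and lang not in self.word_langs:
            self.word_langs.append(lang)


def guess_languages(hocr_text):
    """Guess the -l value: paragraph languages first, then word languages."""
    collector = LangCollector()
    collector.feed(hocr_text)
    collector.close()
    ordered = []
    for lang in collector.par_langs + collector.word_langs:
        if lang not in ordered:
            ordered.append(lang)
    return "+".join(ordered)


sample = (
    "<p class='ocr_par' id='par_1_1' lang='deu'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 0 0 1 1' lang='eng'>hello</span>"
    "<span class='ocrx_word' id='word_1_2' title='bbox 1 0 2 1' lang='rus'>мир</span>"
    "</p>"
)
print(guess_languages(sample))
```

the result still depends on which languages happen to show up first in the text, so it has the same ordering problem as the regex version.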
example
tesseract was called like
tesseract src.jpg - -l deu+eng+rus hocr >dst.hocr
then the main language is deu and the extra languages are eng and rus
keywords
- get tesseract languages parameter value from hocr file