write languages parameter value to hocr output file #4455

@milahu

expected

write languages parameter value to hocr output file

<html ...>
 <head>
  <!-- ... -->
  <meta name='ocr-languages' content='deu+eng+rus'/>
 </head>
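
a minimal sketch of how an hocr editor could read this value back, assuming the proposed `ocr-languages` meta tag exists (tesseract does not write it today). the names `MetaReader` and `read_ocr_languages` are hypothetical, only python's standard `html.parser` is used.

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collect <meta name='...' content='...'> pairs from an hOCR document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <meta .../> also end up here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "name" in attributes and "content" in attributes:
            self.meta[attributes["name"]] = attributes["content"]


def read_ocr_languages(hocr_text):
    """Return the languages in their original order, or None if the tag is missing.

    Assumption: the tag is named 'ocr-languages' as proposed in this issue.
    """
    reader = MetaReader()
    reader.feed(hocr_text)
    value = reader.meta.get("ocr-languages")
    return value.split("+") if value else None
```

joining the returned list with `+` gives back the original `-l` argument.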

why

this is useful for OCR proofreading with hocr editors:
the editor can run tesseract again on selected regions with the original tesseract arguments

alternative

write all tesseract arguments to hocr output file
but this is harder to parse

<html ...>
 <head>
  <!-- ... -->
  <meta name='ocr-arguments' content='tesseract src.jpg - -l deu+eng+rus hocr'/>
 </head>

this would also preserve CLI arguments like
--oem 1 --psm 6 --tessdata-dir tessdata_best
but relative paths like tessdata_best only work if tesseract is run again in the same workdir
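
to show what "harder to parse" means, here is a rough sketch of how the proposed `ocr-arguments` value could be split up again. the meta tag itself is only a proposal, and `parse_ocr_arguments` and `VALUE_OPTIONS` are hypothetical names. only the options mentioned in this issue are handled, everything else is kept as a positional argument.

```python
import shlex

# Options from this issue that take exactly one value.
# Assumption: any other token is treated as positional.
VALUE_OPTIONS = {"-l", "--oem", "--psm", "--tessdata-dir"}


def parse_ocr_arguments(content):
    """Split a recorded tesseract command line into options and positionals."""
    tokens = shlex.split(content)  # respects shell-style quoting
    options = {}
    positional = []
    i = 1  # skip the program name
    while i < len(tokens):
        token = tokens[i]
        if token in VALUE_OPTIONS and i + 1 < len(tokens):
            options[token] = tokens[i + 1]
            i += 2
        else:
            positional.append(token)
            i += 1
    return options, positional
```

for the example above, `options["-l"]` holds `deu+eng+rus` and the positional arguments are the input image, the output base and the `hocr` config name.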

workaround

guess the languages parameter value
from lang='...' attributes in the hocr output file

parse the main language from
<p class='ocr_par' id='[^']+' lang='([^']+)'
and parse extra languages from
<span class='ocrx_word' id='[^']+' title='[^']+' lang='([^']+)'

but this workaround can fail and return the languages in the wrong order
(my impression is that the order of languages matters for tesseract)

parse

yeah i know, "parsing" xml with regex is bad
this is just an example, in a real app i would use a proper xml parser

example

tesseract was called like
tesseract src.jpg - -l deu+eng+rus hocr >dst.hocr
then the main language is deu
and the extra languages are eng and rus
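
the workaround could look roughly like this with a real parser instead of regex. this is only a sketch: `LangCollector` and `guess_languages` are hypothetical names, and treating the first `ocr_par` language as the main language is my assumption, not something tesseract guarantees.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect lang attributes from ocr_par and ocrx_word elements."""

    def __init__(self):
        super().__init__()
        self.par_langs = []
        self.word_langs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        lang = attributes.get("lang")
        if not lang:
            return
        classes = (attributes.get("class") or "").split()
        if "ocr_par" in classes:
            self.par_langs.append(lang)
        elif "ocrx_word" in classes:
            self.word_langs.append(lang)


def guess_languages(hocr_text):
    """Return unique languages: paragraph languages first, then word languages.

    The result follows document order, which can differ from the order
    that was passed to tesseract with -l.
    """
    collector = LangCollector()
    collector.feed(hocr_text)
    ordered = []
    for lang in collector.par_langs + collector.word_langs:
        if lang not in ordered:
            ordered.append(lang)
    return ordered
```

if a `rus` word appears before the first `eng` word, the result is `deu`, `rus`, `eng` and not the original `deu+eng+rus`, which is exactly the failure described above.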

keywords

  • get tesseract languages parameter value from hocr file
