Commit

Merge pull request #9 from StabRise/5-add-options-for-tesseract-ocr
Added ocrConfig option
mykolamelnykml authored Nov 29, 2024
2 parents 93ad20c + 9fe92b4 commit b368f1b
Showing 6 changed files with 1,613 additions and 4 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -80,6 +80,7 @@ artifactId: spark-pdf_2.12
- `resolution`: Resolution for rendering a PDF page to an image. Default: "300" dpi.
- `pagePerPartition`: Number of pages per partition in the Spark DataFrame. Default: "5".
- `reader`: Supported values: `pdfBox` (based on the PDFBox Java library) and `gs` (based on Ghostscript, which must be installed on the system).
- `ocrConfig`: Tesseract OCR configuration. Default: "psm=3". For more information see [Tesseract OCR Params](TesseractParams.md)
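
The options listed above could be assembled from PySpark roughly as follows. This is a minimal sketch, not the project's own code: `pdf_reader_options` and `read_pdf` are hypothetical helpers, the data source name `"pdf"` is an assumption, and the override value `"psm=6"` is only an illustration modelled on the documented default `"psm=3"`. Only the option names, the stated defaults, and the two `reader` values come from the list above.

```python
# Defaults as stated in the option list above; option values are passed as strings.
DEFAULTS = {
    "resolution": "300",
    "pagePerPartition": "5",
    "ocrConfig": "psm=3",
}

# The two reader backends named in the option list.
SUPPORTED_READERS = {"pdfBox", "gs"}


def pdf_reader_options(**overrides):
    """Merge caller overrides into the documented defaults (hypothetical helper)."""
    options = {**DEFAULTS, **{key: str(value) for key, value in overrides.items()}}
    reader = options.get("reader")
    if reader is not None and reader not in SUPPORTED_READERS:
        raise ValueError(f"unsupported reader: {reader!r}")
    return options


def read_pdf(spark, path, **overrides):
    """Load PDF files with an existing SparkSession.

    Assumption: the data source is registered under the format name "pdf".
    """
    return spark.read.format("pdf").options(**pdf_reader_options(**overrides)).load(path)
```

With a running session, a call such as `read_pdf(spark, "docs/*.pdf", reader="pdfBox", ocrConfig="psm=6")` would pass the merged options to the reader; the path here is a placeholder.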

## Output Columns in the DataFrame:

