An Open-Source Library for Processing Documents in Apache Spark.
Source Code: https://github.com/StabRise/ScaleDP
Quickstart: 1.QuickStart.ipynb
Tutorials: https://github.com/StabRise/ScaleDP-Tutorials
ScaleDP is a library that lets you process documents using Apache Spark. Discover pre-trained models for your projects or play with the thousands of machine learning apps hosted on the Hugging Face Hub.
- Load PDF documents/Images to the Spark DataFrame
- Extract text from PDF documents/Images
- Extract images from PDF documents
- Create document processing pipelines
- OCR Images/PDF documents using various OCR engines
- OCR Images/PDF documents using Vision LLM models
- Object detection on images
- Text detection on images
- Extract data from the image using Vision LLM models
- Extract data from the text/images using LLM models
- Extract data using the DSPy framework
- Extract data from the text/images using NLP models from the Hugging Face Hub
- Visualize results
- Support for various open-source OCR engines: Tesseract OCR, Tesseract OCR CLI, Easy OCR, Surya OCR, DocTR
Prerequisites:
- Python 3.10 or higher
- Apache Spark 3.5 or higher
- Java 8
- Tesseract 4.0 or higher
Install the ScaleDP package with pip:
pip install scaledp
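To confirm that the listed requirements (Python 3.10 or higher, Java, Tesseract) are present on the machine, a small check along these lines may help. This is a minimal sketch using only the Python standard library; `check_prerequisites` is a hypothetical helper and not part of ScaleDP, and it only checks that the `java` and `tesseract` executables are on the PATH, not which versions they are.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), tools=("java", "tesseract")):
    """Return a dict mapping each prerequisite name to True or False.

    Hypothetical helper, not part of ScaleDP. It compares the running
    Python version against min_python and looks each tool up on the PATH.
    It does not verify the Java, Spark, or Tesseract versions.
    """
    status = {"python": sys.version_info[:2] >= tuple(min_python)}
    for tool in tools:
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any entry reported as MISSING should be installed before starting a ScaleDP session.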
Build the Docker image:
docker build -t scaledp .
Run container:
docker run -p 8888:8888 scaledp:latest
Open Jupyter Notebook in your browser:
http://localhost:8888
Start a Spark session with ScaleDP:
from scaledp import *
spark = ScaleDPSession()
spark
Read example image file:
image_example = files('resources/images/Invoice.png')
df = spark.read.format("binaryFile") \
    .load(image_example)
df.show_image("content")
Define a pipeline that extracts text from the image and runs NER:
pipeline = PipelineModel(stages=[
    DataToImage(inputCol="content", outputCol="image"),
    TesseractOcr(inputCol="image", outputCol="text", psm=PSM.AUTO, keepInputData=True),
    Ner(model="obi/deid_bert_i2b2", inputCol="text", outputCol="ner", keepInputData=True),
    ImageDrawBoxes(inputCols=["image", "ner"], outputCol="image_with_boxes", lineWidth=3,
                   padding=5, displayDataList=['entity_group'])
])
result = pipeline.transform(df).cache()
result.show_text("text")
Show NER results:
result.show_ner(limit=20)
Output:
+------------+-------------------+----------+-----+---+--------------------+
|entity_group| score| word|start|end| boxes|
+------------+-------------------+----------+-----+---+--------------------+
| HOSP| 0.991257905960083| Hospital| 0| 8|[{Hospital:, 0.94...|
| LOC| 0.999171257019043| Dutton| 10| 16|[{Dutton,, 0.9609...|
| LOC| 0.9992585778236389| MI| 18| 20|[{MI, 0.93335297,...|
| ID| 0.6838774085044861| 26| 29| 31|[{26-123123, 0.90...|
| PHONE| 0.4669836759567261| -| 31| 32|[{26-123123, 0.90...|
| PHONE| 0.7790696024894714| 123123| 32| 38|[{26-123123, 0.90...|
| HOSP|0.37445762753486633| HOPE| 39| 43|[{HOPE, 0.9525460...|
| HOSP| 0.9503226280212402| HAVEN| 44| 49|[{HAVEN, 0.952546...|
| LOC| 0.9975488185882568|855 Howard| 59| 69|[{855, 0.94682700...|
| LOC| 0.9984399676322937| Street| 70| 76|[{Street, 0.95823...|
| HOSP| 0.3670221269130707| HOSPITAL| 77| 85|[{HOSPITAL, 0.959...|
| LOC| 0.9990363121032715| Dutton| 86| 92|[{Dutton,, 0.9647...|
| LOC| 0.999313473701477| MI 49316| 94|102|[{MI, 0.94589012,...|
| PHONE| 0.9830010533332825| ( 123 )| 110|115|[{(123), 0.595334...|
| PHONE| 0.9080978035926819| 456| 116|119|[{456-1238, 0.955...|
| PHONE| 0.9378324151039124| -| 119|120|[{456-1238, 0.955...|
| PHONE| 0.8746233582496643| 1238| 120|124|[{456-1238, 0.955...|
| PATIENT|0.45354968309402466|hopedutton| 132|142|[{hopedutton@hope...|
| EMAIL|0.17805588245391846| hopehaven| 143|152|[{hopedutton@hope...|
| HOSP| 0.505658745765686| INVOICE| 157|164|[{INVOICE, 0.9661...|
+------------+-------------------+----------+-----+---+--------------------+
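In the rows above, one phone number is split into several PHONE entities and some entities have low scores. One possible post-processing step is sketched below in plain Python: drop entities below a score threshold, then merge neighbouring entities of the same group. The `Entity` class, the `merge_adjacent` function, and the `min_score` and `max_gap` values are illustrative assumptions, not part of the ScaleDP API; collecting the rows from the Spark DataFrame is not shown.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    # Hypothetical container mirroring the columns shown in the table.
    entity_group: str
    score: float
    word: str
    start: int
    end: int


def merge_adjacent(entities, min_score=0.5, max_gap=1):
    """Drop low-score entities, then merge neighbours of the same group.

    Two entities are merged when they share an entity_group and the second
    starts at most max_gap characters after the first ends. The merged
    entity keeps the lowest score of its parts.
    """
    kept = sorted((e for e in entities if e.score >= min_score),
                  key=lambda e: e.start)
    merged = []
    for ent in kept:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.entity_group == ent.entity_group
                and ent.start - prev.end <= max_gap):
            # Re-insert a space only where the source text had a gap.
            sep = " " if ent.start > prev.end else ""
            merged[-1] = Entity(prev.entity_group, min(prev.score, ent.score),
                                prev.word + sep + ent.word, prev.start, ent.end)
        else:
            merged.append(ent)
    return merged


# A few rows copied from the table above (scores rounded to 4 digits).
rows = [
    Entity("LOC", 0.9992, "Dutton", 10, 16),
    Entity("LOC", 0.9993, "MI", 18, 20),
    Entity("PHONE", 0.9830, "( 123 )", 110, 115),
    Entity("PHONE", 0.9081, "456", 116, 119),
    Entity("PHONE", 0.9378, "-", 119, 120),
    Entity("PHONE", 0.8746, "1238", 120, 124),
    Entity("EMAIL", 0.1781, "hopehaven", 143, 152),
]

for e in merge_adjacent(rows):
    print(e.entity_group, repr(e.word), e.start, e.end)
```

With these sample rows the four PHONE fragments collapse into a single span covering characters 110 to 124, the low-score EMAIL row is dropped, and the two LOC rows stay separate because they are two characters apart. Both thresholds would need tuning for real documents.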
Visualize NER results:
result.visualize_ner(labels_list=["DATE", "LOC"])
Original image with NER results:
result.show_image("image_with_boxes")
Comparison of the supported OCR engines:
| OCR engine | Bbox level | GPU support | Separate model for text detection | Processing time per page (CPU/GPU), seconds | Handwritten text support |
|---|---|---|---|---|---|
| Tesseract OCR | character | no | no | 0.2/no | not good |
| Tesseract OCR CLI | character | no | no | 0.2/no | not good |
| Easy OCR | word | yes | yes | | |
| Surya OCR | line | yes | yes | | |
| DocTR | word | yes | yes | | |
This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.