22 changes: 22 additions & 0 deletions Dockerfile
@@ -0,0 +1,22 @@
FROM python:3.12.3-alpine

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

RUN apk add --no-cache \
build-base \
python3-dev \
py3-pip \
lapack-dev \
gfortran \
libffi-dev

WORKDIR /app

COPY . /app/
COPY pyproject.toml /app/

RUN pip install poetry
RUN poetry install --no-root

ENTRYPOINT [ "poetry", "run", "python3", "extract_text.py" ]
24 changes: 24 additions & 0 deletions README.md
@@ -89,6 +89,30 @@ text = dictionary_output(pdf, sort=False, page_range=[1,2,3]) # Optional argumen

If you want more customization, check out the `pdftext.extraction._get_pages` function for a starting point to dig deeper. pdftext is a pretty thin wrapper around [pypdfium2](https://pypdfium2.readthedocs.io/en/stable/), so you might want to look at the documentation for that as well.

# Run on Docker

Clone the repository (replace `repository` with its URL):
```
git clone repository
```

Build the Docker image:
```
cd pdftext
docker build -t pdftext .
```

Run with Docker:
```
# write out a text file
docker run pdftext PDF_PATH --out_path output.txt

# write out a json file
docker run pdftext PDF_PATH --out_path output.json --json
```
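
A container only sees files that are mounted into it, so `PDF_PATH` and the output file normally need to live in a directory shared with the host. The sketch below is one way to script that from Python. It assumes the image forwards its arguments to `extract_text.py` as in the commands above. The `/data` mount point and both helper names (`build_docker_command`, `run_pdftext_in_docker`) are made up for illustration and are not part of pdftext.

```python
import subprocess
from pathlib import Path


def build_docker_command(pdf_path, out_name, as_json=False, image="pdftext"):
    """Build a `docker run` argv that bind-mounts the PDF's folder at /data.

    Hypothetical helper: /data is an arbitrary mount point chosen for this
    sketch, and `image` is the tag used in the build step above.
    """
    pdf = Path(pdf_path).resolve()
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{pdf.parent}:/data",        # host folder -> /data in the container
        image,
        f"/data/{pdf.name}",                # PDF path as seen inside the container
        "--out_path", f"/data/{out_name}",  # output lands next to the PDF on the host
    ]
    if as_json:
        cmd.append("--json")
    return cmd


def run_pdftext_in_docker(pdf_path, out_name, as_json=False):
    """Run the container and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_docker_command(pdf_path, out_name, as_json), check=True)
```

For example, `run_pdftext_in_docker("docs/report.pdf", "report.json", as_json=True)` would write `report.json` into `docs/` on the host, provided Docker is installed and the image has been built.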

# Benchmarks

I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext. I chose pymupdf because it extracts blocks and lines. Pdfplumber extracts words and bboxes. I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual character/line/block and bbox information.
2 changes: 1 addition & 1 deletion poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -35,4 +35,4 @@ requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.scripts]
pdftext = "extract_text:main"
pdftext = "extract_text:main"
7 changes: 7 additions & 0 deletions requirements.txt
@@ -0,0 +1,7 @@
joblib==1.4.0
numpy==1.26.4
pydantic==2.7.1
pydantic-settings==2.2.1
pypdfium2==4.29.0
scikit-learn==1.4.2