pdfdeal

Better RAG Effect!

🗺️ ENGLISH | 简体中文

Handle PDFs more easily and simply, using Doc2X's powerful document conversion capabilities for format-preserving file conversion and RAG enhancement.

Introduction

Doc2X Support

Doc2X is a new universal document OCR tool that converts images or PDF files into Markdown/LaTeX text while preserving formulas and text formatting. It performs better than similar tools in most scenarios. pdfdeal provides wrapper classes for sending requests to Doc2X.

Processing PDFs

Use various OCR or PDF recognition tools to recognize the images in a PDF and add the recognized text to the original text. You can set the output format to PDF, which ensures that the recognized text keeps the same page numbers as the original in the new PDF. pdfdeal also offers various practical file processing tools.

After converting and pre-processing PDFs with Doc2X, you can achieve better recognition rates in knowledge base applications such as graphrag, Dify, and FastGPT.

Cases

graphrag

See how to use it with graphrag. graphrag does not support recognizing PDFs, but you can use the CLI tool doc2x to convert them to txt documents for use.

FastGPT/Dify or other RAG systems

For knowledge base applications, you can use pdfdeal's built-in document enhancements, such as uploading images to remote storage services and adding breaks by paragraph. See Integration with RAG applications.
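As a rough illustration of what "adding breaks by paragraph" can mean for a knowledge base, the sketch below inserts a separator line between the paragraphs of converted Markdown so that a RAG system can split on it. The function name `add_paragraph_breaks` and the `<!-- break -->` separator are assumptions made for this example; this is not pdfdeal's built-in implementation.

```python
import re


def add_paragraph_breaks(text: str, separator: str = "<!-- break -->") -> str:
    """Insert `separator` between paragraphs (blocks separated by blank lines)."""
    # Split on one or more blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Rejoin with the separator on its own line between paragraphs.
    return f"\n\n{separator}\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "# Title\n\nFirst paragraph.\n\n\nSecond paragraph."
    print(add_paragraph_breaks(sample))
```

Note that this simple split also breaks apart any block that contains blank lines internally, such as a multi-paragraph code listing, so a real pipeline would need to treat those blocks separately.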

RAG system plug-in integration

The Doc2X plugin, which supports PDF/image conversion, is available in FastGPT 4.8.9 and later.

Documentation

For details, please refer to the documentation.

Or check out the documentation repository pdfdeal-docs.

Quick Start

For details, please refer to the documentation

Installation

Install from PyPI:

pip install --upgrade pdfdeal

Using Doc2X as PDF deal tool

from pdfdeal import Doc2X
from pdfdeal import get_files

client = Doc2X()
file_list, rename = get_files(path="tests/pdf", mode="pdf", out="pdf")
success, failed, flag = client.pdfdeal(
    pdf_file=file_list,
    output_path="./Output/test/multiple/pdfdeal",
    output_names=rename,
)
print(success)
print(failed)
print(flag)

Using pytesseract as an OCR engine

When using "pytesseract", make sure that the tesseract OCR engine itself is installed first, then install pdfdeal with the pytesseract extra:

pip install 'pdfdeal[pytesseract]'

from pdfdeal import deal_pdf, get_files

files, rename = get_files("tests/pdf", "pdf", "md")
output_path, failed, flag = deal_pdf(
    pdf_file=files,
    output_format="md",
    ocr="pytesseract",
    language=["eng"],
    output_path="Output",
    output_names=rename,
)
for f in output_path:
    print(f"Save processed file to {f}")
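Because the "pytesseract" engine relies on the tesseract executable being installed, it can help to check for it before starting a batch. The sketch below uses only the standard library; the helper name `tesseract_available` is an assumption for this example and is not part of pdfdeal.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found on PATH")
    else:
        print("tesseract not found on PATH; install it before using ocr='pytesseract'")
```

This only checks that the executable is reachable on PATH; it does not verify that the language data requested in `language` (for example "eng") is installed.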

See the online documentation for details.
