A fast and efficient PDF to Markdown converter optimized for LLM (Large Language Model) processing. This tool intelligently extracts text, tables, and images from PDF files and converts them into well-structured Markdown format.
- 🚀 Fast Processing: Efficient PDF parsing and conversion
- 📊 Table Extraction: Automatically detects and converts tables to Markdown format
- 🖼️ Image Support: Extracts and saves images from PDFs (can be disabled for text-only processing)
- 📝 Smart Formatting: Intelligently identifies headings based on font sizes
- 📈 Progress Tracking: Built-in progress callback support
- 🎯 LLM Optimized: Output format optimized for LLM consumption
- 📜 Free & Open Source: MIT licensed, free to use for commercial and personal projects
See the examples/ directory for more usage examples:
financial_report_cn/: Converting financial reports with tables and imagestable_data/: Converting PDFs with complex tablescar_user_manual/: Converting car user manuals with extensive images and structured content
poetry add fastpdf4llmpip install fastpdf4llmfrom fastpdf4llm import to_markdown
# Convert PDF to Markdown
markdown_content = to_markdown("path/to/your/document.pdf")
# Save to file
with open("output.md", "w", encoding="utf-8") as f:
f.write(markdown_content)The to_markdown function accepts multiple input types:
from fastpdf4llm import to_markdown
from pathlib import Path
from io import BytesIO
# String path (traditional)
markdown_content = to_markdown("path/to/document.pdf")
# Pathlib Path object
markdown_content = to_markdown(Path("path/to/document.pdf"))
# File object
with open("path/to/document.pdf", "rb") as f:
markdown_content = to_markdown(f)
# BytesIO (in-memory PDF)
with open("path/to/document.pdf", "rb") as f:
pdf_bytes = f.read()
bytes_io = BytesIO(pdf_bytes)
markdown_content = to_markdown(bytes_io)from fastpdf4llm import to_markdown
# Specify custom directory for extracted images
markdown_content = to_markdown(
"path/to/your/document.pdf",
image_dir="./images"
)from fastpdf4llm import to_markdown, ProgressInfo
def progress_callback(progress: ProgressInfo):
print(f"{progress.phase.value}: {progress.current_page}/{progress.total_pages} "
f"({progress.percentage:.1f}%) - {progress.message}")
markdown_content = to_markdown(
"path/to/your/document.pdf",
progress_callback=progress_callback
)from fastpdf4llm import to_markdown
from fastpdf4llm.models.parse_options import ParseOptions
# Customize parsing options for better text extraction
parse_options = ParseOptions(
x_tolerance_ratio=0.15, # Ratio of x_tolerance to page width (default: 0.15)
y_tolerance=3 # Control spacing between lines (default: 3)
)
markdown_content = to_markdown(
"path/to/your/document.pdf",
parse_options=parse_options
)from fastpdf4llm import to_markdown
# Disable image extraction for faster processing
# Useful when you only need text content
markdown_content = to_markdown(
"path/to/your/document.pdf",
extract_images=False
)from fastpdf4llm import to_markdown, ProgressInfo
from fastpdf4llm.models.parse_options import ParseOptions
def progress_callback(progress: ProgressInfo):
print(f"Progress: {progress.percentage:.1f}%")
parse_options = ParseOptions(x_tolerance_ratio=0.25, y_tolerance=5)
markdown_content = to_markdown(
"path/to/your/document.pdf",
image_dir="./images",
extract_images=True,
parse_options=parse_options,
progress_callback=progress_callback
)Convert a PDF file to Markdown format.
Parameters:
path_or_fp(Union[str, pathlib.Path, BufferedReader, BytesIO]): PDF file path, Path object, file handle, or BytesIO objectimage_dir(Optional[str]): Directory to save extracted images. Defaults to./tmp/images/. Only used whenextract_images=Trueextract_images(bool): Whether to extract and save images from PDF. Default:True- Set to
Falseto skip image extraction for faster processing - When
False, images are ignored and not included in the markdown output
- Set to
parse_options(Optional[ParseOptions]): Parsing options to control text extraction. Defaults toParseOptions(x_tolerance_ratio=0.15, y_tolerance=3)progress_callback(Optional[Callable[[ProgressInfo], None]]): Callback function for progress updates
Returns:
str: Markdown content of the PDF
Example:
from fastpdf4llm import to_markdown, ProgressInfo
from typing import Callable
def on_progress(progress: ProgressInfo):
print(f"Progress: {progress.percentage:.1f}%")
content = to_markdown(
"document.pdf",
image_dir="./output_images",
extract_images=True,
progress_callback=on_progress
)
# No image mode example
content_text_only = to_markdown(
"document.pdf",
extract_images=False
)Parsing options to customize PDF text extraction behavior.
Attributes:
x_tolerance_ratio(float): Ratio of x_tolerance to page width for controlling spacing tolerance between words horizontally. Default:0.15- Lower values (e.g., 0.1): More strict word separation (better for well-formatted PDFs)
- Higher values (e.g., 0.25): More lenient word grouping (better for PDFs with irregular spacing)
y_tolerance(float): Controls spacing tolerance between lines vertically. Default:3- Lower values: More strict line separation
- Higher values: More lenient line grouping
Example:
from fastpdf4llm.models.parse_options import ParseOptions
# For PDFs with tight spacing
tight_options = ParseOptions(x_tolerance_ratio=0.1, y_tolerance=1)
# For PDFs with loose spacing
loose_options = ParseOptions(x_tolerance_ratio=0.25, y_tolerance=5)
markdown_content = to_markdown("document.pdf", parse_options=tight_options)Progress information model for tracking conversion progress.
Attributes:
phase(ProcessPhase): Current processing phase (ANALYSISorCONVERSION)current_page(int): Current page being processedtotal_pages(int): Total number of pages in the PDFpercentage(float): Overall progress percentage (0-100)message(str): Status message
- Analysis Phase: Analyzes the PDF to identify font sizes and determine heading hierarchy
- Conversion Phase: Extracts text, tables, and images, converting them to Markdown format
- Smart Formatting: Automatically detects headings based on font size analysis
- Table Detection: Identifies and converts tables to Markdown table format
- Image Extraction: Extracts images and saves them to the specified directory
- Configurable Parsing: Adjustable tolerance settings for optimal text extraction from various PDF layouts
- Python >= 3.9
- pdfplumber >= 0.11.3
- loguru >= 0.7.0
- pydantic >= 2.0.0
# Clone the repository
git clone https://github.com/moria97/fastpdf4llm.git
cd fastpdf4llm
# Install dependencies
poetry install
# Install pre-commit hooks (runs automatically on git commit)
poetry run pre-commit installpoetry run pytest# Format code
poetry run ruff format .
# Lint code
poetry run ruff check .This project is inspired by the pdf2markdown4llm project by HawkClaws. We appreciate their work on PDF to Markdown conversion for LLM applications.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
https://github.com/moria97/fastpdf4llm