diff --git a/README.md b/README.md
index 765509e..98c7d58 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,8 @@
 _Extractous offers a fast and efficient solution for extracting content and metadata from various documents types such as PDF, Word, HTML, and [many other formats](#supported-file-formats).
-Our goal is to deliver an efficient comprehensive solution with bindings for many programming languages._
+Our goal is to deliver a fast, efficient, and comprehensive solution in Rust with bindings for many programming
+languages._
@@ -30,32 +31,105 @@ Our goal is to deliver an efficient comprehensive solution with bindings for man
 For complete benchmarking details please consult our [benchmarking repository](https://github.com/yobix-ai/extractous-benchmarks)
 ![unstructured_vs_extractous](https://github.com/yobix-ai/extractous-benchmarks/raw/main/docs/extractous_vs_unstructured.gif)
-* demo running at 5x recoding speed
+* demo running at 5x recording speed
 ## Why Extractous?
-Extractous was mainly inspired by the [Unstructured Python library](https://github.com/Unstructured-IO/unstructured).
-While Unstructured offers a good solution for parsing unstructured content, we see 2 main issues with it:
+**Extractous** was born out of frustration with needing yet another service to handle content extraction from
+unstructured data. Do we really need to call external APIs or run special servers just for content extraction? Can't
+we perform the extraction locally and efficiently?
-* Performance: data processing is mainly a cpu-bound problem and Python is not the best choice for such tasks
+While researching this space, we found that **unstructured-io** offers a good solution for parsing unstructured content and can
+run in-process. However, its performance is very poor and it has many limitations:
+* **unstructured-io** wraps many heavy Python libraries, making it both slow and memory hungry. [See the benchmarks for more details](https://github.com/yobix-ai/extractous-benchmarks).
+* Data processing is mainly a CPU-bound problem and Python is not the best choice for such tasks
  because of its Global Interpreter Lock (GIL) which makes it hard to utilize multiple cores.
-* [Unstructured](https://github.com/Unstructured-IO/unstructured) is becoming more of an LLM framework rather than
- just text and metadata parsing library.
+* **unstructured-io** is becoming increasingly complex as it focuses on becoming more of a framework rather than
+ just a text and metadata extraction library.
+
+In contrast, **Extractous** is built in Rust, a language renowned for its memory safety and high performance. By
+leveraging Rust's multithreading capabilities and zero-cost abstractions, Extractous achieves significantly faster
+processing speeds. **Extractous** maintains a dedicated focus on text and metadata extraction, ensuring optimized
+performance and reliability in its core functionality.
+
+## 🌳 Key Features
+* Fast and efficient unstructured data extraction.
+* Clear and simple API for extracting text and metadata content.
+* Autodetects document type and extracts content accordingly.
+* Supports [many file formats](#supported-file-formats).
+* Extracts text from images and scanned documents with OCR through [tesseract-ocr](https://github.com/tesseract-ocr/tesseract).
+* Leverages Rust performance and memory safety and provides bindings for [Python](https://pypi.org/project/extractous/)
+ and JavaScript/TypeScript (coming soon).
+* Comprehensive documentation and examples to help you get started quickly.
+* Free for Commercial Use: Apache 2.0 License.
-Extractous will focus only on the text and metadata extraction part. The core is written in Rust, leveraging its
-memory safety, multithreading and zero cost abstractions. Extractous will provide bindings for many programming
-languages.
+## 🚀 Quickstart
+Extractous provides a simple and easy-to-use API for extracting content from various file formats. Below are examples:
-## Features
+### Python
+* Extract a file's content to a string:
+```python
+from extractous import Extractor
-* Clear simple API for extracting text and metadata content.
-* Support for [many file formats](#supported-file-formats).
-* Strives to be efficient and fast.
-* Comprehensive documentation and examples to help you get started quickly.
+# Create a new extractor
+extractor = Extractor()
+extractor.set_extract_string_max_length(1000)
+
+# Extract text from a file
+result = extractor.extract_file_to_string("README.md")
+print(result)
+```
+
+### Rust
+* Extract a file's content to a string:
+```rust
+use extractous::Extractor;
+use extractous::PdfParserConfig;
+
+// Create a new extractor. Note it uses a consuming builder pattern
+let mut extractor = Extractor::new().set_extract_string_max_length(1000);
+
+// Extract text from a file
+let text = extractor.extract_file_to_string("README.md").unwrap();
+println!("{}", text);
+```
+
+## 🔥 Performance
+* **Extractous** is built to be fast. Don't take our word for it: you can run the [benchmarks](https://github.com/yobix-ai/extractous-benchmarks) yourself. For example, when extracting content from sec10 filings
+ PDF forms, **Extractous** is 22x faster than **unstructured-io**:
+
+![extractous_speedup_relative_to_unstructured](https://github.com/yobix-ai/extractous-benchmarks/raw/main/docs/extractous_speedup_relative_to_unstructured.png)
+
+* Beyond speed, it is also memory efficient: **Extractous** allocates 12x less memory than **unstructured-io**:
+
+![extractous_memory_efficiency_relative_to_unstructured](https://github.com/yobix-ai/extractous-benchmarks/raw/main/docs/extractous_memory_efficiency_relative_to_unstructured.png)
+
+
+
+## 📄 Supported file formats
+
+| **Category** | **Supported Formats** | **Notes** |
+|---------------------|---------------------------------------------------------|------------------------------------------------|
+| **Microsoft Office**| DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF | Includes legacy and modern Office file formats |
+| **OpenOffice** | ODT, ODS, ODP | OpenDocument formats |
+| **PDF** | PDF | Extracts embedded content and supports OCR |
+| **Spreadsheets** | CSV, TSV | Plain text spreadsheet formats |
+| **Web Documents** | HTML, XML | Parses and extracts content from web documents |
+| **E-Books** | EPUB | EPUB format for electronic books |
+| **Text Files** | TXT, Markdown | Plain text formats |
+| **Images** | PNG, JPEG, TIFF, BMP, GIF, ICO, PSD, SVG | Extracts embedded text with OCR |
+| **E-Mail** | EML, MSG, MBOX, PST | Extracts content, headers, and attachments |
+
+[//]: # (| **Archives** | ZIP, TAR, GZIP, RAR, 7Z | Extracts content from compressed archives |)
+[//]: # (| **Audio** | MP3, WAV, OGG, FLAC, AU, MIDI, AIFF, APE | Extracts metadata such as ID3 tags |)
+[//]: # (| **Video** | MP4, AVI, MOV, WMV, FLV, MKV, WebM | Extracts metadata and basic information |)
+[//]: # (| **CAD Files** | DXF, DWG | Supports CAD formats for engineering drawings |)
+[//]: # (| **Other** | ICS (Calendar), VCF (vCard) | Supports calendar and contact file formats |)
+[//]: # (| **Geospatial** | KML, KMZ, GeoJSON | Extracts geospatial data and metadata |)
+[//]: # (| **Font Files** | TTF, OTF | Extracts metadata from font files |)
-## Supported file formats
+## 🤝 Contributing
+Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or new features to propose.
-| File Format | Rust Core | Python Binding |
-|-------------|-----------|----------------|
-| pdf | ✅ | ✅ |
-| csv | ✅ | ✅ |
\ No newline at end of file
+## 🕮 License
+This project is licensed under the Apache License 2.0. See the LICENSE file for details.
\ No newline at end of file
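
For readers trying the Python quickstart added in this diff, here is a minimal sketch of applying the same call to several files while recording failures instead of stopping at the first one. It assumes only the `extract_file_to_string` method shown in the quickstart, and that it returns a string as the quickstart suggests. The helper name `extract_batch` and its return shape are hypothetical and not part of Extractous.

```python
from typing import Dict, Iterable, Tuple


def extract_batch(extractor, paths: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Call extractor.extract_file_to_string on each path.

    `extractor` is any object exposing extract_file_to_string(path), as in
    the quickstart. `extract_batch` is a hypothetical helper for
    illustration, not an Extractous API.
    """
    texts: Dict[str, str] = {}
    errors: Dict[str, Exception] = {}
    for path in paths:
        try:
            texts[path] = extractor.extract_file_to_string(path)
        except Exception as exc:
            # Keep going and remember which file failed and why.
            errors[path] = exc
    return texts, errors
```

With the package installed, this could presumably be called as `texts, errors = extract_batch(Extractor(), ["README.md", "report.pdf"])`, where `report.pdf` is a placeholder file name.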