docs(samples): add Document AI and Gemini code sample (#1072)

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Holt Skinner <[email protected]>
Co-authored-by: Ulises Jimenez <[email protected]>
Co-authored-by: Ulises Jimenez <[email protected]>
5 people authored Nov 13, 2024
1 parent 6c9a691 commit cd8e701
Showing 8 changed files with 608 additions and 0 deletions.
# Document AI and Gemini: Unlocking Entity Extraction Power

This repository contains a Python script that demonstrates how to harness the combined capabilities of Google Cloud's Document AI and Gemini API to extract entities from PDF documents. Explore the strengths and nuances of each API through a side-by-side comparison of their outputs, gaining valuable insights into their performance.

The code sample follows the official documentation to send an [online processing request](https://cloud.google.com/document-ai/docs/samples/documentai-process-document) and a [batch processing request](https://cloud.google.com/document-ai/docs/samples/documentai-batch-process-document#documentai_batch_process_document-python), and to [handle the processing response](https://cloud.google.com/document-ai/docs/handle-response).

## Why Compare Document AI and Gemini?

In the realm of extracting valuable information from documents, Document AI and Gemini emerge as two powerful tools, each with its unique strengths and specialized capabilities.

### Document AI

Document AI specializes in extracting data from a wide variety of documents, including invoices, receipts, contracts, and more. It leverages advanced optical character recognition (OCR) and machine learning models to accurately identify and classify key information, transforming unstructured text into structured data.

#### Document AI Use Cases

- **Automating Invoice Processing**: Document AI can efficiently extract invoice details like vendor information, invoice number, line items, and total amounts, streamlining processes and reducing manual errors.
- **Streamlining Contract Analysis**: By identifying key clauses, dates, and parties involved in contracts, Document AI enables faster and more accurate legal document review.
- **Digitizing Medical Records**: Document AI can extract patient information, diagnoses, medications, and other crucial details from medical forms and records, facilitating efficient data management and analysis in the healthcare sector.

### Gemini

Gemini is a family of large language models known for its capabilities in natural language understanding and generation. It excels at comprehending the nuances of human language, enabling it to perform tasks such as summarization, translation, question answering, and even creative writing.

#### Gemini Use Cases

- **Creating Chatbots and Virtual Assistants**: Gemini's natural language processing prowess empowers it to engage in dynamic conversations, providing personalized customer support or acting as an intelligent assistant across various applications.
- **Generating Content**: Whether it's drafting emails, writing articles, or composing creative stories, Gemini can generate high-quality text based on given prompts or contexts.
- **Summarizing Complex Documents**: Gemini can distill lengthy documents into concise summaries, saving time and facilitating efficient information consumption.

### Comparing Document AI and Gemini: Synergy and Specialization

While both tools offer entity extraction capabilities, their focus and strengths differ significantly.

- **Document AI**: Ideal for extracting data from specific document types with predefined schemas.
- **Gemini**: Best suited for extracting entities from unstructured text and understanding the nuances of natural language.

By comparing their results in entity extraction tasks, you gain insights into their unique approaches and potential trade-offs, enabling you to choose the right tool for specific needs.
Moreover, combining the power of Document AI and Gemini can lead to innovative solutions. For instance, you can use Document AI to extract structured information from documents and then leverage Gemini to generate natural language summaries or insights based on the extracted data.
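The combined pipeline described above can be sketched as follows. This is a hypothetical illustration, not code from this repository: `build_summary_prompt` and `summarize_entities` are assumed helper names, the prompt wording and model name are assumptions, and the Gemini call requires an authenticated Vertex AI environment.

```python
from typing import Dict


def build_summary_prompt(entities: Dict[str, str]) -> str:
    """Turn structured entities from Document AI into a summarization prompt."""
    lines = [f"- {name}: {value}" for name, value in sorted(entities.items())]
    return "Summarize these extracted invoice fields in one sentence:\n" + "\n".join(lines)


def summarize_entities(entities: Dict[str, str]) -> str:
    """Ask Gemini for a natural-language summary of extracted entities."""
    # Requires Vertex AI credentials; not executed in this sketch.
    from vertexai.generative_models import GenerativeModel

    model = GenerativeModel("gemini-1.5-flash")  # assumed model name
    return model.generate_content(build_summary_prompt(entities)).text
```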

## What You'll Gain

- **Hands-on Experience**: The included Python script provides a practical example of integrating Document AI and Gemini, allowing you to experiment with both APIs and explore their capabilities firsthand.
- **Deeper Understanding**: The comparison of API results sheds light on the strengths and potential trade-offs of each tool, helping you make informed decisions for future projects.
- **Customization**: The code serves as a foundation that you can adapt and extend for your own entity extraction needs, tailoring it to specific document and entity types.

## Key Features

- **Document AI Integration**: The script utilizes Document AI's robust entity extraction capabilities, showcasing its ability to identify and classify key information within PDF documents.
- **Gemini API Interaction**: The script leverages Gemini's natural language understanding to extract entities based on a carefully crafted prompt, demonstrating its versatility.
- **Side-by-Side Comparison**: The output highlights the strengths and potential differences in entity extraction results from both APIs, providing valuable insights into their performance.

## Setup

1. **Install Dependencies**

```bash
pip install --upgrade google-cloud-aiplatform
pip install -q -U google-generativeai
pip install -r requirements.txt
```

2. **Assumptions**

This repository assumes that the [codelab](https://www.cloudskillsboost.google/focuses/67855?parent=catalog) has been completed, that a dataset with the test documents is available, and that a Document AI extractor exists.

After completing the codelab, additionally create two buckets for batch processing: a temporary bucket and an output bucket.

```bash
TEMP_BUCKET_URI="gs://documentai-temp-${PROJECT_ID}-unique"
gsutil mb -l "${LOCATION}" -p "${PROJECT_ID}" "${TEMP_BUCKET_URI}"
OUT_BUCKET_URI="gs://documentai-output-${PROJECT_ID}-unique"
gsutil mb -l "${LOCATION}" -p "${PROJECT_ID}" "${OUT_BUCKET_URI}"
```

## Diagram

![Architecture diagram](diagram.png)

## Code Overview

- `test_doc_ai.py`: This script orchestrates the entire process:

- It first uses the Document AI API to process the PDF and extract entities.
- Then, it utilizes the Gemini API with a tailored prompt to extract entities from the same PDF.
- Finally, it compares the results from both APIs and prints a summary.

- `extractor.py`: Contains classes for interacting with the Document AI API for both online and batch processing.

- `entity_processor.py`: Defines classes for extracting entities from the Document AI output and the Gemini API response.

- `prompts_module.py`: Provides functions to generate prompts for entity extraction and comparison tasks for the Gemini API.

- `temp_file_uploader.py`: Handles uploading files to a temporary Google Cloud Storage location for processing.
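A condensed sketch of how the modules above could be wired together. The class names come from this repository, but `compare_entities`, the project/processor identifiers, the prompt, and the file paths are hypothetical placeholders, and running `main` requires Google Cloud credentials.

```python
from typing import Dict, Optional, Tuple


def compare_entities(
    docai: Dict[str, str], gemini: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Pair up values from both extractors for a side-by-side summary."""
    all_types = sorted(set(docai) | set(gemini))
    return {t: (docai.get(t), gemini.get(t)) for t in all_types}


def main() -> None:
    # Imports deferred so this sketch can be read without GCP dependencies.
    from entity_processor import DocumentAIEntityExtractor, ModelBasedEntityExtractor
    from extractor import OnlineDocumentExtractor

    document = OnlineDocumentExtractor(
        project_id="my-project", location="us", processor_id="my-processor"  # placeholders
    ).process_document("invoice.pdf")
    docai_entities = DocumentAIEntityExtractor(document).extract_entities()
    gemini_entities = ModelBasedEntityExtractor(
        "gemini-1.5-pro", "Extract the invoice entities as JSON.", "gs://bucket/invoice.pdf"
    ).extract_entities()
    for entity_type, (a, b) in compare_entities(docai_entities, gemini_entities).items():
        print(f"{entity_type}: Document AI={a!r} Gemini={b!r}")
```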

## Notes

- Ensure that your Google Cloud project has the necessary APIs enabled (Document AI, Vertex AI, etc.).
- The script is configured to process a single PDF file. You can modify it to process multiple files or handle different input sources.
- The accuracy and performance of entity extraction may vary depending on the document complexity and the chosen API parameters.
- This script is intended for demonstration purposes. You can adapt and extend it to suit your specific use case and integrate it into your applications.
---

*(binary file not displayed)*

---

`entity_processor.py`
import json
import mimetypes
from abc import ABC, abstractmethod
from typing import Dict

from google.cloud import documentai
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part


class EntityExtractor(ABC):
"""Abstract Base Class for entity extraction."""

@abstractmethod
def extract_entities(self) -> Dict:
"""Abstract method to extract entities."""


class DocumentAIEntityExtractor(EntityExtractor):
"""Class for Document AI entity extraction"""

def __init__(self, document: documentai.Document) -> None:
self.document = document

def extract_entities(self) -> Dict:
entities = {}
for entity in self.document.entities:
entities[entity.type_] = entity.mention_text
return entities


class ModelBasedEntityExtractor(EntityExtractor):
"""Class for Gemini entity extraction"""

def __init__(self, model_version: str, prompt: str, file_path: str) -> None:
self.config = GenerationConfig(
temperature=0.0,
top_p=0.8,
top_k=32,
candidate_count=1,
max_output_tokens=2048,
response_mime_type="application/json",
)
self.model = GenerativeModel(model_version, generation_config=self.config)
self.prompt = prompt
        mime_type = mimetypes.guess_type(file_path)[0]
        if mime_type != "application/pdf":
            raise ValueError("Only PDF files are supported, aborting")
self.file_path = file_path

def extract_entities(self) -> Dict:
pdf_file = Part.from_uri(self.file_path, mime_type="application/pdf")
contents = [pdf_file, self.prompt]
response = self.model.generate_content(contents)

cleaned_string = response.text.replace("```json\n", "").replace("\n```", "")
entities = json.loads(cleaned_string)
return entities
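The Gemini extractor above strips a Markdown code fence from the model response before parsing JSON. A minimal standalone sketch of that cleanup step (`parse_model_json` is a hypothetical helper, not part of this repository; the fence string is built from its character code only to keep this example readable):

```python
import json

# A triple-backtick Markdown fence, built from the backtick character code.
FENCE = "\x60" * 3


def parse_model_json(raw: str) -> dict:
    """Strip an optional Markdown JSON fence from a model response and parse it."""
    cleaned = raw.replace(FENCE + "json\n", "").replace("\n" + FENCE, "")
    return json.loads(cleaned)
```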
---

`extractor.py`
import re
from typing import Any, Optional

from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import InternalServerError, RetryError
from google.cloud import documentai, storage

from temp_file_uploader import TempFileUploader


class DocumentExtractor:
"""Abstract base class for document extraction."""

def __init__(
self,
project_id: str,
location: str,
processor_id: str,
processor_version_id: Optional[str] = None,
):
self.project_id = project_id
self.location = location
self.processor_id = processor_id
self.processor_version_id = processor_version_id
self.client = documentai.DocumentProcessorServiceClient(
client_options=ClientOptions(
api_endpoint=f"{location}-documentai.googleapis.com"
)
)
        self.processor_name = self._get_processor_name()

    def _get_processor_name(self) -> Any:
if self.processor_version_id:
return self.client.processor_version_path(
self.project_id,
self.location,
self.processor_id,
self.processor_version_id,
)
return self.client.processor_path(
self.project_id, self.location, self.processor_id
)

def process_document(self, file_path: str, mime_type: str) -> documentai.Document:
"""abstract function for document processing"""
raise NotImplementedError


class OnlineDocumentExtractor(DocumentExtractor):
"""
Processes documents using the online Document AI API.
"""

def process_document(
self, file_path: str, mime_type: str = "application/pdf"
) -> documentai.Document:
with open(file_path, "rb") as image:
image_content = image.read()

request = documentai.ProcessRequest(
name=self.processor_name,
raw_document=documentai.RawDocument(
content=image_content, mime_type=mime_type
),
)

result = self.client.process_document(request=request)
return result.document


class BatchDocumentExtractor(DocumentExtractor):
"""
Processes documents using the batch Document AI API.
"""

# pylint: disable=too-many-arguments
def __init__(
self,
project_id: str,
location: str,
processor_id: str,
gcs_output_uri: str,
gcs_temp_uri: str,
processor_version_id: str,
timeout: int = 400,
):
super().__init__(project_id, location, processor_id, processor_version_id)
self.gcs_output_uri = gcs_output_uri
self.timeout = timeout
self.storage_client = storage.Client()
self.temp_file_uploader = TempFileUploader(gcs_temp_uri)

def process_document(self, file_path: str, mime_type: str) -> documentai.Document:
gcs_input_uri = self.temp_file_uploader.upload_file(file_path)
document = self._process_document_batch(gcs_input_uri, mime_type)
self.temp_file_uploader.delete_file()
return document

# pylint: disable=too-many-locals
def _process_document_batch(
self, gcs_input_uri: str, mime_type: str
) -> documentai.Document:
gcs_document = documentai.GcsDocument(
gcs_uri=gcs_input_uri, mime_type=mime_type
)
gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)

gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
gcs_uri=self.gcs_output_uri
)
output_config = documentai.DocumentOutputConfig(
gcs_output_config=gcs_output_config
)

request = documentai.BatchProcessRequest(
name=self.processor_name,
input_documents=input_config,
document_output_config=output_config,
)

operation = self.client.batch_process_documents(request)
try:
print(f"Waiting for operation ({operation.operation.name}) to complete...")
operation.result(timeout=self.timeout)
except (RetryError, InternalServerError) as e:
print(e.message)

metadata = documentai.BatchProcessMetadata(operation.metadata)
if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
raise ValueError(f"Batch Process Failed: {metadata.state_message}")

# Retrieve the processed document from GCS
for process in list(metadata.individual_process_statuses):
matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
if not matches:
print(
"Could not parse output GCS destination:",
process.output_gcs_destination,
)
continue

output_bucket, output_prefix = matches.groups()
output_blobs = self.storage_client.list_blobs(
output_bucket, prefix=output_prefix
)
for blob in output_blobs:
if blob.content_type == "application/json":
print(f"Fetching {blob.name}")
return documentai.Document.from_json(
blob.download_as_bytes(), ignore_unknown_fields=True
)

raise FileNotFoundError("Processed document not found in GCS.")
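The batch extractor above locates its results by parsing `gs://bucket/prefix` output destinations with a regular expression. A standalone sketch of that parsing step (`split_gcs_uri` is a hypothetical helper mirroring the regex used in `_process_document_batch`):

```python
import re
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs://bucket/prefix URI into (bucket, prefix)."""
    matches = re.match(r"gs://(.*?)/(.*)", uri)
    if not matches:
        raise ValueError(f"Could not parse GCS destination: {uri}")
    bucket, prefix = matches.groups()
    return bucket, prefix
```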