docs(samples): add Document AI and Gemini code sample (#1072)

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Holt Skinner <[email protected]>
Co-authored-by: Ulises Jimenez <[email protected]>
Co-authored-by: Ulises Jimenez <[email protected]>
5 people authored Nov 13, 2024
1 parent 6c9a691 commit cd8e701
Showing 8 changed files with 608 additions and 0 deletions.
# Document AI and Gemini: Unlocking Entity Extraction Power

This repository contains a Python script that demonstrates how to harness the combined capabilities of Google Cloud's Document AI and Gemini API to extract entities from PDF documents. Explore the strengths and nuances of each API through a side-by-side comparison of their outputs, gaining valuable insights into their performance.

The code sample follows the official documentation to send an [online processing request](https://cloud.google.com/document-ai/docs/samples/documentai-process-document) and a [batch processing request](https://cloud.google.com/document-ai/docs/samples/documentai-batch-process-document#documentai_batch_process_document-python), and to [handle the processing response](https://cloud.google.com/document-ai/docs/handle-response).

## Why Compare Document AI and Gemini?

In the realm of extracting valuable information from documents, Document AI and Gemini emerge as two powerful tools, each with its unique strengths and specialized capabilities.

### Document AI

Document AI specializes in extracting data from a wide variety of documents, including invoices, receipts, contracts, and more. It leverages advanced optical character recognition (OCR) and machine learning models to accurately identify and classify key information, transforming unstructured text into structured data.

#### Document AI Use Cases

- **Automating Invoice Processing**: Document AI can efficiently extract invoice details like vendor information, invoice number, line items, and total amounts, streamlining processes and reducing manual errors.
- **Streamlining Contract Analysis**: By identifying key clauses, dates, and parties involved in contracts, Document AI enables faster and more accurate legal document review.
- **Digitizing Medical Records**: Document AI can extract patient information, diagnoses, medications, and other crucial details from medical forms and records, facilitating efficient data management and analysis in the healthcare sector.

### Gemini

Gemini is a family of large language models known for its capabilities in natural language understanding and generation. It excels at comprehending the nuances of human language, enabling it to perform tasks such as summarization, translation, question answering, and even creative writing.

#### Gemini Use Cases

- **Creating Chatbots and Virtual Assistants**: Gemini's natural language processing prowess empowers it to engage in dynamic conversations, providing personalized customer support or acting as an intelligent assistant across various applications.
- **Generating Content**: Whether it's drafting emails, writing articles, or composing creative stories, Gemini can generate high-quality text based on given prompts or contexts.
- **Summarizing Complex Documents**: Gemini can distill lengthy documents into concise summaries, saving time and facilitating efficient information consumption.

### Comparing Document AI and Gemini: Synergy and Specialization

While both tools offer entity extraction capabilities, their focus and strengths differ significantly.

- **Document AI**: Ideal for extracting data from specific document types with predefined schemas.
- **Gemini**: Best suited for extracting entities from unstructured text and understanding the nuances of natural language.

By comparing their results in entity extraction tasks, you gain insights into their unique approaches and potential trade-offs, enabling you to choose the right tool for specific needs.
Moreover, combining the power of Document AI and Gemini can lead to innovative solutions. For instance, you can use Document AI to extract structured information from documents and then leverage Gemini to generate natural language summaries or insights based on the extracted data.
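The combined pipeline described above can be sketched as follows. This is a hypothetical illustration, not code from this repository: `build_summary_prompt` and `summarize_entities` are assumed helper names, the prompt wording and model name are assumptions, and the Gemini call requires an authenticated Vertex AI environment.

```python
from typing import Dict


def build_summary_prompt(entities: Dict[str, str]) -> str:
    """Turn structured entities from Document AI into a summarization prompt."""
    lines = [f"- {name}: {value}" for name, value in sorted(entities.items())]
    return "Summarize these extracted invoice fields in one sentence:\n" + "\n".join(lines)


def summarize_entities(entities: Dict[str, str]) -> str:
    """Ask Gemini for a natural-language summary of extracted entities."""
    # Requires Vertex AI credentials; not executed in this sketch.
    from vertexai.generative_models import GenerativeModel

    model = GenerativeModel("gemini-1.5-flash")  # assumed model name
    return model.generate_content(build_summary_prompt(entities)).text
```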

## What You'll Gain

- **Hands-on Experience**: The included Python script provides a practical example of integrating Document AI and Gemini, allowing you to experiment with both APIs and explore their capabilities firsthand.
- **Deeper Understanding**: The comparison of API results sheds light on the strengths and potential trade-offs of each tool, helping you make informed decisions for future projects.
- **Customization**: The code serves as a foundation that you can adapt and extend for your own entity extraction needs, tailoring it to specific document and entity types.

## Key Features

- **Document AI Integration**: The script utilizes Document AI's robust entity extraction capabilities, showcasing its ability to identify and classify key information within PDF documents.
- **Gemini API Interaction**: The script leverages Gemini's natural language understanding to extract entities based on a carefully crafted prompt, demonstrating its versatility.
- **Side-by-Side Comparison**: The output highlights the strengths and potential differences in entity extraction results from both APIs, providing valuable insights into their performance.

## Setup

1. **Install Dependencies**

```bash
pip install --upgrade google-cloud-aiplatform
pip install -q -U google-generativeai
pip install -r requirements.txt
```

2. **Assumptions**

This repository assumes that the [codelab](https://www.cloudskillsboost.google/focuses/67855?parent=catalog) has been completed, that a dataset with the test documents is available, and that a Document AI extractor exists.

After completing the codelab, additionally create two buckets for batch processing: a temporary bucket and an output bucket.

```bash
TEMP_BUCKET_URI="gs://documentai-temp-${PROJECT_ID}-unique"
gsutil mb -l "${LOCATION}" -p "${PROJECT_ID}" "${TEMP_BUCKET_URI}"
OUT_BUCKET_URI="gs://documentai-output-${PROJECT_ID}-unique"
gsutil mb -l "${LOCATION}" -p "${PROJECT_ID}" "${OUT_BUCKET_URI}"
```

## Diagram

![Architecture diagram](diagram.png)

## Code Overview

- `test_doc_ai.py`: This script orchestrates the entire process:

- It first uses the Document AI API to process the PDF and extract entities.
- Then, it utilizes the Gemini API with a tailored prompt to extract entities from the same PDF.
- Finally, it compares the results from both APIs and prints a summary.

- `extractor.py`: Contains classes for interacting with the Document AI API for both online and batch processing.

- `entity_processor.py`: Defines classes for extracting entities from the Document AI output and the Gemini API response.

- `prompts_module.py`: Provides functions to generate prompts for entity extraction and comparison tasks for the Gemini API.

- `temp_file_uploader.py`: Handles uploading files to a temporary Google Cloud Storage location for processing.
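A condensed sketch of how the modules above could be wired together. The class names come from this repository, but `compare_entities`, the project/processor identifiers, the prompt, and the file paths are hypothetical placeholders, and running `main` requires Google Cloud credentials.

```python
from typing import Dict, Optional, Tuple


def compare_entities(
    docai: Dict[str, str], gemini: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Pair up values from both extractors for a side-by-side summary."""
    all_types = sorted(set(docai) | set(gemini))
    return {t: (docai.get(t), gemini.get(t)) for t in all_types}


def main() -> None:
    # Imports deferred so this sketch can be read without GCP dependencies.
    from entity_processor import DocumentAIEntityExtractor, ModelBasedEntityExtractor
    from extractor import OnlineDocumentExtractor

    document = OnlineDocumentExtractor(
        project_id="my-project", location="us", processor_id="my-processor"  # placeholders
    ).process_document("invoice.pdf")
    docai_entities = DocumentAIEntityExtractor(document).extract_entities()
    gemini_entities = ModelBasedEntityExtractor(
        "gemini-1.5-pro", "Extract the invoice entities as JSON.", "gs://bucket/invoice.pdf"
    ).extract_entities()
    for entity_type, (a, b) in compare_entities(docai_entities, gemini_entities).items():
        print(f"{entity_type}: Document AI={a!r} Gemini={b!r}")
```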

## Notes

- Ensure that your Google Cloud project has the necessary APIs enabled (Document AI, Vertex AI, etc.).
- The script is configured to process a single PDF file. You can modify it to process multiple files or handle different input sources.
- The accuracy and performance of entity extraction may vary depending on the document complexity and the chosen API parameters.
- This script is intended for demonstration purposes. You can adapt and extend it to suit your specific use case and integrate it into your applications.
---

*(binary file not displayed)*

---

`entity_processor.py`
import json
import mimetypes
from abc import ABC, abstractmethod
from typing import Dict

from google.cloud import documentai
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part


class EntityExtractor(ABC):
"""Abstract Base Class for entity extraction."""

@abstractmethod
def extract_entities(self) -> Dict:
"""Abstract method to extract entities."""


class DocumentAIEntityExtractor(EntityExtractor):
"""Class for Document AI entity extraction"""

def __init__(self, document: documentai.Document) -> None:
self.document = document

def extract_entities(self) -> Dict:
entities = {}
for entity in self.document.entities:
entities[entity.type_] = entity.mention_text
return entities


class ModelBasedEntityExtractor(EntityExtractor):
"""Class for Gemini entity extraction"""

def __init__(self, model_version: str, prompt: str, file_path: str) -> None:
self.config = GenerationConfig(
temperature=0.0,
top_p=0.8,
top_k=32,
candidate_count=1,
max_output_tokens=2048,
response_mime_type="application/json",
)
self.model = GenerativeModel(model_version, generation_config=self.config)
self.prompt = prompt
        mime_type = mimetypes.guess_type(file_path)[0]
        if mime_type != "application/pdf":
            raise ValueError("Only PDF files are supported, aborting")
self.file_path = file_path

def extract_entities(self) -> Dict:
pdf_file = Part.from_uri(self.file_path, mime_type="application/pdf")
contents = [pdf_file, self.prompt]
response = self.model.generate_content(contents)

cleaned_string = response.text.replace("```json\n", "").replace("\n```", "")
entities = json.loads(cleaned_string)
return entities
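The Gemini extractor above strips a Markdown code fence from the model response before parsing JSON. A minimal standalone sketch of that cleanup step (`parse_model_json` is a hypothetical helper, not part of this repository; the fence string is built from its character code only to keep this example readable):

```python
import json

# A triple-backtick Markdown fence, built from the backtick character code.
FENCE = "\x60" * 3


def parse_model_json(raw: str) -> dict:
    """Strip an optional Markdown JSON fence from a model response and parse it."""
    cleaned = raw.replace(FENCE + "json\n", "").replace("\n" + FENCE, "")
    return json.loads(cleaned)
```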
---

`extractor.py`
import re
from typing import Any, Optional

from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import InternalServerError, RetryError
from google.cloud import documentai, storage

from temp_file_uploader import TempFileUploader


class DocumentExtractor:
"""Abstract base class for document extraction."""

def __init__(
self,
project_id: str,
location: str,
processor_id: str,
processor_version_id: Optional[str] = None,
):
self.project_id = project_id
self.location = location
self.processor_id = processor_id
self.processor_version_id = processor_version_id
self.client = documentai.DocumentProcessorServiceClient(
client_options=ClientOptions(
api_endpoint=f"{location}-documentai.googleapis.com"
)
)
        self.processor_name = self._get_processor_name()

    def _get_processor_name(self) -> Any:
if self.processor_version_id:
return self.client.processor_version_path(
self.project_id,
self.location,
self.processor_id,
self.processor_version_id,
)
return self.client.processor_path(
self.project_id, self.location, self.processor_id
)

def process_document(self, file_path: str, mime_type: str) -> documentai.Document:
"""abstract function for document processing"""
raise NotImplementedError


class OnlineDocumentExtractor(DocumentExtractor):
"""
Processes documents using the online Document AI API.
"""

def process_document(
self, file_path: str, mime_type: str = "application/pdf"
) -> documentai.Document:
with open(file_path, "rb") as image:
image_content = image.read()

request = documentai.ProcessRequest(
name=self.processor_name,
raw_document=documentai.RawDocument(
content=image_content, mime_type=mime_type
),
)

result = self.client.process_document(request=request)
return result.document


class BatchDocumentExtractor(DocumentExtractor):
"""
Processes documents using the batch Document AI API.
"""

# pylint: disable=too-many-arguments
def __init__(
self,
project_id: str,
location: str,
processor_id: str,
gcs_output_uri: str,
gcs_temp_uri: str,
processor_version_id: str,
timeout: int = 400,
):
super().__init__(project_id, location, processor_id, processor_version_id)
self.gcs_output_uri = gcs_output_uri
self.timeout = timeout
self.storage_client = storage.Client()
self.temp_file_uploader = TempFileUploader(gcs_temp_uri)

def process_document(self, file_path: str, mime_type: str) -> documentai.Document:
gcs_input_uri = self.temp_file_uploader.upload_file(file_path)
document = self._process_document_batch(gcs_input_uri, mime_type)
self.temp_file_uploader.delete_file()
return document

# pylint: disable=too-many-locals
def _process_document_batch(
self, gcs_input_uri: str, mime_type: str
) -> documentai.Document:
gcs_document = documentai.GcsDocument(
gcs_uri=gcs_input_uri, mime_type=mime_type
)
gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)

gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
gcs_uri=self.gcs_output_uri
)
output_config = documentai.DocumentOutputConfig(
gcs_output_config=gcs_output_config
)

request = documentai.BatchProcessRequest(
name=self.processor_name,
input_documents=input_config,
document_output_config=output_config,
)

operation = self.client.batch_process_documents(request)
try:
print(f"Waiting for operation ({operation.operation.name}) to complete...")
operation.result(timeout=self.timeout)
except (RetryError, InternalServerError) as e:
print(e.message)

metadata = documentai.BatchProcessMetadata(operation.metadata)
if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
raise ValueError(f"Batch Process Failed: {metadata.state_message}")

# Retrieve the processed document from GCS
for process in list(metadata.individual_process_statuses):
matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
if not matches:
print(
"Could not parse output GCS destination:",
process.output_gcs_destination,
)
continue

output_bucket, output_prefix = matches.groups()
output_blobs = self.storage_client.list_blobs(
output_bucket, prefix=output_prefix
)
for blob in output_blobs:
if blob.content_type == "application/json":
print(f"Fetching {blob.name}")
return documentai.Document.from_json(
blob.download_as_bytes(), ignore_unknown_fields=True
)

raise FileNotFoundError("Processed document not found in GCS.")
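The batch extractor above locates its results by parsing `gs://bucket/prefix` output destinations with a regular expression. A standalone sketch of that parsing step (`split_gcs_uri` is a hypothetical helper mirroring the regex used in `_process_document_batch`):

```python
import re
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs://bucket/prefix URI into (bucket, prefix)."""
    matches = re.match(r"gs://(.*?)/(.*)", uri)
    if not matches:
        raise ValueError(f"Could not parse GCS destination: {uri}")
    bucket, prefix = matches.groups()
    return bucket, prefix
```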