All classes are under active development and subject to non-backward compatible changes or removal in any future version. These are not subject to the Semantic Versioning model. This means that while you may use them, you may need to update your source code when upgrading to a newer version of this package.
This is a collection of sample workflows designed to showcase the usage of the Amazon Textract IDP CDK Constructs.
The samples use the AWS Cloud Development Kit (AWS CDK) and require Docker.
You can spin up an AWS Cloud9 instance, which has the AWS CDK and Docker already set up.
After cloning the repository, install the dependencies:
pip install -r requirements.txt
Then deploy a stack, for example:
cdk deploy DemoQueries
At the moment there are 12 stacks available:
- SimpleSyncWorkflow - very easy setup, calls Textract Sync, expects single page
- SimpleAsyncWorkflow - easy, but with Async and can take multi-page
- SimpleSyncAndAsyncWorkflow - both async and sync
- PaystubAndW2Spacy - information extraction with classification using a Spacy model.
- PaystubAndW2Comprehend - information extraction with classification using a Comprehend model.
- InsuranceStack - including A2I Construct call
- AnalyzeID - only calling AnalyzeID
- AnalyzeExpense - only calling AnalyzeExpense
- DemoQueries - workflow calling Textract with Queries for all docs
- DocumentSplitterWorkflow - Example of splitting a multi-page document, classifying each page and extracting information depending on the document type of the page
- LendingWorkflow - Example of using the Amazon Textract Analyze Lending API to extract information from mortgage documents and generate a CSV. Pages marked UNCLASSIFIED by the Analyze Lending API are processed in a separate branch, which extracts information and generates a CSV as well
- OpenSearchWorkflow - Example of indexing a large number of files into an OpenSearch service
Deploy using
cdk deploy DocumentSplitterWorkflow
This sample includes a new component called DocumentSplitter, which takes an input document of type TIFF or PDF, outputs each individual page to an S3 location and adds the list of filenames to an array.
That array is then used in a Step Functions Map state and processed in parallel. Each iteration classifies the page and, if it is a W2 or paystub, routes it to an extraction process. At the end all the W2s and paystubs are extracted and the map returns an array with the page numbers and their classification result.
When you view the execution in the AWS Management Console under Step Functions, you may not see the correct rendering in the "Graph inspector" while the "Execution event history" is still loading, indicated by the progress circle spinning next to the "Execution event history" text. Wait for it to finish.
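The data flow described above can be sketched in plain Python. This is only an illustration of the shape of the data: the function names, the key layout and the stand-in classifier are assumptions, not the DocumentSplitter's actual output format.

```python
from typing import Callable, Dict, List, Union


def page_output_keys(output_prefix: str, document_name: str, num_pages: int) -> List[str]:
    """Return one hypothetical S3 key per page of a split document."""
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    prefix = output_prefix.rstrip("/")
    # Page numbers start at 1 in this sketch (an assumption).
    return [f"{prefix}/{document_name}/page-{n}" for n in range(1, num_pages + 1)]


def classify_pages(
    keys: List[str], classifier: Callable[[str], str]
) -> List[Dict[str, Union[int, str]]]:
    """Mimic the Map state: classify each page and return page number plus class."""
    return [
        {"page": number, "classification": classifier(key)}
        for number, key in enumerate(keys, start=1)
    ]


def fake_classifier(key: str) -> str:
    """Stand-in classifier for the sketch: labels the first page W2, the rest PAYSTUB."""
    return "W2" if key.endswith("page-1") else "PAYSTUB"


keys = page_output_keys("s3://example-bucket/split/", "sample-doc", 3)
print(classify_pages(keys, fake_classifier))
```

In the deployed workflow the Map state runs the iterations in parallel; the list comprehension here only mirrors the input and output shapes.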
We are planning to have a better UI experience in the future.
Deploy using
cdk deploy PaystubAndW2Spacy
This sample showcases a number of components, including classification using Comprehend and routing based on the document type, followed by configuration based on the document types.
It is called Paystub and W2, because those are the ones configured in the RouteDocType and the DemoIDP-Configurator.
At the moment it handles single-page documents only. For multi-page documents, check the Document Splitter Workflow.
Check the API definition for the Constructs at: https://github.com/aws-samples/amazon-textract-idp-cdk-constructs/blob/main/API.md
From top to bottom:
- DemoIDP-Decider - Takes in the manifest, usually a link to a file on S3. Look at the bottom and the lambda_start_step_function code to see how the workflow is triggered.
- NumberOfPagesChoice - Choice state failing the flow if the number of pages > 1
- Randomize and Randomize2 - Just a function to generate a random number 0-100, so we can route between sync and async for demo purposes. (Will increase throughput as well, but that is not the main purpose, it is just to demo both async and sync in one flow)
- TextractAsync - calls Textract asynchronously through the Start* APIs. When features are passed, it calls AnalyzeDocument, otherwise DetectText. The flow is configured to first call with text only. The step abstracts calling Textract, waiting for the SNS notification and writing the output to OutputConfig
- TextractSync - similar to TextractAsync, but calling the Textract sync API (DetectText, AnalyzeDocument)
- TextractAsyncToJSON2 - TextractAsync will store paginated JSON files to OutputConfig. This step combines them to one JSON.
- GenerateText - Takes the Textract JSON and outputs all LINES into a text file.
- Classification - Uses the generated text file from GenerateText and sends to the Comprehend endpoint defined by the ARN.
- RouteDocType - routes based on the classification result, aka the document type. Unless it is ID, Expense or AWS_OTHER or not known, we send to further Textract processing
- DemoIDP-Configurator - based on the document type, pulls the configuration from DynamoDB that defines what to call Textract with (e.g. specific queries and/or forms and/or tables)
- Then we repeat essentially the calls to Textract like at the top, but this time we do have the configuration set with queries, forms and/or tables
- GenerateCsvTask - outputs one CSV built from the queries and forms information, in a structure that also includes confidence scores and bounding box information
- CsvToAurora - sends the generated CSV to a Serverless Aurora RDS Cluster
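As an illustration of the GenerateText step in the list above, here is a minimal sketch that collects the text of all LINE blocks from a Textract JSON response. It relies only on the `Blocks`, `BlockType` and `Text` fields of the Textract response format; the function name and the hand-written sample are assumptions, and the construct's actual implementation may differ.

```python
from typing import Any, Dict, List


def lines_from_textract(response: Dict[str, Any]) -> List[str]:
    """Collect the text of every LINE block, in the order the blocks appear."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# A tiny hand-written response for illustration (not real Textract output).
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Employee Name"},
        {"BlockType": "WORD", "Text": "Employee"},
        {"BlockType": "WORD", "Text": "Name"},
        {"BlockType": "LINE", "Text": "Gross Pay 1,000.00"},
    ]
}

# Join the lines into the plain text that a classifier could consume.
text = "\n".join(lines_from_textract(sample))
print(text)
```

WORD blocks are skipped on purpose, since each LINE block already carries the concatenated text of its words.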
The Aurora RDS Cluster runs in a private VPC. To reach it, check the commented section for EC2 in the sample stack. Put in your settings for Security Groups, AMI and keypair. (We'll make it easier in the future)
Simple example of a flow only calling synchronous Textract for DetectText.
Deploy using
cdk deploy PaystubAndW2Comprehend
This sample showcases a number of components, including classification using Comprehend and routing based on the document type, followed by configuration based on the document types. It is called Paystub and W2, because those are the ones configured in the RouteDocType and the DemoIDP-Configurator.
At the moment it handles single-page documents only. For multi-page documents, check the Document Splitter Workflow.
Here is the flow:
Deploy using
cdk deploy SimpleAsyncWorkflow
Very basic workflow to demonstrate asynchronous processing. Out of the box, it only calls DetectText, generating OCR output. If you are interested in running specific queries or features like forms or tables on a set of documents, look at DemoQueries.
Deploy using
cdk deploy DemoQueries
Basic workflow to demonstrate how Sync and Async can be routed based on numberOfPages and numberOfQueries and how the workflow can be triggered with queries. It calls AnalyzeDocument with the 2 sample queries; modify them to your own needs. The queries are configured in lambda/start_queries/app/start_execution.py, which kicks off the Step Functions workflow. The GenerateCsvTask outputs one CSV file to S3 with key/value pairs, confidence scores and bounding box information based on the forms and queries output.
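The two ideas in this workflow, passing queries and routing between Sync and Async, can be sketched as follows. The `QueriesConfig` shape (`Queries` with `Text` and optional `Alias`) follows the Textract AnalyzeDocument API; the function names, the sample queries and the thresholds are assumptions for illustration and are not taken from start_execution.py or the stack.

```python
from typing import Dict, List, Optional, Tuple

# Assumed limits for illustration only; check the current Amazon Textract quotas.
MAX_SYNC_PAGES = 1
MAX_SYNC_QUERIES = 15


def build_queries_config(
    queries: List[Tuple[str, Optional[str]]]
) -> Dict[str, List[Dict[str, str]]]:
    """Build a QueriesConfig structure from (text, alias) pairs."""
    entries = []
    for text, alias in queries:
        entry = {"Text": text}
        if alias:
            entry["Alias"] = alias
        entries.append(entry)
    return {"Queries": entries}


def choose_api(number_of_pages: int, number_of_queries: int) -> str:
    """Return "sync" when the document fits the assumed sync limits, else "async"."""
    if number_of_pages <= MAX_SYNC_PAGES and number_of_queries <= MAX_SYNC_QUERIES:
        return "sync"
    return "async"


# Hypothetical sample queries, not the ones shipped with the sample.
sample_queries = [
    ("What is the pay period start date?", "PAY_PERIOD_START"),
    ("What is the net pay?", None),
]
queries_config = build_queries_config(sample_queries)
print(queries_config)
print(choose_api(number_of_pages=1, number_of_queries=len(sample_queries)))
```

A dictionary like `queries_config` is what AnalyzeDocument expects as `QueriesConfig` when `FeatureTypes` includes `QUERIES`.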
Deploy using
cdk deploy InsuranceStack
Simple flow including A2I.
Deploy using
cdk deploy SimpleAsyncWorkflow
Simple flow calling the Textract AnalyzeID API.
Deploy using
cdk deploy AnalyzeID
Simple flow calling the Textract AnalyzeExpense API.
Deploy using
cdk deploy AnalyzeExpense
Example of using the Amazon Textract Analyze Lending API to extract information from mortgage documents and generate a CSV. Pages that were marked UNCLASSIFIED by the Analyze Lending API are processed in a separate branch, which extracts information and generates a CSV as well.
Deploy using
cdk deploy LendingWorkflow
The workflow uses a custom classification model to identify the HOMEOWNERS_INSURANCE_APPLICATION and CONTACT_FORM. The classifier is trained only on the sample images and is for demo purposes only.
aws s3 cp s3://amazon-textract-public-content/idp-cdk-samples/lending_console_demo_with_contacts.pdf $(aws cloudformation list-exports --query 'Exports[?Name==`LendingWorkflow-DocumentUploadLocation`].Value' --output text)
Then open the Step Functions flow:
aws cloudformation list-exports --query 'Exports[?Name==`LendingWorkflow-StepFunctionFlowLink`].Value' --output text
This is an example of how to populate an OpenSearch service with data from documents. The index pattern includes:
- content -> the text from the page
- page -> the number of the page in the document
- uri -> the source file used for indexing
- id -> <origin_document_name><page_number> - this means that processing a file with the same name again will overwrite the content
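To show how the fields above map onto an OpenSearch bulk request, here is a minimal sketch that builds the newline-delimited action and document lines. The function name, the index name and the sample values are hypothetical, and the id is built by plain concatenation as written in the list above; the sample's own bulk file generator may differ.

```python
import json
from typing import List


def bulk_lines(index_name: str, origin_file_name: str, uri: str, pages: List[str]) -> List[str]:
    """Return alternating action/document lines for the OpenSearch bulk API."""
    lines = []
    for page, content in enumerate(pages, start=1):
        # Same file name and page number give the same id, so re-indexing overwrites.
        doc_id = f"{origin_file_name}{page}"
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({"content": content, "page": page, "uri": uri}))
    return lines


lines = bulk_lines(
    index_name="documents",  # hypothetical index name
    origin_file_name="sample.pdf",
    uri="s3://example-bucket/sample.pdf",
    pages=["text of page one", "text of page two"],
)
# The bulk API expects newline-delimited JSON terminated by a final newline.
bulk_body = "\n".join(lines) + "\n"
print(bulk_body)
```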
Deploy using
cdk deploy OpenSearchWorkflow
The workflow first splits the document into chunks of at most 3000 pages, because that is the limit of the Textract service for asynchronous processing. Each chunk is then sent to StartDocumentAnalysis, extracting the OCR information from the page. The metadata added to the context of the Step Functions workflow includes information required for creating the OpenSearch bulk import file, including ORIGIN_FILE_NAME and START_PAGE_NUMBER.
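The chunking step can be sketched as a small page-range calculation. This is an illustration only: the function name and the END_PAGE_NUMBER key are hypothetical, and the sketch assumes START_PAGE_NUMBER is 1-based, which the sample may handle differently.

```python
from typing import Dict, List

MAX_PAGES_PER_CHUNK = 3000  # chunk size stated for this workflow


def page_chunks(total_pages: int, max_pages: int = MAX_PAGES_PER_CHUNK) -> List[Dict[str, int]]:
    """Split a document of total_pages into consecutive page ranges."""
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be at least 1")
    chunks = []
    for start in range(1, total_pages + 1, max_pages):
        # The last chunk may be shorter than max_pages.
        end = min(start + max_pages - 1, total_pages)
        chunks.append({"START_PAGE_NUMBER": start, "END_PAGE_NUMBER": end})
    return chunks


print(page_chunks(7000))
```

Carrying the start page with each chunk is what lets the page numbers in the index refer to the original document rather than to the chunk.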
Take a look at the sample workflows. Copy one as a starting point and go from there.