Skip to content

Latest commit

 

History

History
170 lines (131 loc) · 5.7 KB

Examples.md

File metadata and controls

170 lines (131 loc) · 5.7 KB

Example

DocumentLab is useful in cases when we have a document available to us only in the form of a static PDF/photo/scanned and we're only able to access the information in the file as a bitmap. This is the case with a lot of communications formats even still today since digitalization and standardization of these things only go so far when we've got countless of different systems doing the same things.

For example,

  • Invoices
  • Purchase orders
  • Annual reports
  • Sales reports
  • ... So many things. What can you think of?

The common thing across these kind of documents across a wide domain of origins is not "where things are placed on the page" or in "which order things appear" but actually contextual information. For example, receiver information on invoices across 99% of invoices can be defined within a manageable set of patterns. DocumentLab will not care if you pass in an invoice from sender A, B or C even though they may look entirely different each one of them, we only care about matching distinct patterns common across those documents.

Image

The following image is used in this example.

Script

Let's assume the data we want to extract from the image above is the following,

  • The email in the top row
  • The web address in the second row
  • The entire contents of the third column
  • All dates in the document

We can solve that by defining the following script, note that // is used for comments


// Start by finding the text "Label1" and then moving right until we capture an email
// We're expecting the pattern to include the amount in the middle but we don't care about capturing that one
Label1: 
Text(Label1) Right Amount Right [Email];

// Start by finding any text and capture the WebAddress to the right of it
// Only one row in the image will match this pattern
Label2: 
Text Right [WebAddress];

// Specify a pattern with named captures, result will be a json object 
LastColumn: 
'EmailAddress': [Email] Down 'WebAddress': [WebAddress] Down 'Number': [Number] Down 'Text': [Text];

// Capture any date in the document
Dates: 
Any [Date];

The following predicates are valid from the text analysis classifications,

  • AmountOrNumber
  • Amount
  • Date
  • Email
  • InvoiceNumber
  • Letters
  • Number
  • PageNumber
  • Percentage
  • Text
  • WebAddress
  • Definitions by custom files under Data\Context (see: Custom text type definitions)
  • Custom defined text types configured in Data\TextTypeDefinitions.json

Output

This is the output generated by DocumentLab when the image and script above is applied

{
  "Label1": "[email protected]",
  "Label2": "http://www.website.com/",
  "LastColumn": {
    "EmailAddress": "[email protected]",
    "WebAddress": "http://www.website.com/",
    "Number": "50",
    "Text": "Example"
  },
  "Dates": [
    "2018-01-01",
    "2019-12-26"
  ]
}

Using the library

Given that we're using the script and have the output structure specified above. If you want to map the result to a C# object you can define a corresponding one such as the following

public class LastColumn 
{
  public string Email { get; set; }
  public string WebAddress { get; set; }
  public int Number { get; set; }
  public string Text { get; set; }
}

public class Result 
{
  public string Label1 { get; set; }
  public string Label2 { get; set; }
  public LastColumn LastColumn { get; set; }
  public DateTime[] Dates { get; set; }
}

We can then map the output json result to a C# object with a Json converter, for instance the following using Newtonsoft,

string imagePath = "Examples\Example1.png";

string script = @"

  // Start by finding the text "Label1" and then moving right until we capture an email 
  // We're expecting the pattern to include the amount in the middle but we don't care about capturing that one
  Label1: 
  Text(Label1) Right Amount Right [Email];

  // Start by finding any text and capture the WebAddress to the right of it
  // Only one row in the image will match this pattern
  Label2: 
  Text Right [WebAddress];

  // Specify a pattern with named captures, result will be a json object  
  LastColumn: 
  'EmailAddress': [Email] Down 'WebAddress': [WebAddress] Down 'Number': [Number] Down 'Text': [Text];

  // Capture any date in the document
  Dates: 
  Any [Date];

";

// Instantiate DocumentLab 
var documentLab = new DocumentInterpreter();

// Pass in our script as a string and the document image as a bitmap, we can use System.Drawing to handle file loading
var interpretedJsonResult = documentLab.InterpretToJson(script, (Bitmap)Image.FromFile(imagePath));

/*
Write our result to console out, we'll see the following

{
  "Label1": "[email protected]",
  "Label2": "http://www.website.com/",
  "LastColumn": {
    "EmailAddress": "[email protected]",
    "WebAddress": "http://www.website.com/",
    "Number": "50",
    "Text": "Example"
  },
  "Dates": [
    "2018-01-01",
    "2019-12-26"
  ]
}
*/
Console.WriteLine(interpreterJsonResult);

// Convert the json string to defined object structure
var asObject = JsonConvert.DeserializeObject<Result>(interpretedJsonResult)

// ... We've now got an object to work with!

The base example without comments or example console out part has only three statements in it that actually perform any operation, that is, instantiation of DocumentLab, calling the InterpretToJson and the DeserializeObject method calls.