Performance update: match keywords on extracted_str #470

Merged
merged 1 commit into from
Mar 18, 2023

Conversation

bosd
Collaborator

@bosd bosd commented Feb 12, 2023

Before this PR, an optimized string was generated for every individual template.
This hurts performance, especially when there are many templates.

With this PR, keywords are matched on extracted_str, and an optimized_str is generated only for the matched template instead of for all templates. This greatly increases performance.

On my local system, I measured a 2x performance increase.

⚠️ Warning: this performance increase comes at a cost. It might break some templates, for example templates whose keywords only matched the optimized string and not the raw extracted text.
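
To illustrate the idea, here is a minimal sketch, not the actual invoice2data code. The `Template` class, its `matches` and `prepare` methods, the option names and the `prepare_calls` counter are all hypothetical stand-ins. It only shows the shape of the change: keywords are tested on the raw extracted text, and the more expensive per-template string preparation runs once, for the matched template only.

```python
import re
import unicodedata
from dataclasses import dataclass, field


@dataclass
class Template:
    # Hypothetical stand-in for an invoice template; not the library's class.
    name: str
    keywords: list
    options: dict = field(default_factory=dict)
    prepare_calls: int = 0  # counts how often the expensive step runs

    def matches(self, extracted_str):
        # Keywords are treated as regular expressions tested on the raw text.
        return all(re.search(k, extracted_str) for k in self.keywords)

    def prepare(self, extracted_str):
        # Build the optimized string by applying this template's options.
        self.prepare_calls += 1
        text = extracted_str
        if self.options.get("remove_accents"):
            text = "".join(
                c for c in unicodedata.normalize("NFKD", text)
                if not unicodedata.combining(c)
            )
        for old, new in self.options.get("replace", []):
            text = text.replace(old, new)
        return text


def extract(extracted_str, templates):
    # Match first on the raw text, then prepare only the matched template.
    for t in templates:
        if t.matches(extracted_str):
            return t.name, t.prepare(extracted_str)
    return None, None


templates = [
    Template("a", ["Foo Corp"]),
    Template("b", ["Café Ltd"], {"remove_accents": True}),
]
name, optimized_str = extract("Invoice from Café Ltd", templates)
```

The sketch also shows the cost mentioned in the warning: in this model a keyword written as `Cafe Ltd` (without the accent) would no longer match, because the accent is only removed after a template has been selected.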

Collaborator

@rmilecki rmilecki left a comment


I think it's acceptable. If some invoices require more complex matching, one can always use a regex in the keywords (just like in com.flipkart.WSRetail.yml as modified in this pull request).
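
As a rough illustration of that workaround (a sketch only: the sample text, the patterns and the `keywords_match` helper are made up here and do not reproduce the contents of the modified template), a keyword written as a regex can tolerate irregular spacing in the raw extracted text where a plain keyword would fail:

```python
import re


def keywords_match(keywords, extracted_str):
    # Hypothetical helper: every keyword is applied as a regular expression.
    return all(re.search(k, extracted_str) for k in keywords)


# Made-up raw text with irregular spacing, as text extraction often produces.
raw_text = "Sold By:  WS  Retail Services"

plain_result = keywords_match(["WS Retail"], raw_text)     # single space only
regex_result = keywords_match([r"WS\s+Retail"], raw_text)  # any whitespace run
```

Here `plain_result` is `False` and `regex_result` is `True`, since `\s+` accepts the doubled space that the plain keyword does not.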

@m3nu: do you have an opinion on this?

@bosd bosd requested a review from m3nu February 20, 2023 08:16
@rmilecki
Collaborator

Ping @m3nu

@m3nu
Collaborator

m3nu commented Mar 11, 2023 via email

@bosd
Collaborator Author

bosd commented Mar 16, 2023

Maybe add a graph of the steps at some point to make it quicker to understand and start?

I am thinking of something like this:

flowchart LR
    InvoiceFile[fa:fa-file-invoice Invoicefile\n\npdf\nimage\ntext] --> Input-module(Input Module\n\npdftotext\ntext\npdfminer\npdfplumber\ntesseract\ngvision)
    Input-module --> |Extracted Text| C{keyword\nmatching}
    Invoice-Templates[fa:fa-file-lines Invoice Templates] --> C{keyword\nmatching}
    C --> |Extracted Text + fa:fa-file-circle-check Template| E(Template Processing\n apply options from template\nremove accents, replaces etc...)
    E --> |Optimized String|Plugins&Parsers(Call plugins + parsers)
    subgraph Plugins&Parsers
      direction BT
        tables[fa:fa-table tables] ~~~ lines[fa:fa-grip-lines lines]
        lines ~~~ regex[fa:fa-code regex]
        regex ~~~ static[fa:fa-check static]

    end
    Plugins&Parsers --> |output| result[result\nfa:fa-file-csv,\njson,\nXML]

 click Invoice-Templates https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md
 click result https://github.com/invoice-x/invoice2data#usage
 click Input-module https://github.com/invoice-x/invoice2data#installation-of-input-modules
 click E https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#options
 click tables https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#tables
 click lines https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#lines
 click regex https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#regex
 click static https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#parser-static


Will make it in a separate PR so we can discuss it there.

This greatly increases performance as an optimized_str is only generated on the matched template instead of on all templates.
@bosd bosd merged commit 29a26ba into invoice-x:master Mar 18, 2023
@bosd bosd deleted the perf-keyword-match branch March 18, 2023 22:57
3 participants