Performance update: match keywords on extracted_str #470
I think it's acceptable. If some invoices require more complex matching, one can always use a regex in the keywords (just like in com.flipkart.WSRetail.yml, as modified in this pull request).
@m3nu: do you have an opinion on this?
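To illustrate the regex fallback suggested above, here is a minimal sketch of keyword matching in which each keyword is treated as a regular expression and searched for in the raw extracted text. The function name `matches_keywords`, the sample text, and the sample keywords are hypothetical; they are not taken from invoice2data or from com.flipkart.WSRetail.yml.

```python
import re


def matches_keywords(extracted_str, keywords):
    """Return True if every keyword (treated as a regex) is found in the text."""
    return all(re.search(keyword, extracted_str) for keyword in keywords)


# Hypothetical raw text as it might come out of an input module,
# with irregular whitespace that has not been cleaned up yet.
extracted_str = "WS  Retail Services Pvt. Ltd.\nTax Invoice No: 12345"

# A plain keyword is sensitive to the exact spacing in the raw text.
plain = ["WS Retail"]
# A regex keyword can tolerate variable whitespace.
flexible = [r"WS\s+Retail", r"Tax\s+Invoice"]

print(matches_keywords(extracted_str, plain))
print(matches_keywords(extracted_str, flexible))
```

The plain keyword misses here because the raw text contains two spaces, while the regex variant still matches. That is the kind of case where a template might need a regex once matching happens before any clean-up.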
Ping @m3nu
Should be OK to match an invoice *before* doing the optimizations for extraction. Maybe add a diagram of the processing steps at some point, so the flow is quicker to understand and it is easier to get started?
I am thinking of something like this:

flowchart LR
InvoiceFile[fa:fa-file-invoice Invoicefile\n\npdf\nimage\ntext] --> Input-module(Input Module\n\npdftotext\ntext\npdfminer\npdfplumber\ntesseract\ngvision)
Input-module --> |Extracted Text| C{keyword\nmatching}
Invoice-Templates[fa:fa-file-lines Invoice Templates] --> C{keyword\nmatching}
C --> |Extracted Text + fa:fa-file-circle-check Template| E(Template Processing\n apply options from template\nremove accents, replaces etc...)
E --> |Optimized String|Plugins&Parsers(Call plugins + parsers)
subgraph Plugins&Parsers
direction BT
tables[fa:fa-table tables] ~~~ lines[fa:fa-grip-lines lines]
lines ~~~ regex[fa:fa-code regex]
regex ~~~ static[fa:fa-check static]
end
Plugins&Parsers --> |output| result[result\nfa:fa-file-csv,\njson,\nXML]
click Invoice-Templates https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md
click result https://github.com/invoice-x/invoice2data#usage
click Input-module https://github.com/invoice-x/invoice2data#installation-of-input-modules
click E https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#options
click tables https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#tables
click lines https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#lines
click regex https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#regex
click static https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#parser-static
Will make it in a separate PR so we can discuss it there.
Before this PR, an optimized string was generated for each individual template, whether or not that template ended up matching.
This hurts performance, especially when there are many templates.
With this PR, keywords are matched on extracted_str first, so an optimized_str is generated only for the matched template instead of for all templates.
On my local system, I measured a 2x performance increase.
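The change in ordering described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual invoice2data code: the `Template` class, its `optimize` and `matches` methods, and the two `find_template_*` functions are hypothetical names, and the clean-up step is reduced to a list of regex replacements. A call counter shows how often the clean-up work runs in each variant.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical stand-in for an invoice template (not the invoice2data class)."""
    name: str
    keywords: list
    replace: list = field(default_factory=list)  # (pattern, replacement) pairs
    optimize_calls: int = 0

    def optimize(self, extracted_str):
        """Apply this template's (assumed) clean-up options to the text."""
        self.optimize_calls += 1
        optimized_str = extracted_str
        for pattern, replacement in self.replace:
            optimized_str = re.sub(pattern, replacement, optimized_str)
        return optimized_str

    def matches(self, text):
        """True if every keyword, treated as a regex, occurs in the text."""
        return all(re.search(keyword, text) for keyword in self.keywords)


def find_template_before(templates, extracted_str):
    """Old order: optimize the text for each template, then match keywords."""
    for template in templates:
        optimized_str = template.optimize(extracted_str)
        if template.matches(optimized_str):
            return template, optimized_str
    return None, None


def find_template_after(templates, extracted_str):
    """New order: match keywords on the raw text, optimize only the match."""
    for template in templates:
        if template.matches(extracted_str):
            return template, template.optimize(extracted_str)
    return None, None


def make_templates():
    # 100 hypothetical templates; only the last one matches the sample text.
    return [Template(f"vendor{i}", [f"Vendor {i} Ltd"]) for i in range(100)]


extracted_str = "Vendor 99 Ltd\nInvoice 42"

old_templates = make_templates()
find_template_before(old_templates, extracted_str)
new_templates = make_templates()
find_template_after(new_templates, extracted_str)

print(sum(t.optimize_calls for t in old_templates))
print(sum(t.optimize_calls for t in new_templates))
```

In this sketch the old order runs the clean-up once per template tried, while the new order runs it once in total. The trade-off, as raised in the discussion, is that keywords now have to match the raw extracted text rather than the cleaned-up text.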