Issues with other templates on [Windows] #566
Comments
Hi, did you verify that your installation of invoice2data is running properly by testing it on one of the example files?
Yes, it is running properly.
Your invoked command seems OK. Some debugging steps: Is your PDF file text-based, or does it need OCR?
My PDF file is text-based. In the in.invoicedemo.yml file I have worked out the regular expressions and keywords according to my PDF.
When you run invoice2data on the PDF file with the --debug flag, do you see the contents of the file in your logger/terminal?
No, I cannot see the contents of the file. WARNING:ocrmypdf._pipeline: This PDF is marked as a Tagged PDF. This often indicates that the PDF was generated from an office document and does not need OCR. PDF pages processed by OCRmyPDF may not be tagged correctly.
The result from pdftotext is empty, so you're likely running into dependency issues with pdftotext / poppler-utils on Windows. There is an open PR to enhance support, but its tests are failing. I'm a Linux user, so I cannot give you a lot of support on Windows.
But the existing templates are working fine. There is one file:
Just creating the templates should be fine. Let's check if the template you have created has been loaded. Do you see your template in the list of loaded templates?
What does "loaded templates" mean? I can see this one: D:\invoice2data-master\src\invoice2data\extract\templates\in\in.demovoice.yml. But I cannot see it here: Why? Is there anything else I need to do?
You need to check that the template you have created is properly loaded. Check that you're pointing to the correct folder. You should see your template in that list.
Even after I deleted my templates, it is still parsing the existing PDF.
You have to verify that your template is being loaded.
Are you pointing to the correct folder? Yes. But I cannot understand why it still works after I deleted the existing templates for testing. How is that possible?
"But I cannot understand why it still works after I deleted the existing templates for testing. How is that possible?" That sounds like a folder issue. Maybe it is installed in different versions or locations. What path is shown in the output? Is that the same location as where you were deleting the files?
My template location path is:
No, because your standard templates are loaded from the directory in the screenshot. For easy testing, go to that location and delete the standard templates there, or add your own custom ones there.
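To check which installation is actually in use, something like the following may help. It is a minimal sketch: `package_dir` and `list_templates` are hypothetical helper names, and the `extract/templates` subfolder is an assumption taken from the paths quoted in this thread.

```python
import importlib.util
from pathlib import Path


def package_dir(package_name):
    """Return the directory a package is imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def list_templates(folder):
    """Recursively collect every .yml file under folder, sorted by path."""
    return sorted(Path(folder).rglob("*.yml"))


if __name__ == "__main__":
    pkg = package_dir("invoice2data")
    if pkg is None:
        print("invoice2data is not importable from this environment")
    else:
        # Assumption: built-in templates live under extract/templates,
        # as in the paths quoted in this thread.
        template_dir = pkg / "extract" / "templates"
        print("Templates are loaded from:", template_dir)
        for path in list_templates(template_dir):
            print(path)
```

If the printed folder differs from the one where the template was added or deleted, the two locations belong to different copies of the package.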
Steps to add new template
To add a new template, we recommend this workflow:
1. Copy existing template to new file
Find a template that is roughly similar to what you need and copy it to
a new file. It's good practice to use reverse domain notation, e.g.
country.company.division.language.yml or
fr.mobile.enterprise.french.yml. Language is not always needed.
Template folders are searched recursively for files ending in .yml.
2. Change invoice issuer
Just used in the output. Best to use the company name.
3. Set keyword
Look at the invoice and find the best identifying string. Tax number +
company name are good options. Remember, all keywords need to be found
for the template to be used.
Keywords are compared before processing the extracted text.
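The "all keywords need to be found" rule can be illustrated with a simplified sketch. This is not the library's actual code: `template_matches` is a hypothetical helper, treating each keyword as a regular expression is an assumption, and the invoice text is made up.

```python
import re


def template_matches(keywords, text):
    """Return True only if every keyword pattern is found somewhere in text."""
    return all(re.search(keyword, text) for keyword in keywords)


# Made-up invoice text for illustration only.
invoice_text = "ACME Corp\nTax No: 29ABCDE1234F1Z5\nInvoice No: 42"

# Both keywords occur, so the template would be selected.
print(template_matches(["ACME Corp", "29ABCDE1234F1Z5"], invoice_text))

# One keyword is missing, so the template would be skipped.
print(template_matches(["ACME Corp", "Other Company"], invoice_text))
```

A single keyword that does not appear in the extracted text is enough to rule the template out, which is one way to end up with a "No template" error.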
4. First test run
Now we're ready to see how far we are off. Run invoice2data with the
following debug command to see if your keywords match and how much work
is needed for dates, etc.
invoice2data --template-folder tpl --debug invoice-XXX.pdf
This test run shows you how the program will "see" the text in the
invoice. Parsing PDFs is sometimes a bit unpredictable. Also make sure
your template is used. You should already receive some data from static
fields or currencies.
5. Add regular expressions
Now you can use the debugging text to add regex fields for the
information you need. It's a good idea to copy parts of the text
directly from the debug output and then replace the dynamic parts with
regex. Keep in mind that some characters need escaping. To test, re-run
the above command.
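As an illustration of copying a line from the debug output and replacing the dynamic part with a regex, here is a small sketch. The sample line, the invoice number format, and the helper name `extract_invoice_number` are all made up for the example.

```python
import re

# Literal text copied from the (hypothetical) debug output. re.escape makes
# characters such as '.' match themselves instead of acting as regex operators.
LITERAL_PREFIX = re.escape("Invoice No.:")

# The dynamic part (the invoice number itself) becomes a capture group.
PATTERN = LITERAL_PREFIX + r"\s+([A-Z]+-\d+/\d+)"


def extract_invoice_number(line):
    """Return the captured invoice number, or None if the line does not match."""
    match = re.search(PATTERN, line)
    return match.group(1) if match else None


print(extract_invoice_number("Invoice No.: INV-2023/0042 (original)"))
```

Testing the pattern against the copied text in isolation like this can be quicker than re-running the whole extraction after each change.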
date field: First capture the date. Then see if dateparser
handles it correctly. If not, add your format or language under
options.
amount: Capture the number without currency code. If you expect
high amounts, replace the thousand separator. Currently we don't
parse numbers via locales (TODO).
6. Done
Now you're ready to commit and push your template, so others get a
chance to use and improve it.
My Question:
I have added a new template in YAML with matching regular expressions, but when I parse the invoice PDF it is not parsed and an error is shown.
Error message:
(invoice2data-env) D:\invoice2data-master\src\invoice2data>invoice2data --output-format csv --output-name output/invoices.csv input/demoinvoice.pdf
INFO:invoice2data.extract.loader: Loaded 189 templates from D:\invoice2data-master\invoice2data-env\Lib\site-packages\invoice2data\extract\templates
INFO:pikepdf._core: pikepdf C++ to Python logger bridge initialized
Scanning contents ---------------------------------------- 100% 1/1 0:00:00
WARNING:ocrmypdf._pipeline: This PDF is marked as a Tagged PDF. This often indicates that the PDF was generated from an office document and does not need OCR. PDF pages processed by OCRmyPDF may not be tagged correctly.
OCR ---------------------------------------- 0% 0/1 -:--:--
WARNING:ocrmypdf._pipeline: Weighted average image DPI is 152.1, max DPI is 247.7. The discrepancy may indicate a high detail region on this page, but could also indicate a problem with the input PDF file. Page image will be rendered at 400.0 DPI.
OCR ---------------------------------------- 100% 1/1 0:00:00
Linearizing ---------------------------------------- 100% 100/100 0:00:00
INFO:invoice2data.input.ocrmypdf: Text extraction made with ocrmypdf
ERROR:root: No template for input/demoinvoice.pdf