
Error while inserting data with recipe text-extraction on some pdfs #79

Open
AdrienMtgn opened this issue Mar 15, 2024 · 0 comments

Hi!

I got some errors while trying to use the text-extraction recipe on a folder full of pdfs.

You can find a job diagnostic here

I found out that the error was raised while writing data to the output dataset, so I printed the extracted text to the logs to understand what was happening. Here is the faulty text:

DĂĚĂŵĞ͕DŽŶƐŝĞƵƌ͕
�ŽŶƚƌĂƚͬWŽůŝĐĞ͗Z
dLJƉĞĚĞĐŽŶƚƌĂƚ͗ϲ
�ĐŚĠĂŶĐĞƉƌŝŶĐŝƉĂůĞ͗ϲ
DŽƚŝĨĚĞƌĠƐŝůŝĂƚŝŽŶ͗ϲ
^ŽƵƐĐƌŝƚĂƵƉƌğƐĚĞ͗ϲ
^ŝŐŶĂƚƵƌĞĚƵĐůŝĞŶƚ;ƉƌĠĐĠĚĠĞĚĞůĂŵĞŶƚŝŽŶΗ�ŽŶƉŽƵƌŵĂŶĚĂƚΗͿϲ
ƉŽƵƌůΖĞŶƐĞŵďůĞĚĞƐƌŝƐƋƵĞƐƋƵΖŝůĐŽƵǀƌĞ͘ϲ
�ĂƚĞ͗ϲ
�͗ϲ
EΣĚĞƐĠĐƵƌŝƚĠƐŽĐŝĂůĞ͗ϯ
�ůΖĂƚƚĞŶƚŝŽŶĚĞ͗zŽŶŝǀĞƌƐ�ƐƐƵƌĂŶĐĞ
address
ϳϱϬϬ5W�Z/^
Echéance principale : Cette résiliation prendra effet à l'échéance principale
du contrat.
contract_number Le 04/08/2023
Je soussigné(e) name, n° de sécurité sociale ssnumber demeurant au address, pour
agir en mon nom et pour mon compte afin de résilier mes contrats.
more names and addresses
���������������������������������������������������

Some characters can't be printed, and I guess this is why they can't be inserted into the Dataiku dataset.
I can't share the file because it contains sensitive data, but I don't think it would be of much use anyway, as the garbled extracted text is a pdfium problem.
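
To check that guess on a given text, a small diagnostic can list which control characters it actually contains. This is only a minimal sketch using the standard library: `find_control_chars` is a hypothetical helper name, not something from the plugin, and it assumes tab, newline and carriage return are acceptable. It relies on the Unicode "Cc" category, which is somewhat broader than the regexp proposed further down (it also covers U+007F and U+0080 to U+009F).

```python
import unicodedata

# Whitespace control characters that are normally harmless in text.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def find_control_chars(text):
    """Return (index, code point) pairs for control characters in text.

    Hypothetical helper for illustration; not part of the plugin.
    """
    found = []
    for index, char in enumerate(text):
        # Unicode category "Cc" covers the C0 and C1 control characters.
        if unicodedata.category(char) == "Cc" and char not in ALLOWED_CONTROLS:
            found.append((index, "U+{:04X}".format(ord(char))))
    return found


# Made-up sample string, not taken from the real PDF.
sample = "Date:\x00 04/08/2023\n\x1f"
print(find_control_chars(sample))
```

Logging the code points rather than the raw characters keeps the log readable and shows exactly which characters the dataset writer might be rejecting.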

But I was wondering whether some kind of check could be added to detect non-printable characters in the extracted text, and either remove them or raise an error if any are found.

It could be added at lines 46 and 58 of the text-extraction/recipy.py file, right after the check for whether the text/chunk is empty.
Here is what I did in plugin development to get what I wanted (only for the non-chunked version):

### Needed for the regexp helpers below (logger is the one already defined in the recipe)
import re

### Maybe add this as a parameter somewhere
auto_remove_non_printable = False

### This regexp matches ASCII characters 0 to 8 and 14 to 31 (most of the control characters)
### maybe 14 and 15 should be kept, maybe others should be kept
non_valid_char_regexp = r'[\x00-\x08\x0E-\x1F]'

### functions could be added to some other files, maybe  python-lib/text-extraction/__init__.py
def has_non_printable_chars(text):
    non_printable_chars = re.findall(non_valid_char_regexp, text)
    if non_printable_chars:
        for char in non_printable_chars:
            # !r logs the escaped form (e.g. '\x00') instead of the raw control character
            logger.warning(f"Non-printable character {char!r} found in the text")
    return bool(non_printable_chars)

def remove_non_printable_chars(text):
    return re.sub(non_valid_char_regexp, '', text)


## And this is how I modified the function: 

            if not extracted_text.strip():
                logger.warning("Extracted text is empty")
            elif has_non_printable_chars(extracted_text):
                logger.info("Extracted_text : ## {} ##".format(extracted_text))
                if auto_remove_non_printable:
                    # re.sub returns a new string, so the result must be assigned back
                    extracted_text = remove_non_printable_chars(extracted_text)
                else:
                    raise ValueError("Non printable characters found in extracted text")

With this code added, the recipe doesn't fail and I can get the cleaned text if I want to. It doesn't seem like a really good idea though: you could end up with only part of the text and not realise it.
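
One way to make that risk visible would be to report how many characters were removed, so that a heavily damaged document stands out in the logs. The following is a rough sketch under that assumption: `clean_with_report` is a hypothetical name, not part of the plugin, and it uses the same character class as the regexp above.

```python
import re

# Same character class as the regexp above: code points 0-8 and 14-31.
NON_VALID_CHAR_REGEXP = re.compile(r"[\x00-\x08\x0E-\x1F]")


def clean_with_report(text):
    """Remove matched characters and return (cleaned_text, removed_count).

    Hypothetical helper for illustration; not part of the plugin.
    """
    # subn works like sub but also returns the number of replacements made.
    cleaned, removed_count = NON_VALID_CHAR_REGEXP.subn("", text)
    return cleaned, removed_count


# Made-up sample string, not taken from the real PDF.
cleaned, removed = clean_with_report("contrat\x00 n\x1f 42")
print(repr(cleaned), removed)
```

The caller could then log a warning, or fail, when the removed count exceeds some share of the original length. Where that threshold should sit is a judgment call left open here.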

Anyway, hope it helped,

Regards,

Adrien
