Error while inserting data with recipe text-extraction on some pdfs #79

AdrienMtgn · 2024-03-15T14:24:37Z

Hi!

I got some errors while trying to use the text-extraction recipe on a folder full of pdfs.

You can find a job diagnostic here

I found out that the error was raised while writing data to the output dataset, so I printed the extracted text to the logs to understand what was happening. Here is the faulty text:

DĂĚĂŵĞ͕DŽŶƐŝĞƵƌ͕
�ŽŶƚƌĂƚͬWŽůŝĐĞ͗Z
dǇƉĞĚĞĐŽŶƚƌĂƚ͗ϲ
�ĐŚĠĂŶĐĞƉƌŝŶĐŝƉĂůĞ͗ϲ
DŽƚŝĨĚĞƌĠƐŝůŝĂƚŝŽŶ͗ϲ
^ŽƵƐĐƌŝƚĂƵƉƌğƐĚĞ͗ϲ
^ŝŐŶĂƚƵƌĞĚƵĐůŝĞŶƚ;ƉƌĠĐĠĚĠĞĚĞůĂŵĞŶƚŝŽŶΗ�ŽŶƉŽƵƌŵĂŶĚĂƚΗͿϲ
ƉŽƵƌůΖĞŶƐĞŵďůĞĚĞƐƌŝƐƋƵĞƐƋƵΖŝůĐŽƵǀƌĞ͘ϲ
�ĂƚĞ͗ϲ
�͗ϲ
EΣĚĞƐĠĐƵƌŝƚĠƐŽĐŝĂůĞ͗ϯ
�ůΖĂƚƚĞŶƚŝŽŶĚĞ͗zŽŶŝǀĞƌƐ�ƐƐƵƌĂŶĐĞ
address
ϳϱϬϬ5W�Z/^
Echéance principale : Cette résiliation prendra effet à l'échéance principale
du contrat.
contract_number Le 04/08/2023
Je soussigné(e) name, n° de sécurité sociale ssnumber demeurant au address, pour
agir en mon nom et pour mon compte afin de résilier mes contrats.
more names and addresses
��

Some characters can't be printed, and I guess this is why they can't be inserted into the Dataiku dataset.
I can't give you the file because it contains sensitive data, but I don't think it would be of any use anyway, since the faulty extracted text is a pdfium problem.
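
Since the file itself can't be shared, one way to describe the problem without exposing any content is to report only the code points of the control characters found. This is a minimal sketch, not part of the plugin: `summarize_control_chars` is a hypothetical helper, and it uses the Unicode category `Cc`, which is slightly broader than the regexp proposed further down (it also covers vertical tab, form feed, and U+007F to U+009F).

```python
import unicodedata
from collections import Counter


def summarize_control_chars(text):
    """Count control characters (Unicode category Cc) by code point.

    Tab, line feed and carriage return are ignored because they are
    expected in normal extracted text. Hypothetical helper for diagnosis.
    """
    counts = Counter(
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r"
    )
    return dict(counts)
```

Logging the returned dictionary instead of the full text keeps sensitive content out of the job logs.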

But I was wondering whether some kind of test could be added to check for non-printable characters in the extracted text, and either remove them or raise an error if any are found.

It could be added at lines 46 and 58 of the text-extraction/recipy.py file, right after the check for an empty text/chunk.
Here is what I did in plugin development to get what I wanted (only for the non-chunked version):

import re

### `logger` is assumed to be the logger already defined in the recipe

### Maybe add this as a parameter somewhere
auto_remove_non_printable = False

### This regexp matches ASCII characters 0 to 8 and 14 to 31 (most of the control characters);
### characters 9 to 13 (tab, line feed, vertical tab, form feed, carriage return) are kept
### maybe 14 and 15 should be kept, maybe others should be kept
non_valid_char_regexp = r'[\x00-\x08\x0E-\x1F]'

### functions could be added to some other files, maybe python-lib/text-extraction/__init__.py
def has_non_printable_chars(text):
    non_printable_chars = re.findall(non_valid_char_regexp, text)
    for char in non_printable_chars:
        ### repr-style formatting so the control character is visible in the logs
        logger.warning(f"Non-printable character {char!r} found in the text")
    return bool(non_printable_chars)

def remove_non_printable_chars(text):
    return re.sub(non_valid_char_regexp, '', text)


## And this is how I modified the function: 

            if not extracted_text.strip():
                logger.warning("Extracted text is empty")
            elif has_non_printable_chars(extracted_text):
                logger.info("Extracted_text : ## {} ##".format(extracted_text))
                if auto_remove_non_printable:
                    # re.sub returns a new string, so the result must be reassigned
                    extracted_text = remove_non_printable_chars(extracted_text)
                else:
                    raise ValueError("Non printable characters found in extracted text")

With this code added, the recipe doesn't fail and I can get the cleaned text if I want to. It doesn't seem like a really good idea though: you could end up with only part of the text and not realise it.
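
One way to reduce the risk of silently losing text would be to report how much was removed and to refuse to continue when too much of the text is affected. The following is only a sketch under assumptions: `clean_and_report` and `max_removed_ratio` are hypothetical names, not existing plugin parameters, and the 5% default is arbitrary. It reuses the same control-character pattern as above.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Same pattern as above: ASCII control characters 0-8 and 14-31.
NON_VALID_CHAR_REGEXP = re.compile(r'[\x00-\x08\x0E-\x1F]')


def clean_and_report(text, max_removed_ratio=0.05):
    """Strip control characters and return (cleaned_text, removed_count).

    Raises ValueError when the share of removed characters exceeds
    max_removed_ratio (a hypothetical threshold, chosen arbitrarily).
    """
    # subn returns the new string and the number of substitutions made
    cleaned, removed = NON_VALID_CHAR_REGEXP.subn('', text)
    ratio = removed / len(text) if text else 0.0
    if removed:
        logger.warning(
            "Removed %d control character(s) (%.1f%% of the text)",
            removed, ratio * 100,
        )
    if ratio > max_removed_ratio:
        raise ValueError(
            "Too many non-printable characters removed: {} of {}".format(removed, len(text))
        )
    return cleaned, removed
```

The removed count could also be written to an extra output column, so that partially cleaned documents remain identifiable in the dataset.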

Anyway, hope it helped,

Regards,

Adrien
