Error when loading from directory/folder #2

keyvan-najafy · 2024-10-16T01:18:04Z

Hi,
Before I begin, thank you for creating such a great package!
I'm trying to load a batch of PDFs from Google Drive into Google Colab and extract their tables.
When I run it on a single PDF (which uses load_from_file), everything works, but when I pass a directory path to load_pdfs_images I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-7ff012d2aaf5> in <cell line: 1>()
----> 1 images, highres_images, names, text_lines = load_pdfs_images(input_dir)


1 frames
/usr/local/lib/python3.10/dist-packages/surya/input/load.py in load_from_folder(folder_path, max_pages, start_page, dpi, load_text_lines)
     69             images.extend(image)
     70             names.extend(name)
---> 71             text_lines.extend(text_line)
     72         else:
     73             try:


TypeError: 'NoneType' object is not iterable

In surya/input/load.py, load_from_folder (as shown in the error above) calls the load_pdf function in the same module to get the value of text_line.
Inside load_pdf, the following code can leave text_lines set to None:

    if load_text_lines:
        from surya.input.pdflines import get_page_text_lines # Putting import here because pypdfium2 causes warnings if its not the top import
        text_lines = get_page_text_lines(
            pdf_path,
            page_indices,
            [i.size for i in images]
        )

It seems that get_page_text_lines returns None when it reaches an empty PDF page, which raises an error for the whole run instead of skipping the empty page.
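
For illustration, here is a minimal sketch of the kind of guard that would avoid the crash. It does not call surya at all: `extend_or_pad` is a hypothetical helper, and the `results` list is made-up data standing in for what load_pdf returns per file. It only shows the general pattern of padding with placeholders when the text-line result is None, so the three lists stay the same length.

```python
from typing import Any, List, Optional


def extend_or_pad(target: List[Any], items: Optional[List[Any]], pad_count: int) -> None:
    """Extend target with items, or with pad_count None placeholders if items is None.

    Hypothetical helper for illustration; not part of surya.
    """
    if items is None:
        target.extend([None] * pad_count)
    else:
        target.extend(items)


# Made-up per-PDF results in the shape (images, names, text_lines).
results = [
    (["img_a1", "img_a2"], ["a", "a"], [["line 1"], ["line 2"]]),
    (["img_b1"], ["b"], None),  # text extraction returned None for this file
]

images: List[Any] = []
names: List[Any] = []
text_lines: List[Any] = []

for image, name, text_line in results:
    images.extend(image)
    names.extend(name)
    # A plain text_lines.extend(text_line) would raise TypeError on None.
    extend_or_pad(text_lines, text_line, pad_count=len(image))

print(len(images), len(names), len(text_lines))
```

Padding rather than skipping keeps each image aligned with its entry in text_lines, but whether that is the right behaviour for the library is a design choice for the maintainers.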


VikParuchuri · 2024-10-18T13:16:28Z

Will fix this shortly

VikParuchuri closed this as completed Oct 18, 2024
