Error when loading from directory/folder #2

Closed
keyvan-najafy opened this issue Oct 16, 2024 · 1 comment

keyvan-najafy commented Oct 16, 2024

Hi,
Before I begin, thank you for creating such a great package!
I'm trying to load a batch of PDFs from Google Drive into Google Colab and extract their tables.
When I run it on a single PDF (so using the load_from_file functionality), everything works fine, but when I pass a directory path to the load_pdfs_images function I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-7ff012d2aaf5> in <cell line: 1>()
----> 1 images, highres_images, names, text_lines = load_pdfs_images(input_dir)


1 frames
/usr/local/lib/python3.10/dist-packages/surya/input/load.py in load_from_folder(folder_path, max_pages, start_page, dpi, load_text_lines)
     69             images.extend(image)
     70             names.extend(name)
---> 71             text_lines.extend(text_line)
     72         else:
     73             try:


TypeError: 'NoneType' object is not iterable
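
For reference, the same message can be reproduced without surya at all. This is only a minimal sketch of the failure mode (calling `list.extend` with `None`), not the library code:

```python
# Minimal reproduction of the error message, independent of surya.
text_lines = []
text_line = None  # stands in for a per-file result that came back as None

try:
    text_lines.extend(text_line)  # list.extend needs an iterable
except TypeError as exc:
    message = str(exc)
    print(message)
```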

In surya/input/load.py, load_from_folder (as shown in the traceback above) calls the load_pdf function in the same module and assigns part of its result to the text_line variable.
Inside load_pdf, the following code is what produces None for the text_lines variable:

    if load_text_lines:
        from surya.input.pdflines import get_page_text_lines # Putting import here because pypdfium2 causes warnings if its not the top import
        text_lines = get_page_text_lines(
            pdf_path,
            page_indices,
            [i.size for i in images]
        )

It seems that get_page_text_lines returns None when it reaches an empty PDF page, which raises an error for the whole process instead of just skipping the empty page.
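
A possible guard, as a minimal sketch rather than the actual fix: check for None before extending, and pad with one placeholder per page so the lists stay aligned. The names `collect_from_files` and `load_one` are hypothetical stand-ins for the folder loop and the per-file loader; I'm assuming the per-file loader returns an (images, names, text_lines) triple, which may not match the real signature.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Hypothetical shape of a per-file loader result: (images, names, text_lines)
LoadResult = Tuple[List[Any], List[str], Optional[List[Any]]]


def collect_from_files(
    paths: Sequence[str],
    load_one: Callable[[str], LoadResult],
) -> Tuple[List[Any], List[str], List[Any]]:
    """Aggregate per-file results, tolerating text_lines being None."""
    images: List[Any] = []
    names: List[str] = []
    text_lines: List[Any] = []
    for path in paths:
        image, name, text_line = load_one(path)
        images.extend(image)
        names.extend(name)
        if text_line is None:
            # Pad with one None per page so the three lists stay the same length.
            text_line = [None] * len(image)
        text_lines.extend(text_line)
    return images, names, text_lines


def fake_loader(path: str) -> LoadResult:
    """Stand-in loader: returns None for text lines when the 'PDF' is empty."""
    if path == "empty.pdf":
        return ["page0"], [path], None
    return ["page0", "page1"], [path, path], [["line a"], ["line b"]]


images, names, text_lines = collect_from_files(["ok.pdf", "empty.pdf"], fake_loader)
```

Whether an empty page should be padded with None, given an empty list, or dropped entirely depends on what the downstream code expects, so treat the padding choice here as an assumption.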

@VikParuchuri
Owner

Will fix this shortly
