PDFMinerParser Bug: Failed to Recognize Filter Type Error #27153

moyueheng · 2024-10-06T18:37:45Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

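from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_core.documents.base import Blob

pdf_miner_parser = PDFMinerParser(extract_images=True)
# Open in binary mode: a PDF is bytes, not text.
with open("examp.pdf", "rb") as f:
    blob = Blob(data=f.read())
    pdf_miner_parser.parse(blob)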

Error Message and Stack Trace (if applicable)

File "agents/pdf2md/pdf2md_agent.py", line 30, in _process_file
documents = pdf_miner_parser.parse(blob)
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
return list(self.lazy_parse(blob))
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 215, in lazy_parse
content = text_io.getvalue() + self._extract_images_from_page(
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 239, in _extract_images_from_page
if img.stream["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
AttributeError: 'list' object has no attribute 'name'

Description

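With `extract_images=True`, `PDFMinerParser` raises `AttributeError: 'list' object has no attribute 'name'` in `_extract_images_from_page`. The code reads `img.stream["Filter"].name` directly, which fails when the image stream's `Filter` entry is a list rather than a single object.

I think I can fix this bug.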
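
The traceback above shows `img.stream["Filter"]` arriving as a list. In PDF, a stream's `Filter` entry may be either a single name or an array of names, so one possible direction for a fix is to normalize the entry to a list of names before comparing against `_PDF_FILTER_WITHOUT_LOSS`. The following is a minimal sketch and not the merged patch: `FakeLiteral` and `filter_names` are hypothetical names introduced for illustration, and `FakeLiteral` only mimics the `.name` attribute seen in the traceback so the snippet runs without pdfminer installed.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class FakeLiteral:
    """Hypothetical stand-in for a PDF name object; only `.name` matters here."""

    name: str


def filter_names(filter_entry: Any) -> List[str]:
    """Return the filter names of a stream as a flat list.

    Accepts a single name object, a list/tuple of name objects, or None
    (no filter), so callers never touch `.name` on a list.
    """
    if filter_entry is None:
        return []
    if isinstance(filter_entry, (list, tuple)):
        entries = list(filter_entry)
    else:
        entries = [filter_entry]
    return [entry.name for entry in entries]


# A single filter and a chained filter both normalize to a list of names.
single = FakeLiteral("DCTDecode")
chained = [FakeLiteral("FlateDecode"), FakeLiteral("DCTDecode")]
print(filter_names(single))
print(filter_names(chained))
```

How the resulting names are then matched against `_PDF_FILTER_WITHOUT_LOSS` (first name only, any name, or all names) is a separate design decision that this sketch leaves open.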

System Info

python -m langchain_core.sys_info

System Information

OS: Linux
OS Version: #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023
Python Version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]

Package Information

langchain_core: 0.3.9
langchain: 0.3.2
langchain_community: 0.3.1
langsmith: 0.1.131
langchain_openai: 0.2.2
langchain_text_splitters: 0.3.0
langserve: 0.3.0

Optional packages not installed

langgraph

Other Dependencies

aiohttp: 3.10.9
async-timeout: 4.0.3
dataclasses-json: 0.6.7
fastapi: 0.115.0
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.51.0
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
sse-starlette: 1.8.2
tenacity: 8.5.0
tiktoken: 0.8.0
typing-extensions: 4.12.2


**PR title**: "community: fix PDF Filter Type Error"

- **Description:** fix PDF Filter Type Error
- **Issue:** #27153
- **Dependencies:** no
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified.

Co-authored-by: Erick Friis <[email protected]>

dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Oct 6, 2024

moyueheng mentioned this issue Oct 6, 2024

fix: 🐛 PDF Filter Type Error #27154

Merged

1 task
