You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
pdf_miner_parser = PDFMinerParser(extract_images=True)
with open("examp.pdf") as f:
blob = Blob(data=f.read())
pdf_miner_parser.parse(blob)
Error Message and Stack Trace (if applicable)
File "agents/pdf2md/pdf2md_agent.py", line 30, in _process_file
documents = pdf_miner_parser.parse(blob)
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
return list(self.lazy_parse(blob))
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 215, in lazy_parse
content = text_io.getvalue() + self._extract_images_from_page(
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 239, in _extract_images_from_page
if img.stream["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
AttributeError: 'list' object has no attribute 'name'
Description
I think I can fix this bug
System Info
python -m langchain_core.sys_info
System Information
OS: Linux
OS Version: #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023
Python Version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
Thank you for contributing to LangChain!
**PR title**: "community: fix PDF Filter Type Error"
- **Description:** fix PDF Filter Type Error"
- **Issue:** the issue #27153 it fixes,
- **Dependencies:** no
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
---------
Co-authored-by: Erick Friis <[email protected]>
Checked other resources
Example Code
example.pdf
Error Message and Stack Trace (if applicable)
File "agents/pdf2md/pdf2md_agent.py", line 30, in _process_file
documents = pdf_miner_parser.parse(blob)
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
return list(self.lazy_parse(blob))
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 215, in lazy_parse
content = text_io.getvalue() + self._extract_images_from_page(
File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 239, in _extract_images_from_page
if img.stream["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
AttributeError: 'list' object has no attribute 'name'
Description
I think I can fix this bug
System Info
python -m langchain_core.sys_info
System Information
Package Information
Optional packages not installed
Other Dependencies
The text was updated successfully, but these errors were encountered: