Open
Labels: bug (Something isn't working)
Description
Describe the bug
During the table hybrid parse, the following error occurs with the latest langchain-community package:
raise ValueError("The PDF parser must valorize the standard metadata.")
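According to the traceback below, the message is raised by `_validate_metadata` in `langchain_community/document_loaders/parsers/pdf.py`. As a rough illustration of that kind of check, here is a minimal sketch assuming the parser requires a fixed set of metadata keys. The key names in `REQUIRED_KEYS` and the function `validate_metadata` are hypothetical and are not copied from langchain-community.

```python
from typing import Any

# Hypothetical set of required keys, chosen for illustration only.
REQUIRED_KEYS = {"source", "total_pages"}


def validate_metadata(metadata: dict[str, Any]) -> dict[str, Any]:
    """Return the metadata unchanged, or raise if a required key is missing."""
    missing = REQUIRED_KEYS - set(metadata)
    if missing:
        raise ValueError(f"Missing standard metadata keys: {sorted(missing)}")
    return metadata
```

If the real check works along these lines, a parser that does not populate every required key would fail on any PDF, regardless of its content.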
To Reproduce
Steps to reproduce the behavior:
- Update the `langchain-community` package to the latest version.
- Run the `test_table_hybrid_parse.py` test code.
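Since "latest" changes over time, it may help to record the exact installed version when reproducing. A small sketch using only the standard library follows; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print the version to paste into the bug report.
    print("langchain-community:", installed_version("langchain-community"))
```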
Expected behavior
The test should run without errors and pass.
Full Error log
======================= 1 failed, 11 warnings in 10.83s ========================
-------------------------------- live log call ---------------------------------
INFO AutoRAG:base.py:23 Running parser - table_hybrid_parse module...
FAILED [100%]
[03/03/25 15:17:51] INFO [__init__.py:81] >> You are using API version of AutoRAG.To use local version, run pip install 'AutoRAG[gpu]'
[03/03/25 15:17:55] INFO [__init__.py:81] >> You are using API version of AutoRAG.To use local version, run pip install 'AutoRAG[gpu]'
tests/autorag/data/parse/test_table_hybrid_parse.py:32 (test_table_hybrid_parse_node)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/jeffrey/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/jeffrey/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/Users/jeffrey/PycharmProjects/AutoRAG/autorag/autorag/data/parse/langchain_parse.py", line 58, in langchain_parse_pure
documents = parse_instance.load()
File "/Users/jeffrey/PycharmProjects/AutoRAG/venv310/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 31, in load
return list(self.lazy_load())
File "/Users/jeffrey/PycharmProjects/AutoRAG/venv310/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 682, in lazy_load
yield from self.parser.lazy_parse(blob)
File "/Users/jeffrey/PycharmProjects/AutoRAG/venv310/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 793, in lazy_parse
metadata=_validate_metadata(doc_metadata),
File "/Users/jeffrey/PycharmProjects/AutoRAG/venv310/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 140, in _validate_metadata
raise ValueError("The PDF parser must valorize the standard metadata.")
ValueError: The PDF parser must valorize the standard metadata.
"""
The above exception was the direct cause of the following exception:
def test_table_hybrid_parse_node():
> result_df = table_hybrid_parse(hybrid_glob, file_type="pdf", **table_hybrid_params)
autorag/data/parse/test_table_hybrid_parse.py:34:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../autorag/autorag/utils/util.py:72: in wrapper
results = func(*args, **kwargs)
../autorag/autorag/data/parse/base.py:65: in wrapper
result = func(data_path_list=data_paths, **kwargs)
../autorag/autorag/data/parse/table_hybrid_parse.py:54: in table_hybrid_parse
text_results, text_file_path = get_each_module_result(
../autorag/autorag/data/parse/table_hybrid_parse.py:123: in get_each_module_result
texts, path, _ = module_original(data_path_list, **module_params)
../autorag/autorag/data/parse/langchain_parse.py:30: in langchain_parse
results = pool.starmap(
../../../.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py:375: in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <multiprocessing.pool.MapResult object at 0x1682b3b50>, timeout = None
def get(self, timeout=None):
self.wait(timeout)
if not self.ready():
raise TimeoutError
if self._success:
return self._value
else:
> raise self._value
E ValueError: The PDF parser must valorize the standard metadata.
../../../.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py:774: ValueError
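The log has two parts because the parse call runs inside a `multiprocessing` pool: the worker raises, and `starmap(...)` re-raises the same exception in the caller through `MapResult.get()`. Below is a minimal sketch of that propagation, assuming a hypothetical worker `parse_one`. It uses `ThreadPool`, which exposes the same `starmap` API, so it runs without a `__main__` guard; a process pool additionally attaches the `RemoteTraceback` seen above.

```python
from multiprocessing.pool import ThreadPool


def parse_one(path: str, file_type: str) -> str:
    """Hypothetical worker: fails for one input to simulate a parser error."""
    if path.endswith("bad.pdf"):
        raise ValueError(f"cannot parse {path}")
    return f"parsed {path} as {file_type}"


def parse_all(paths: list[str], file_type: str = "pdf") -> list[str]:
    """Run parse_one over all paths; any worker exception surfaces in the caller."""
    with ThreadPool(processes=2) as pool:
        return pool.starmap(parse_one, [(p, file_type) for p in paths])
```

In this sketch a single failing file aborts the whole batch, which matches the behavior in the log: one `ValueError` in a worker fails the entire `table_hybrid_parse` call.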
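Code where the bug occurs
See langchain_parse.py: the call `documents = parse_instance.load()` in `langchain_parse_pure` (line 58 in the traceback above).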
Desktop:
- OS: macOS
- Python version: 3.10