[BUG] raise ValueError("The PDF parser must valorize the standard metadata.") at table hybrid parse #1096

@vkehfdl1

Describe the bug
When running the table hybrid parse, the following error occurs with the latest langchain-community package.

raise ValueError("The PDF parser must valorize the standard metadata.")

To Reproduce
Steps to reproduce the behavior:

  1. Update the langchain-community package to the latest version.
  2. Run the test in tests/autorag/data/parse/test_table_hybrid_parse.py.
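
Since the failure depends on which langchain-community release is installed, it may help to record the installed versions before running the test. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of AutoRAG or LangChain.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as they appear in this report.
    for name in ("langchain-community", "langchain-core", "AutoRAG"):
        print(f"{name}: {installed_version(name)}")
```

Pasting the printed versions into the report makes it easier to tell which release introduced the error.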

Expected behavior
The test should run without errors and pass.

Full Error log

======================= 1 failed, 11 warnings in 10.83s ========================

-------------------------------- live log call ---------------------------------
INFO     AutoRAG:base.py:23 Running parser - table_hybrid_parse module...
FAILED                                                                   [100%]
[03/03/25 15:17:51] INFO     [__init__.py:81] >> You are using API version of AutoRAG.To use local version, run pip install 'AutoRAG[gpu]'
[03/03/25 15:17:55] INFO     [__init__.py:81] >> You are using API version of AutoRAG.To use local version, run pip install 'AutoRAG[gpu]'

tests/autorag/data/parse/test_table_hybrid_parse.py:32 (test_table_hybrid_parse_node)
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/jeffrey/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/jeffrey/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/jeffrey/PycharmProjects/AutoRAG/autorag/autorag/data/parse/langchain_parse.py", line 58, in langchain_parse_pure
    documents = parse_instance.load()
  File "/Users/jeffrey/PycharmProjects/AutoRAG/venv310/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
  File "/Users/jeffrey/PycharmProjects/AutoRAG/venv310/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 682, in lazy_load
    yield from self.parser.lazy_parse(blob)
  File "/Users/jeffrey/PycharmProjects/AutoRAG/venv310/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 793, in lazy_parse
    metadata=_validate_metadata(doc_metadata),
  File "/Users/jeffrey/PycharmProjects/AutoRAG/venv310/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 140, in _validate_metadata
    raise ValueError("The PDF parser must valorize the standard metadata.")
ValueError: The PDF parser must valorize the standard metadata.
"""

The above exception was the direct cause of the following exception:

    def test_table_hybrid_parse_node():
>   	result_df = table_hybrid_parse(hybrid_glob, file_type="pdf", **table_hybrid_params)

autorag/data/parse/test_table_hybrid_parse.py:34: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../autorag/autorag/utils/util.py:72: in wrapper
    results = func(*args, **kwargs)
../autorag/autorag/data/parse/base.py:65: in wrapper
    result = func(data_path_list=data_paths, **kwargs)
../autorag/autorag/data/parse/table_hybrid_parse.py:54: in table_hybrid_parse
    text_results, text_file_path = get_each_module_result(
../autorag/autorag/data/parse/table_hybrid_parse.py:123: in get_each_module_result
    texts, path, _ = module_original(data_path_list, **module_params)
../autorag/autorag/data/parse/langchain_parse.py:30: in langchain_parse
    results = pool.starmap(
../../../.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py:375: in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <multiprocessing.pool.MapResult object at 0x1682b3b50>, timeout = None

    def get(self, timeout=None):
        self.wait(timeout)
        if not self.ready():
            raise TimeoutError
        if self._success:
            return self._value
        else:
>           raise self._value
E           ValueError: The PDF parser must valorize the standard metadata.

../../../.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py:774: ValueError
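
The traceback ends in `_validate_metadata` in langchain_community/document_loaders/parsers/pdf.py, which rejects the metadata produced by the PDF parser. The sketch below is an illustration only, not the library's code: it assumes the check is a required-keys test, and the key set `ASSUMED_STD_KEYS` and both function names are hypothetical. It shows how a metadata dict with a missing key would trigger this error, and how to list the missing keys when debugging.

```python
from typing import Any, Dict, Set

# ASSUMPTION: illustrative set of "standard" metadata keys; the real set
# should be checked against the installed langchain-community source.
ASSUMED_STD_KEYS: Set[str] = {
    "source",
    "total_pages",
    "creationdate",
    "creator",
    "producer",
}


def missing_standard_keys(metadata: Dict[str, Any]) -> Set[str]:
    """Return the assumed standard keys that are absent from `metadata`."""
    return ASSUMED_STD_KEYS - set(metadata)


def validate_metadata_sketch(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the failing check: raise if any key is missing."""
    if missing_standard_keys(metadata):
        raise ValueError("The PDF parser must valorize the standard metadata.")
    return metadata


if __name__ == "__main__":
    partial = {"source": "sample.pdf", "page": 0}
    print("missing:", sorted(missing_standard_keys(partial)))
```

If the assumption holds, printing the missing keys for the metadata returned by the parser used in langchain_parse.py should show which fields the parser no longer fills in.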

Code where the bug occurs
See langchain_parse.py (autorag/data/parse/langchain_parse.py, line 58: `documents = parse_instance.load()`).

Desktop (please complete the following information):

  • OS: macOS
  • Python version: 3.10.14
