13 changes: 11 additions & 2 deletions haystack/components/converters/tika.py
@@ -74,7 +74,7 @@ class TikaDocumentConverter:
     ```
     """

-    def __init__(self, tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False):
+    def __init__(self, tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False, timeout: int = 60):
         """
         Create a TikaDocumentConverter component.

@@ -83,10 +83,13 @@ def __init__(self, tika_url: str = "http://localhost:9998/tika", store_full_path
         :param store_full_path:
             If True, the full path of the file is stored in the metadata of the document.
             If False, only the file name is stored.
+        :param timeout:
+            Timeout for Tika server requests.
         """
         tika_import.check()
         self.tika_url = tika_url
         self.store_full_path = store_full_path
+        self.timeout = timeout

     @component.output_types(documents=list[Document])
     def run(self, sources: list[str | Path | ByteStream], meta: dict[str, Any] | list[dict[str, Any]] | None = None):
@@ -119,8 +122,14 @@ def run(self, sources: list[str | Path | ByteStream], meta: dict[str, Any] | lis
             try:
                 # we extract the content as XHTML to preserve the structure of the document as much as possible
                 # this works for PDFs, but does not work for other file types (DOCX)
+
+                requestOptions = {"headers": {}, "timeout": self.timeout, "verify": False}
Contributor


Are the headers and verify keys needed? They seem unrelated to the timeout feature.
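
One possible way to address this, sketched under assumptions: pass only the `timeout` key and leave `headers` and `verify` untouched. The helper name `build_request_options` is hypothetical and not part of this PR, and whether the tika client falls back to its own defaults for omitted keys should be confirmed against its source before adopting this.

```python
from typing import Any


def build_request_options(timeout: int) -> dict[str, Any]:
    """Hypothetical helper: return only the option this PR is about."""
    if timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # Assumption: keys that are left out ("headers", "verify") are handled
    # by the client's own defaults, so they are deliberately not set here.
    return {"timeout": timeout}


request_options = build_request_options(300)
print(request_options)
```

Keeping the dictionary to a single key would also avoid silently disabling TLS verification as a side effect of a timeout change.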


                 xhtml_content = tika_parser.from_buffer(
-                    io.BytesIO(bytestream.data), serverEndpoint=self.tika_url, xmlContent=True
+                    io.BytesIO(bytestream.data),
+                    serverEndpoint=self.tika_url,
+                    xmlContent=True,
+                    requestOptions=requestOptions,
                 )["content"]
                 xhtml_parser = XHTMLParser()
                 xhtml_parser.feed(xhtml_content)
@@ -0,0 +1,11 @@
---
enhancements: >
the conversion of longer documents or documents that make heavy use of tesseract when using the TikaDocumentConverter may fail
with a connection timeout error, because the tika library has a default connection timeout of 60 seconds. This enhances the
TikaDocumentConverter with a configurable timeout. The default timeout stays at 60 seconds.

```python
from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)
```
Comment on lines +1 to +11
Contributor


Let's rephrase the text. Also, the library we use for our release notes expects reStructuredText rather than Markdown formatting, so I've updated the formatting in the suggestion below.

Suggested change
----
-enhancements: >
-  the conversion of longer documents or documents that make heavy use of tesseract when using the TikaDocumentConverter may fail
-  with a connection timeout error, because the tika library has a default connection timeout of 60 seconds. This enhances the
-  TikaDocumentConverter with a configurable timeout. The default timeout stays at 60 seconds.
-
-  ```python
-  from haystack.components.converters.tika import TikaDocumentConverter
-
-  converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)
-  ```
+---
+enhancements:
+  - |
+    The ``TikaDocumentConverter`` now supports a configurable connection timeout. This helps prevent conversion failures for long-running documents caused by Tika's default 60 second timeout. The default remains unchanged.
+
+    .. code-block:: python
+
+        from haystack.components.converters.tika import TikaDocumentConverter
+
+        converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)
