feat: Added a configurable connection timeout to TikaDocumentConverter #10294

sjrl · 2026-01-05T07:31:32Z

Are the headers and verify keys needed? They seem unrelated to the timeout feature.
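
For illustration, here is a minimal sketch of request options that carry only the timeout. `build_request_options` is a hypothetical helper, not part of this PR. It assumes the Tika client falls back to its own defaults for any key that is omitted, which should be checked against the installed `tika` version.

```python
def build_request_options(timeout: int = 60) -> dict:
    """Return request options that carry only the timeout (hypothetical helper)."""
    if timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # Assumption: keys left out here (for example headers and verify)
    # fall back to whatever defaults the Tika client applies itself.
    return {"timeout": timeout}


request_options = build_request_options(timeout=300)
```

If that assumption holds, the resulting dict could be passed as `requestOptions` without touching headers or TLS verification.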

sjrl · 2026-01-05T07:30:02Z

Let's rephrase the text. Also, the library we use for our release notes relies on reStructuredText formatting, so I've updated the formatting accordingly.

Suggested change

---

enhancements: >

the conversion of longer documents or documents that make heavy use of tesseract when using the TikaDocumentConverter may fail

with a connection timeout error, because the tika library has a default connection timeout of 60 seconds. This enhances the

TikaDocumentConverter with a configurable timeout. The default timeout stays at 60 seconds.

```python

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)

```

---
enhancements:
  - |
    The ``TikaDocumentConverter`` now supports a configurable connection timeout. This helps prevent conversion failures for long-running documents caused by Tika's default 60 second timeout. The default remains unchanged.

    .. code-block:: python

        from haystack.components.converters.tika import TikaDocumentConverter
        converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)

@@ -74,7 +74,7 @@ class TikaDocumentConverter:
         ```
         """
-        def __init__(self, tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False):
+        def __init__(self, tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False, timeout: int = 60):
             """
             Create a TikaDocumentConverter component.
@@ Expand All @@
             :param store_full_path:
                 If True, the full path of the file is stored in the metadata of the document.
                 If False, only the file name is stored.
+            :param timeout:
+                Timeout for Tika server requests.
             """
             tika_import.check()
             self.tika_url = tika_url
             self.store_full_path = store_full_path
+            self.timeout = timeout
         @component.output_types(documents=list[Document])
         def run(self, sources: list[str | Path | ByteStream], meta: dict[str, Any] | list[dict[str, Any]] | None = None):
@@ Expand Down Expand Up @@
                 try:
                     # we extract the content as XHTML to preserve the structure of the document as much as possible
                     # this works for PDFs, but does not work for other file types (DOCX)
+                    requestOptions = {"headers": {}, "timeout": self.timeout, "verify": False}
                     xhtml_content = tika_parser.from_buffer(
-                        io.BytesIO(bytestream.data), serverEndpoint=self.tika_url, xmlContent=True
+                        io.BytesIO(bytestream.data),
+                        serverEndpoint=self.tika_url,
+                        xmlContent=True,
+                        requestOptions=requestOptions,
                     )["content"]
                     xhtml_parser = XHTMLParser()
                     xhtml_parser.feed(xhtml_content)
@@ Expand Down @@
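
One way to check that the configured timeout reaches the parser call is a small unit test with a mocked parser. This is only a sketch: `extract_xhtml` is a hypothetical stand-in for the parsing step in the diff above, and `fake_parser` plays the role of `tika_parser`, so no Tika server or `tika` installation is required.

```python
import io
from unittest.mock import MagicMock


def extract_xhtml(parser, data: bytes, tika_url: str, timeout: int) -> str:
    """Hypothetical stand-in for the parsing step shown in the diff."""
    return parser.from_buffer(
        io.BytesIO(data),
        serverEndpoint=tika_url,
        xmlContent=True,
        requestOptions={"timeout": timeout},
    )["content"]


# A mock replaces the real parser, so the call never leaves the process.
fake_parser = MagicMock()
fake_parser.from_buffer.return_value = {"content": "<html></html>"}

content = extract_xhtml(fake_parser, b"dummy", "http://localhost:9998/tika", timeout=300)

# The configured timeout should arrive unchanged in the keyword arguments.
forwarded = fake_parser.from_buffer.call_args.kwargs["requestOptions"]
assert forwarded["timeout"] == 300
```

The same pattern could be applied to the real component by patching the parser it imports, assuming the test suite already mocks Tika in that way.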
