@asisdrico

Proposed Changes:

Converting longer documents, or documents that rely heavily on Tesseract OCR, with the TikaDocumentConverter may fail with a connection timeout error, because the tika library uses a default connection timeout of 60 seconds. This PR adds a configurable timeout to the TikaDocumentConverter. The default timeout stays at 60 seconds.

```python
from haystack.components.converters.tika import TikaDocumentConverter

tika_url = "http://localhost:9998/tika"  # assumed address of a locally running Tika server
converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)
```

How did you test it?

The error first came up when I tried to convert the following document with the TikaDocumentConverter:

https://www.koalitionsvertrag2025.de/sites/www.koalitionsvertrag2025.de/files/koav_2025.pdf

With Tika running in Docker and Tesseract enabled, Tika first runs OCR on the images on the first page and then processes the text. I always ran into the 60-second timeout of the HTTP connection that the tika library (https://github.com/chrismattmann/tika-python) configures by default in its callServer method:

    requestOptionsDefault = {
        'timeout': 60,
        'headers': headers,
        'verify': False
    }

After I set the timeout to 300 seconds, the conversion of the above document worked flawlessly. The value actually needed will depend on the performance of the machine used.
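
To illustrate how a caller-supplied timeout could take precedence over the defaults quoted above, here is a minimal sketch. It assumes the library merges the caller's options over its defaults, which is not shown in the snippet above. The function name `effective_request_options` is hypothetical and not part of the tika library.

```python
from typing import Any, Dict, Optional


def effective_request_options(
    overrides: Optional[Dict[str, Any]] = None,
    headers: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: merge caller overrides over 60-second defaults.

    This mirrors the shape of the defaults quoted above; it is an
    illustration, not the library's own code.
    """
    defaults: Dict[str, Any] = {
        "timeout": 60,
        "headers": headers or {},
        "verify": False,
    }
    merged = defaults.copy()
    # Keys supplied by the caller win; everything else keeps its default.
    merged.update(overrides or {})
    return merged


options = effective_request_options({"timeout": 300})
print(options["timeout"], options["verify"])
```

If the merge works this way, a caller only has to supply the keys it wants to change, and the remaining defaults stay in effect.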

Checklist

  • I have updated the docstrings. (Yes)

  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes. (Yes)

  • I have documented my code. (Yes)

@asisdrico asisdrico requested a review from a team as a code owner January 3, 2026 10:47
@asisdrico asisdrico requested review from sjrl and removed request for a team January 3, 2026 10:47

vercel bot commented Jan 3, 2026

@asisdrico is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.


CLAassistant commented Jan 3, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the type:documentation Improvements on the docs label Jan 3, 2026
@sjrl sjrl self-assigned this Jan 5, 2026
Comment on lines +1 to +11
---
enhancements: >
the conversion of longer documents or documents that make heavy use of tesseract when using the TikaDocumentConverter may fail
with a connection timeout error, because the tika library has a default connection timeout of 60 seconds. This enhances the
TikaDocumentConverter with a configurable timeout. The default timeout stays at 60 seconds.
```python
from haystack.components.converters.tika import TikaDocumentConverter
converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)
```
Contributor

Let's rephrase the text. Also, the library we use for our release notes relies on reStructuredText formatting rather than Markdown, so I've updated the formatting accordingly.

Suggested change
---
enhancements:
  - |
    The ``TikaDocumentConverter`` now supports a configurable connection timeout. This helps prevent conversion failures for long-running documents caused by Tika's default 60 second timeout. The default remains unchanged.

    .. code-block:: python

        from haystack.components.converters.tika import TikaDocumentConverter
        converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)

# we extract the content as XHTML to preserve the structure of the document as much as possible
# this works for PDFs, but does not work for other file types (DOCX)

requestOptions = {"headers": {}, "timeout": self.timeout, "verify": False}
Contributor

Are the headers and verify keys needed? They seem unrelated to the timeout feature.

Contributor

sjrl commented Jan 5, 2026

@asisdrico thanks for the changes!

For testing, one of our integration tests in test/components/converters/test_tika_document_converter.py should be updated to use the new timeout option.
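
As a rough sketch of what such a test could check, the example below verifies that a configured timeout is forwarded to the parsing call. It uses a stand-in class, `FakeConverter`, and a mocked parse function, both hypothetical. It does not use the real `TikaDocumentConverter` or the existing test file, so the names and call shape here are assumptions.

```python
from typing import Any, Callable, Dict
from unittest.mock import Mock


class FakeConverter:
    """Hypothetical stand-in for a converter that forwards a timeout to its parser."""

    def __init__(self, parse: Callable[..., Dict[str, Any]], timeout: int = 60) -> None:
        self.parse = parse
        self.timeout = timeout

    def run(self, data: bytes) -> str:
        # Only the timeout is passed on; any other options are left to the parser.
        result = self.parse(data, requestOptions={"timeout": self.timeout})
        return result["content"]


def test_timeout_is_forwarded() -> None:
    # The mock replaces the network call, so no Tika server is needed.
    parse = Mock(return_value={"content": "hello"})
    converter = FakeConverter(parse=parse, timeout=300)

    assert converter.run(b"%PDF-") == "hello"
    parse.assert_called_once_with(b"%PDF-", requestOptions={"timeout": 300})


test_timeout_is_forwarded()
```

A unit-style check like this runs without a Tika server. An integration test against a running server would additionally confirm that the longer timeout takes effect end to end.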
