feat: Added a configurable connection timeout to TikaDocumentConverter #10294
| --- | ||
| enhancements: > | ||
| the conversion of longer documents or documents that make heavy use of tesseract when using the TikaDocumentConverter may fail | ||
| with a connection timeout error, because the tika library has a default connection timeout of 60 seconds. This enhances the | ||
| TikaDocumentConverter with a configurable timeout. The default timeout stays at 60 seconds. | ||
| ```python | ||
| from haystack.components.converters.tika import TikaDocumentConverter | ||
| converter = TikaDocumentConverter(tika_url=tika_url, timeout=300) | ||
| ``` |
Let's rephrase the text. Also, the library we use for our release notes relies on reStructuredText formatting, so I've updated the formatting accordingly.
| --- | |
| enhancements: | |
| - | | |
| The ``TikaDocumentConverter`` now supports a configurable connection timeout. This helps prevent conversion failures for long-running documents caused by Tika's default 60-second timeout. The default remains unchanged. | |
| .. code-block:: python | |
| from haystack.components.converters.tika import TikaDocumentConverter | |
| converter = TikaDocumentConverter(tika_url=tika_url, timeout=300) |
| # we extract the content as XHTML to preserve the structure of the document as much as possible | ||
| # this works for PDFs, but does not work for other file types (DOCX) | ||
| requestOptions = {"headers": {}, "timeout": self.timeout, "verify": False} |
Are the headers and verify keys needed? They seem unrelated to the timeout feature.
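To make the question concrete, here is a minimal sketch of what happens if the client lays caller-supplied options over its own defaults. That merge behaviour is an assumption to verify against the tika-python source; `effective_request_options` and the `Accept` header are hypothetical names used only for illustration. The default values are the ones quoted from `callServer` in the PR description.

```python
from typing import Any, Dict, Optional

DEFAULT_TIMEOUT = 60  # default quoted from tika-python's callServer


def effective_request_options(
    headers: Dict[str, str], overrides: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Hypothetical helper: lay caller overrides on top of the client defaults."""
    # Assumption: overrides win key by key, as with dict.update.
    options: Dict[str, Any] = {"timeout": DEFAULT_TIMEOUT, "headers": headers, "verify": False}
    options.update(overrides or {})
    return options


server_headers = {"Accept": "text/xml"}  # illustrative header, not taken from the PR

# Passing only the timeout keeps the headers and verify defaults.
only_timeout = effective_request_options(server_headers, {"timeout": 300})

# Passing all three keys replaces the headers with an empty dict.
all_three = effective_request_options(
    server_headers, {"headers": {}, "timeout": 300, "verify": False}
)
```

Under that assumption, passing only `timeout` would be sufficient, and an explicit empty `headers` entry would discard any headers the client had prepared.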
@asisdrico thanks for the changes! For testing one of our integration tests in |
Proposed Changes:
When using the TikaDocumentConverter, the conversion of longer documents, or of documents that make heavy use of Tesseract, may fail with a connection timeout error because the tika library has a default connection timeout of 60 seconds. This PR adds a configurable timeout to the TikaDocumentConverter. The default timeout stays at 60 seconds.
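The idea can be sketched roughly as follows. This is not the code in the PR: `TimeoutAwareConverter` is a hypothetical stand-in class, the default URL is an assumption, and the parse function is injected so the sketch runs without a Tika server. It only shows a `timeout` parameter defaulting to 60 seconds being stored and forwarded as `requestOptions`.

```python
from typing import Any, Callable, Dict


class TimeoutAwareConverter:
    """Hypothetical stand-in for a converter that forwards a timeout to its HTTP client."""

    def __init__(self, tika_url: str = "http://localhost:9998/tika", timeout: int = 60):
        # The URL default is an assumption; the 60-second default mirrors the PR text.
        if timeout <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        self.tika_url = tika_url
        self.timeout = timeout

    def convert(self, data: bytes, parse: Callable[..., Dict[str, Any]]) -> Dict[str, Any]:
        # Forward only the timeout; other request options are left to the client.
        return parse(
            data,
            serverEndpoint=self.tika_url,
            requestOptions={"timeout": self.timeout},
        )


def fake_parse(data: bytes, **kwargs: Any) -> Dict[str, Any]:
    """Test double that echoes back the keyword arguments it received."""
    return {"content": data.decode("utf-8"), "kwargs": kwargs}


result = TimeoutAwareConverter(timeout=300).convert(b"hello", fake_parse)
default_result = TimeoutAwareConverter().convert(b"hello", fake_parse)
```

Injecting the parse function keeps the sketch independent of a running server; in real use the callable would be the tika client's parsing function.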
How did you test it?
The error first came up when I tried to convert the following document with the TikaDocumentConverter:
https://www.koalitionsvertrag2025.de/sites/www.koalitionsvertrag2025.de/files/koav_2025.pdf
With Tika running in Docker and Tesseract enabled, Tika first runs OCR on the images on the first page and then processes the text. I always ran into the 60-second timeout that the tika library (https://github.com/chrismattmann/tika-python) configures by default for the HTTP connection in its callServer method:
requestOptionsDefault = {
'timeout': 60,
'headers': headers,
'verify': False
}
After setting the timeout to 300 seconds, the conversion of the above document worked flawlessly. The timeout actually needed depends on the performance of the machine used.
Checklist
Yes
Yes
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes. Yes
Yes
Yes
Yes