Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Updated ADAPTER_LLMW_MAX_POLLS to 120 for 1 hour extraction #904

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

chandrasekharan-zipstack
Copy link
Contributor

What

  • Updated env to reduce extraction time from ~8.3 hours to ~1 hour

Why

  • After move to LLMW v2, multi page extractions of 1500 pages should be done in ~1 hour

Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

  • No, reduced extraction time only for LLMW v2

Env Config

# ~60 mins (assuming it'll be enough to process 1500 pages with LLMW v2)
ADAPTER_LLMW_MAX_POLLS=120

Related Issues or PRs

Notes on Testing

  • No explicit testing / validation done - logic based on these envs work though

Checklist

I have read and understood the Contribution Guidelines.

@chandrasekharan-zipstack chandrasekharan-zipstack requested review from ritwik-g and a team December 18, 2024 07:52
@chandrasekharan-zipstack chandrasekharan-zipstack requested review from pk-zipstack and removed request for a team December 18, 2024 07:52
Copy link
Contributor

filepath function $$\textcolor{#23d18b}{\tt{passed}}$$ SUBTOTAL
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_logs}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_cleanup}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_cleanup\_skip}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_client\_init}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_get\_image\_exists}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_get\_image}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_get\_container\_run\_config}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_get\_container\_run\_config\_without\_mount}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{worker/src/unstract/worker/clients/test\_docker.py}}$$ $$\textcolor{#23d18b}{\tt{test\_run\_container}}$$ $$\textcolor{#23d18b}{\tt{1}}$$ $$\textcolor{#23d18b}{\tt{1}}$$
$$\textcolor{#23d18b}{\tt{TOTAL}}$$ $$\textcolor{#23d18b}{\tt{9}}$$ $$\textcolor{#23d18b}{\tt{9}}$$

# 500 mins to allow 1500 (max pages limit) * 20 (approx time in sec to process a page)
ADAPTER_LLMW_MAX_POLLS=1000
# ~60 mins (assuming it'll be enough to process 1500 pages with LLMW v2)
ADAPTER_LLMW_MAX_POLLS=120
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chandrasekharan-zipstack but what about for the V1? Doesn't it use the same ENV?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ritwik-g you're right, v1 uses the same envs. Shall we discourage support for v1 by changing this env (reducing the time / max possible pages) and instead enforce support with v2 for more number of pages?
Realistically speaking - I doubt if any user has such large extraction times.

Worst case,

  1. either we'll have to let this env be and take action after we sunset v1
  2. or introduce a new set of envs for v2 and update that (involves changes in the SDK, so I'm not a fan of this)

# 500 mins to allow 1500 (max pages limit) * 20 (approx time in sec to process a page)
ADAPTER_LLMW_MAX_POLLS=1000
# ~60 mins (assuming it'll be enough to process 1500 pages with LLMW v2)
ADAPTER_LLMW_MAX_POLLS=120
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chandrasekharan-zipstack Will this not cause timeout if it exceeds max 15min time which we have set in gunicorn ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harini-venkataraman the timeout exists only for extractions that happens with a web UI. This large setting is mainly ideal for async pipeline based extractions where such gunicorn timeouts will not play a role

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants