Change server approach to handle parallel requests #1550
base: main
Conversation
@abetlen What do you think about these changes?
Hey, thanks for this PR. Is it possible to get it merged? 😄
@gerdemann @Smartappli Hi! I authored this PR two months ago. It looks like it has some conflicts now; I can fix them today if somebody can merge it right afterward.
@gerdemann @Smartappli I also see some activity over those two months related to how the server handles parallel requests in the main branch. Is that still an issue?
I still get this error when two requests are made at the same time:
I tried to install your branch directly and test it, but I get this error:
Do you have any idea what I am doing wrong?
Hi, I encountered the same issue. The server is still not handling concurrent requests properly. When I send a second request while the LLM is still generating a response for the first request, I receive this error.
I have implemented a smaller alternative change that solves the same problem in #1798
I have made a change to the way the server handles concurrent requests. In this PR, arriving requests wait for the model's global async lock, i.e., requests are organized into something like a queue. On top of that, I added a uvicorn configuration to allow only ten concurrent requests. So up to ten parallel requests will wait "in a queue" for the model lock, and the request currently being processed will not be interrupted. If an 11th request arrives, the server immediately responds with a 503. This approach suits common scenarios with a multiuser chatbot UI and API access; a sketch of the idea is shown below.
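For illustration only, here is a minimal sketch of the approach described above, not the PR's actual code (the endpoint, `run_model` placeholder, and lock name are hypothetical): a global `asyncio.Lock` serializes model access so an in-flight generation is never interrupted, while uvicorn's `limit_concurrency` setting rejects requests beyond the limit with an immediate HTTP 503.

```python
import asyncio

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llama_lock = asyncio.Lock()  # hypothetical global lock guarding the single model instance


class CompletionRequest(BaseModel):
    prompt: str


def run_model(prompt: str) -> str:
    # Placeholder for the blocking llama-cpp-python generation call.
    return f"echo: {prompt}"


@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    # Requests queue here on the lock, so only one generation runs at a
    # time and an in-flight generation is never interrupted.
    async with llama_lock:
        text = await asyncio.to_thread(run_model, req.prompt)
    return {"text": text}


if __name__ == "__main__":
    # limit_concurrency makes uvicorn answer requests beyond the limit
    # with HTTP 503 before they ever reach the endpoint.
    uvicorn.run(app, host="127.0.0.1", port=8000, limit_concurrency=10)
```

Note that `limit_concurrency` counts all in-flight connections, including the one currently holding the lock, so a limit of ten roughly corresponds to "one running plus nine queued" requests.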
I also changed a few other things to fix PEP warnings reported by the linter in my IDE.