Change server approach to handle parallel requests #1550

Open · wants to merge 2 commits into main

Conversation

sergey-zinchenko

I have made a change to the way the server handles concurrent requests. With this PR, arriving requests wait on the model's global async lock, so requests are effectively organized into a queue. On top of that, I added a uvicorn configuration allowing at most ten concurrent requests. So up to ten parallel requests will wait "in a queue" for the model lock, and the request currently being served will not be interrupted. If an 11th request arrives, the server immediately responds with 503. This approach suits the common scenarios of a multiuser chatbot UI and API access.
I also changed some other things to fix PEP warnings raised by the linter in my IDE.
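
For illustration, here is a minimal sketch of the queuing approach described above, assuming an asyncio.Lock guarding the shared model and uvicorn's limit_concurrency option producing the 503 responses; the endpoint body and the run_model stand-in are hypothetical, not the PR's actual code:

import asyncio

import uvicorn
from fastapi import FastAPI

app = FastAPI()

# A single global lock: concurrent requests queue up here instead of
# interleaving calls into the shared model.
model_lock = asyncio.Lock()


def run_model(body: dict) -> dict:
    # Hypothetical stand-in for the actual llama.cpp generation call.
    return {"choices": []}


@app.post("/v1/chat/completions")
async def create_chat_completion(body: dict):
    async with model_lock:       # waits "in a queue" for the model lock
        return run_model(body)   # the running request is never interrupted


if __name__ == "__main__":
    # uvicorn rejects connections beyond the limit with 503 Service Unavailable,
    # so an 11th concurrent request is turned away immediately.
    uvicorn.run(app, host="0.0.0.0", port=8000, limit_concurrency=10)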

@sergey-zinchenko sergey-zinchenko changed the title Change server approach to handle parallel request Change server approach to handle parallel requests Jun 24, 2024
@sergey-zinchenko
Author

@abetlen What do you think about these changes?

@gerdemann

Hey, thanks for this PR. Is it possible to get it merged? 😄

@sergey-zinchenko
Author

@gerdemann @Smartappli Hi! I authored this PR two months ago. It looks like it has some conflicts now. I can fix them today if there is somebody who can merge it right after that.

@sergey-zinchenko
Author

@gerdemann @Smartappli I also see that over these two months there has been some activity in the main branch related to how the server handles parallel requests. Is this still an issue?

@gerdemann

gerdemann commented Aug 19, 2024

I still get this error when two requests are made at the same time:

disconnected
Disconnected from client (via refresh/close) Address(host='10.32.20.82', port=58506)
ERROR:    ASGI callable returned without completing response.
Llama.generate: 64 prefix-match hit, remaining 45 prompt tokens to eval

I tried to install your branch directly and test it, but I get this error:

Exception: Task <Task pending name='Task-7' coro=<RequestResponseCycle.run_asgi() running at /llama/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py:406> cb=[set.discard()]> got Future <Future pending> attached to a different loop
Traceback (most recent call last):
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/errors.py", line 170, in custom_route_handler
    response = await original_route_handler(request)
  File "/llama/lib/python3.9/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/llama/lib/python3.9/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 482, in create_chat_completion
    return await handle_completion_request(request, body,
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 250, in handle_completion_request
    async for response in completion_iter:
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 203, in completion_async_generator
    async with llama_proxy_context_manager as llama_proxy:
  File "/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 79, in __aenter__
    await self._lock.acquire()
  File "/llama/lib/python3.9/asyncio/locks.py", line 120, in acquire
    await fut
RuntimeError: Task <Task pending name='Task-7' coro=<RequestResponseCycle.run_asgi() running at /llama/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py:406> cb=[set.discard()]> got Future <Future pending> attached to a different loop
INFO:     10.32.20.82:37322 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

Do you have any idea what I am doing wrong?

@Fu-Cheng

RuntimeError: Task <Task pending name='Task-7' coro=<RequestResponseCycle.run_asgi() running at /llama/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py:406> cb=[set.discard()]> got Future attached to a different loop
INFO: 10.32.20.82:37322 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

@sergey-zinchenko,

Hi, I encountered the same issue. The service is still not handling concurrent requests properly. When I send a second request while the LLM is still generating a response for the first request, I receive this error.
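
For context on the error reported in the two comments above: on Python 3.9 (the version in the traceback), asyncio.Lock() captures a loop via get_event_loop() at construction time, so a lock created at import time, before uvicorn's event loop is running, raises exactly this "attached to a different loop" RuntimeError once a second request contends for it. The following is a standalone sketch of that failure mode and the usual lazy-creation workaround; it is an assumption about the cause here, not a confirmed diagnosis of the PR's code:

import asyncio

# On Python <= 3.9 the lock binds to whatever loop get_event_loop()
# returns at construction time, i.e. the import-time loop, not the
# loop the server will later run requests on.
lock = asyncio.Lock()


async def worker(n: int) -> None:
    async with lock:              # a contended acquire awaits a Future
        await asyncio.sleep(0.1)  # created by the lock's original loop
        print(f"worker {n} done")


async def main() -> None:
    # The second worker has to wait, so Lock.acquire() creates a Future on
    # the import-time loop; awaiting it inside asyncio.run()'s new loop
    # raises "got Future <Future pending> attached to a different loop".
    await asyncio.gather(worker(1), worker(2))


# Usual workaround: create the lock lazily, once the serving loop is
# already running, so it binds to that loop instead.
_lazy_lock = None


def get_lock() -> asyncio.Lock:
    global _lazy_lock
    if _lazy_lock is None:
        _lazy_lock = asyncio.Lock()  # bound to the currently running loop
    return _lazy_lock


if __name__ == "__main__":
    asyncio.run(main())  # reproduces the RuntimeError on Python 3.9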

@gjpower

gjpower commented Oct 23, 2024

I have implemented a smaller alternative change that solves the same problem in #1798
