Adding max queue time parameter #4190
Closed
KrishnaM251 wants to merge 38 commits into vllm-project:main from
simon-mo reviewed on Apr 19, 2024
Comment on lines +111 to +147
sample_chats = [[{
    "role": "system",
    "content": "You are a helpful assistant."
}, {
    "role": "user",
    "content": "Who won the world series in 2020?"
}],
                [{
                    "role": "system",
                    "content": "You are a helpful assistant."
                }, {
                    "role": "user",
                    "content": "Where was the 2020 world series played?"
                }],
                [{
                    "role": "system",
                    "content": "You are a helpful assistant."
                }, {
                    "role": "user",
                    "content": "How long did the 2020 world series last?"
                }],
                [{
                    "role": "system",
                    "content": "You are a helpful assistant."
                }, {
                    "role": "user",
                    "content": "What were some television viewership statistics?"
                }],
                [{
                    "role": "system",
                    "content": "You are a helpful assistant."
                }, {
                    "role": "user",
                    "content": "Why was the 2020 world series so popular?"
                }]]
Collaborator
Use the completion API so the test is a lot shorter.
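A minimal sketch of what the shorter test input might look like, assuming the completion endpoint is called as `client.completions.create(model=..., prompt=..., max_tokens=...)` on the OpenAI Python client. `FakeClient`, `FakeCompletions`, `FakeCompletion`, `run_prompts`, and the `"stub-model"` name are hypothetical stand-ins so that the snippet runs without a server; they are not part of the PR.

```python
import asyncio
from dataclasses import dataclass

# Plain prompt strings replace the nested chat-message lists.
sample_prompts = [
    "Who won the world series in 2020?",
    "Where was the 2020 world series played?",
    "How long did the 2020 world series last?",
    "What were some television viewership statistics?",
    "Why was the 2020 world series so popular?",
]


@dataclass
class FakeCompletion:
    """Hypothetical response object; a real client returns a richer type."""
    text: str


class FakeCompletions:
    """Hypothetical stand-in for `client.completions`; never contacts a server."""

    async def create(self, *, model: str, prompt: str,
                     max_tokens: int) -> FakeCompletion:
        await asyncio.sleep(0)  # yield control, as a network call would
        return FakeCompletion(text=f"[{model}] {prompt}")


class FakeClient:
    """Hypothetical stand-in for an async OpenAI-style client."""

    def __init__(self) -> None:
        self.completions = FakeCompletions()


async def run_prompts(client, model_name: str):
    # One coroutine per prompt, all awaited concurrently.
    coroutines = [
        client.completions.create(model=model_name, prompt=prompt,
                                  max_tokens=400)
        for prompt in sample_prompts
    ]
    return await asyncio.gather(*coroutines, return_exceptions=True)


responses = asyncio.run(run_prompts(FakeClient(), "stub-model"))
```

In the actual test, the stub client would be replaced by the client fixture that the test module already uses.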
Comment on lines +149 to +171
async def make_api_call(sample_chat):
    chat_completion = await client.chat.completions.create(
        messages=sample_chat,
        model=model_name,
        temperature=0.8,
        presence_penalty=0.2,
        max_tokens=400,
    )
    return chat_completion

async def main():
    coroutines = [
        make_api_call(sample_chat) for sample_chat in sample_chats
    ]

    responses = await asyncio.gather(*coroutines, return_exceptions=True)

    for response in responses:
        logger.info(response)
        if isinstance(response, JSONResponse):
            assert response.status_code == 503

await main()
\ No newline at end of file
Collaborator
You are already in an async context, so you can make this a lot easier by just doing:
coroutines = [
    client.chat.completions.create(
        messages=sample_chat,
        model=model_name,
        temperature=0.8,
        presence_penalty=0.2,
        max_tokens=400,
    ) for sample_chat in sample_chats
]
responses = await asyncio.gather(*coroutines, ...)
for ...
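The pattern suggested above can be sketched end to end as follows. This is an illustration under stated assumptions, not the PR's test: `FakeStatusError`, `fake_request`, `send_all`, and the `capacity` parameter are hypothetical and simulate a server that rejects requests beyond its capacity. With `return_exceptions=True`, `asyncio.gather` returns raised exceptions as list items instead of propagating the first one, so rejected requests can be inspected alongside the successful ones.

```python
import asyncio


class FakeStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


async def fake_request(index: int, capacity: int) -> str:
    """Hypothetical request: succeeds for the first `capacity` indices."""
    await asyncio.sleep(0)  # yield control, as a network call would
    if index >= capacity:
        raise FakeStatusError(503)
    return f"response {index}"


async def send_all(num_requests: int, capacity: int):
    # Build the coroutines directly; no wrapper function per call is needed.
    coroutines = [fake_request(i, capacity) for i in range(num_requests)]
    # Exceptions come back as items in the result list, in input order.
    return await asyncio.gather(*coroutines, return_exceptions=True)


responses = asyncio.run(send_all(num_requests=5, capacity=2))
rejected = [r for r in responses if isinstance(r, FakeStatusError)]
accepted = [r for r in responses if not isinstance(r, Exception)]
```

Checking for an exception type rather than a response object matters here: if the client raises on a 503 status, a check such as `isinstance(response, JSONResponse)` would never match, and the assertion inside it would never run.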
Collaborator
Also, because you are using the server, you should just move the test into test_openai_server.py in entrypoints to reuse the same server.
vllm/config.py
Outdated
Comment on lines +610 to +612
# TODO: verify max_queue_length
FIX #2901
[Core] [Frontend]
Description of Changes
I added a new field to EngineArgs called max_queue_length. If an attempt is made to queue more requests than max_queue_length, an error is thrown. If an OpenAI-compatible API server is being used, a 503 error is returned for the requests that would have exceeded max_queue_length + max_num_seqs.
Tests
test_max_queue_length() in test_openai_server.py
--max-queue-length param and the subsequent --max-num-seqs param (lines 118 - 121)
This --max-num-seqs param is necessary to ensure that the running queue can only hold one request at a time, forcing the waiting queue to hold the rest. max_queue_length controls the max length of the waiting queue.
Sync Changes
Async Changes
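The behavior described under Description of Changes and Tests can be illustrated with a toy model. This is a simplified sketch, not vLLM's scheduler: `ToyScheduler`, `QueueFullError`, and `submit_all` are hypothetical names, and the sketch assumes a request goes straight to the running list when a slot is free, which a real engine may only do during a scheduling step. It shows why, with `max_num_seqs=1`, the number of accepted requests is `max_queue_length + max_num_seqs`.

```python
from collections import deque


class QueueFullError(Exception):
    """Hypothetical error raised when the waiting queue is at its limit."""


class ToyScheduler:
    """Toy model: a bounded running list plus a bounded waiting queue."""

    def __init__(self, max_num_seqs: int, max_queue_length: int) -> None:
        self.max_num_seqs = max_num_seqs
        self.max_queue_length = max_queue_length
        self.running: list[str] = []
        self.waiting: deque[str] = deque()

    def add_request(self, request_id: str) -> None:
        if len(self.running) < self.max_num_seqs:
            # A running slot is free, so the request starts immediately.
            self.running.append(request_id)
        elif len(self.waiting) < self.max_queue_length:
            # No running slot, but the waiting queue still has room.
            self.waiting.append(request_id)
        else:
            # Both are full; an API server could map this to HTTP 503.
            raise QueueFullError(f"waiting queue is full: {request_id}")


def submit_all(scheduler: ToyScheduler, request_ids: list[str]) -> list[str]:
    """Submit every request and return the IDs that were rejected."""
    rejected = []
    for request_id in request_ids:
        try:
            scheduler.add_request(request_id)
        except QueueFullError:
            rejected.append(request_id)
    return rejected


scheduler = ToyScheduler(max_num_seqs=1, max_queue_length=2)
rejected = submit_all(scheduler, ["r0", "r1", "r2", "r3", "r4"])
```

With one running slot and two waiting slots, three of the five requests are accepted and the last two are rejected, which mirrors the reason the test pins `--max-num-seqs` to a small value.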