
Adding max queue time parameter #4190

Closed
KrishnaM251 wants to merge 38 commits into vllm-project:main from KrishnaM251:max-queue-len

Conversation

Contributor

@KrishnaM251 commented Apr 19, 2024

FIX #2901
[Core] [Frontend]

Description of Changes
I added a new field to EngineArgs called max_queue_length. If an attempt is made to queue more requests than max_queue_length allows, an error is thrown. If the OpenAI-compatible API server is being used, HTTP 503 is returned for the requests that would exceed max_queue_length + max_num_seqs.
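As a rough sketch of the intended wiring (the trimmed-down EngineArgs below is illustrative only, not vLLM's real class; field and flag names mirror the PR description):

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineArgs:
    """Trimmed-down stand-in for vLLM's EngineArgs, for illustration only."""
    max_num_seqs: int = 256
    # None means the waiting queue is unbounded (the pre-PR behavior).
    max_queue_length: Optional[int] = None


def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Mirrors the parser.add_argument() call described under "Sync Changes".
    parser.add_argument(
        "--max-queue-length",
        type=int,
        default=EngineArgs.max_queue_length,
        help="Maximum number of requests allowed in the waiting queue.")
    return parser
```

Leaving the default at None keeps existing deployments unchanged; the limit only kicks in when the flag is passed explicitly.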

Tests

  • engine/test_max_queue_length.py
    • throws an error if an attempt is made to exceed the waiting queue
    • based heavily on llm_engine_example.py
  • test_max_queue_length() in test_openai_server.py
    • before running this test, make sure to:
      • uncomment line 79 (MAX_QUEUE_LEN)
      • uncomment the new --max-queue-length param and the subsequent --max-num-seqs param (lines 118-121). This --max-num-seqs param is necessary to ensure that the running queue can hold only one request at a time, forcing the waiting queue to hold the rest; max_queue_length controls the max length of the waiting queue.
      • comment out the original --max-num-seqs (lines 116 and 117)

Sync Changes

  • args_utils.py - add a new parameter called max_queue_length to EngineArgs
    • EngineArgs.max_queue_length
    • parser.add_argument() for max_queue_length
  • scheduler.py - add the param to the SchedulerConfig object
    • SchedulerConfig __init__()
  • config.py
    • SchedulerConfig verifyArgs()
    • get_max_queue_length
  • llm_engine.py
    • check whether the waiting queue exceeds max_queue_length in _add_processed_request
  • engine_args.rst
    • for parsing CLI args
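The overflow check described for _add_processed_request could be sketched like this (the function and error names here are hypothetical stand-ins; the real code would read the limit from SchedulerConfig):

```python
from typing import Optional


class QueueOverflowError(RuntimeError):
    """Hypothetical error raised when the waiting queue is already full."""


def check_queue_capacity(num_waiting: int,
                         max_queue_length: Optional[int]) -> None:
    # A limit of None keeps the queue unbounded, matching the default.
    if max_queue_length is not None and num_waiting >= max_queue_length:
        raise QueueOverflowError(
            f"Waiting queue is full: {num_waiting} >= {max_queue_length}")
```

Raising before the request is enqueued lets the synchronous LLMEngine fail fast, while the async path can catch the error and translate it into an HTTP status.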

Async Changes

  • async_llm_engine.py
    • check for queue overflow error
  • serving_chat.py
    • add status to create_error_response()
  • serving_completion.py
    • add status to create_error_response()
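Sketched end to end, the async serving path maps a full queue to HTTP 503 roughly as follows (the dict payload is a stand-in for the JSONResponse the real serving_chat.py and serving_completion.py build, and admit_request is a hypothetical helper):

```python
from http import HTTPStatus


def create_error_response(message: str, status_code: HTTPStatus) -> dict:
    # Stand-in for the OpenAI-style error body the server would return.
    return {"error": {"message": message, "code": status_code.value}}


def admit_request(num_waiting: int, max_queue_length: int):
    # Returns None on success, or a 503 error body when the queue is full.
    if num_waiting >= max_queue_length:
        return create_error_response(
            f"Waiting queue is full ({num_waiting}/{max_queue_length})",
            HTTPStatus.SERVICE_UNAVAILABLE)
    return None
```

503 (Service Unavailable) fits here because the overload is transient: clients can retry once earlier requests drain from the queue.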

Comment on lines +111 to +147
sample_chats = [
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Who won the world series in 2020?"}],
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Where was the 2020 world series played?"}],
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "How long did the 2020 world series last?"}],
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "What were some television viewership statistics?"}],
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Why was the 2020 world series so popular?"}],
]
Collaborator

Use the completion API so the test is a lot shorter.
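The reviewer's suggestion might look roughly like this (sample_prompts and fire_requests are hypothetical names; the completion API takes plain prompt strings, so the chat-message scaffolding disappears):

```python
import asyncio

# With the completion API, each chat transcript reduces to a prompt string.
sample_prompts = [
    "Who won the world series in 2020?",
    "Where was the 2020 world series played?",
    "How long did the 2020 world series last?",
]


async def fire_requests(client, model_name: str):
    # client is assumed to be an AsyncOpenAI-style client pointed at the server.
    coroutines = [
        client.completions.create(model=model_name,
                                  prompt=prompt,
                                  max_tokens=400)
        for prompt in sample_prompts
    ]
    # return_exceptions=True lets 503 failures surface alongside successes.
    return await asyncio.gather(*coroutines, return_exceptions=True)
```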

Comment on lines +149 to +171
async def make_api_call(sample_chat):
    chat_completion = await client.chat.completions.create(
        messages=sample_chat,
        model=model_name,
        temperature=0.8,
        presence_penalty=0.2,
        max_tokens=400,
    )
    return chat_completion

async def main():
    coroutines = [
        make_api_call(sample_chat) for sample_chat in sample_chats
    ]

    responses = await asyncio.gather(*coroutines, return_exceptions=True)

    for response in responses:
        logger.info(response)
        if isinstance(response, JSONResponse):
            assert response.status_code == 503

await main()
Collaborator

You are already in an async context. You can make it a lot easier by just doing:

coroutines = [
    client.chat.completions.create(
        messages=sample_chat,
        model=model_name,
        temperature=0.8,
        presence_penalty=0.2,
        max_tokens=400,
    ) for sample_chat in sample_chats
]
responses = await asyncio.gather(*coroutines, ...)
for ...

Collaborator

Also, because you are using the server, you should just move the test into test_openai_server.py in entrypoints to reuse the same server.

Collaborator

remove?

vllm/config.py Outdated
Comment on lines +610 to +612
# TODO: verify max_queue_length


Collaborator

?

@simon-mo simon-mo self-assigned this Apr 19, 2024


Development

Successfully merging this pull request may close these issues.

Controlling max queue time

2 participants