
Adding max queue time parameter #4190

Closed
KrishnaM251 wants to merge 38 commits into vllm-project:main from KrishnaM251:max-queue-len

Conversation

Contributor

@KrishnaM251 commented Apr 19, 2024

FIX #2901
[Core] [Frontend]

Description of Changes
I added a new field to EngineArgs called max_queue_length. If an attempt is made to queue more requests than max_queue_length allows, an error is thrown. If the OpenAI-compatible API server is being used, HTTP 503 is returned for the requests that would exceed max_queue_length + max_num_seqs.
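As a rough sketch of the intended wiring (the trimmed-down EngineArgs below is illustrative only, not vLLM's real class; field and flag names mirror the PR description):

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineArgs:
    """Trimmed-down stand-in for vLLM's EngineArgs, for illustration only."""
    max_num_seqs: int = 256
    # None means the waiting queue is unbounded (the pre-PR behavior).
    max_queue_length: Optional[int] = None


def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Mirrors the parser.add_argument() call described under "Sync Changes".
    parser.add_argument(
        "--max-queue-length",
        type=int,
        default=EngineArgs.max_queue_length,
        help="Maximum number of requests allowed in the waiting queue.")
    return parser
```

Leaving the default at None keeps existing deployments unchanged; the limit only kicks in when the flag is passed explicitly.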

Tests

  • engine/test_max_queue_length.py
    • throws an error if an attempt is made to exceed the waiting queue
    • based heavily on llm_engine_example.py
  • test_max_queue_length() in test_openai_server.py
    • before running this test, make sure to:
      • uncomment line 79 (MAX_QUEUE_LEN)
      • uncomment the new --max-queue-length param and the subsequent --max-num-seqs param (lines 118-121). This --max-num-seqs param is necessary to ensure that the running queue can hold only one request at a time, forcing the waiting queue to hold the rest; max_queue_length controls the max length of the waiting queue.
      • comment out the original --max-num-seqs (lines 116 and 117)

Sync Changes

  • args_utils.py - add a new parameter called max_queue_length to EngineArgs
    • EngineArgs.max_queue_length
    • parser.add_argument() for max_queue_length
  • scheduler.py - add the param to the SchedulerConfig object
    • SchedulerConfig __init__()
  • config.py
    • SchedulerConfig verifyArgs()
    • get_max_queue_length
  • llm_engine.py
    • check whether the waiting queue exceeds max_queue_length in _add_processed_request
  • engine_args.rst
    • for parsing CLI args
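The overflow check described for _add_processed_request could be sketched like this (the function and error names here are hypothetical stand-ins; the real code would read the limit from SchedulerConfig):

```python
from typing import Optional


class QueueOverflowError(RuntimeError):
    """Hypothetical error raised when the waiting queue is already full."""


def check_queue_capacity(num_waiting: int,
                         max_queue_length: Optional[int]) -> None:
    # A limit of None keeps the queue unbounded, matching the default.
    if max_queue_length is not None and num_waiting >= max_queue_length:
        raise QueueOverflowError(
            f"Waiting queue is full: {num_waiting} >= {max_queue_length}")
```

Raising before the request is enqueued lets the synchronous LLMEngine fail fast, while the async path can catch the error and translate it into an HTTP status.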

Async Changes

  • async_llm_engine.py
    • check for queue overflow error
  • serving_chat.py
    • add status to create_error_response()
  • serving_completion.py
    • add status to create_error_response()
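Sketched end to end, the async serving path maps a full queue to HTTP 503 roughly as follows (the dict payload is a stand-in for the JSONResponse the real serving_chat.py and serving_completion.py build, and admit_request is a hypothetical helper):

```python
from http import HTTPStatus


def create_error_response(message: str, status_code: HTTPStatus) -> dict:
    # Stand-in for the OpenAI-style error body the server would return.
    return {"error": {"message": message, "code": status_code.value}}


def admit_request(num_waiting: int, max_queue_length: int):
    # Returns None on success, or a 503 error body when the queue is full.
    if num_waiting >= max_queue_length:
        return create_error_response(
            f"Waiting queue is full ({num_waiting}/{max_queue_length})",
            HTTPStatus.SERVICE_UNAVAILABLE)
    return None
```

503 (Service Unavailable) fits here because the overload is transient: clients can retry once earlier requests drain from the queue.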

Comment on lines +111 to +147
sample_chats = [
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Who won the world series in 2020?"}],
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Where was the 2020 world series played?"}],
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "How long did the 2020 world series last?"}],
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "What were some television viewership statistics?"}],
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Why was the 2020 world series so popular?"}],
]
Collaborator

Use the completion API so the test is a lot shorter.
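The reviewer's suggestion might look roughly like this (sample_prompts and fire_requests are hypothetical names; the completion API takes plain prompt strings, so the chat-message scaffolding disappears):

```python
import asyncio

# With the completion API, each chat transcript reduces to a prompt string.
sample_prompts = [
    "Who won the world series in 2020?",
    "Where was the 2020 world series played?",
    "How long did the 2020 world series last?",
]


async def fire_requests(client, model_name: str):
    # client is assumed to be an AsyncOpenAI-style client pointed at the server.
    coroutines = [
        client.completions.create(model=model_name,
                                  prompt=prompt,
                                  max_tokens=400)
        for prompt in sample_prompts
    ]
    # return_exceptions=True lets 503 failures surface alongside successes.
    return await asyncio.gather(*coroutines, return_exceptions=True)
```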

Comment on lines +149 to +171
async def make_api_call(sample_chat):
    chat_completion = await client.chat.completions.create(
        messages=sample_chat,
        model=model_name,
        temperature=0.8,
        presence_penalty=0.2,
        max_tokens=400,
    )
    return chat_completion

async def main():
    coroutines = [
        make_api_call(sample_chat) for sample_chat in sample_chats
    ]

    responses = await asyncio.gather(*coroutines, return_exceptions=True)

    for response in responses:
        logger.info(response)
        if isinstance(response, JSONResponse):
            assert response.status_code == 503

await main()
Collaborator

You are already in an async context. You can make it a lot easier by just doing:

coroutines = [
    client.chat.completions.create(
        messages=sample_chat,
        model=model_name,
        temperature=0.8,
        presence_penalty=0.2,
        max_tokens=400,
    ) for sample_chat in sample_chats
]
responses = await asyncio.gather(*coroutines, ...)
for ...

Collaborator

Also, because you are using the server, you should just move the test into test_openai_server.py in entrypoints to reuse the same server.

Collaborator

remove?

vllm/config.py Outdated
Comment on lines +610 to +612
# TODO: verify max_queue_length


Collaborator

?

@simon-mo simon-mo self-assigned this Apr 19, 2024


Development

Successfully merging this pull request may close these issues.

Controlling max queue time

2 participants