Support async scheduling with TPU-inference's RayExecutor #1912
Conversation
Signed-off-by: Guangxiang Du <gxd@google.com>
Great job Guangxiang! We should also enable async scheduling in multi-host/disagg e2e testing.
Will do, once the vLLM repo commit is submitted :)
Support async scheduling with TPU-inference's RayExecutor
Implement the functionality in TPU-inference's RayExecutor subclass instead of in vLLM's RayExecutor parent class, for more flexibility.
TPUPlatform overrides vLLM's Platform.executors_supports_async_scheduling() so that our custom executor is on the whitelist of executors that support async scheduling. Sent a separate PR to the vLLM repo: vllm-project/vllm#36924.
Shares a similar idea with vllm-project/vllm#29012.
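The whitelist override described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not vLLM's actual code: the `Platform` base class, the method's return type, and the executor names in the base whitelist are all assumptions made for the example; only the method name `executors_supports_async_scheduling` comes from this PR.

```python
# Hedged sketch of the platform-level whitelist override.
# "Platform" here is a stand-in for vLLM's platform base class;
# the base whitelist contents are illustrative, not vLLM's real values.

class Platform:
    """Minimal stand-in for vLLM's Platform interface."""

    @classmethod
    def executors_supports_async_scheduling(cls) -> list[str]:
        # Names of executor classes allowed to use async scheduling
        # (illustrative values only).
        return ["MultiprocExecutor", "UniProcExecutor"]


class TPUPlatform(Platform):
    """Stand-in for TPU-inference's platform subclass."""

    @classmethod
    def executors_supports_async_scheduling(cls) -> list[str]:
        # Extend the base whitelist so TPU-inference's custom
        # RayExecutor subclass passes the async-scheduling check.
        return super().executors_supports_async_scheduling() + ["RayExecutor"]
```

With this shape, the scheduler-side check only consults the active platform's whitelist, so the custom executor is enabled without touching the parent RayExecutor in the vLLM repo.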
Tests
Unit test.
E2E benchmark, xprof trace: http://xprof/trace_viewer.html?session_id=gxd-2909041915368964943 (there is no TPU bubble now)
Quality test: