[1/N] Elastic EP Milestone 2 #34861
Merged
Changes from all commits (40 commits)
- `ef5723c` eep phase: stateless group + CUDA graph support (libertyeagle)
- `66ce382` fix precommit (tlrmchlsmth)
- `f2d6300` require current vllm config to be set in init_model_parallel as well (tlrmchlsmth)
- `a02b00e` partially handle review comments from cursor (tlrmchlsmth)
- `d76dfea` use allgather_reducescatter instead of pplx (tlrmchlsmth)
- `38ffbf8` update (tlrmchlsmth)
- `7e0cd87` world_size -> world_size_across_dp for executor selection (tlrmchlsmth)
- `65829d7` dummy_weights -> load_dummy_weights and fix cpu model runner (tlrmchlsmth)
- `31b3bb7` fixup (tlrmchlsmth)
- `998169b` set current vllm config in tests (tlrmchlsmth)
- `1700fcf` more vllm config wrestling (tlrmchlsmth)
- `d9914ff` More current_vllm_config fixes (tlrmchlsmth)
- `fced7ec` precommit (tlrmchlsmth)
- `dad7698` [CI Fix] Fix tests to set vllm_config before initialize_model_parallel (rtourgeman)
- `5718393` [CI Fix] Fix circular reference between AsyncLLM and output_handler (rtourgeman)
- `f685cf2` Remove support dynamo (itayalroy)
- `e169d1e` Create a stateless EPLB group for elastic EP (itayalroy)
- `88c8aeb` Pass sp_size to FusedMoEParallelConfig.make (itayalroy)
- `ffcfa2c` Fix eplb is async field name (itayalroy)
- `3ae0a4e` Fix is_ep_communicator check (itayalroy)
- `5553e32` Reinit modular kernel on EP scaling events (itayalroy)
- `47ad0b0` Implement destroy for DeepEP (itayalroy)
- `cbdce04` Torch recompile on existing ranks after scale up/down (itayalroy)
- `60fb2a5` Single api server in test elastic ep (itayalroy)
- `037ba81` Reduce memory util in test_elastic_ep.py (itayalroy)
- `30bbd48` Force stop ray procs between tests (itayalroy)
- `5ac19a9` Defer elastic EP port allocation to after ray.init() (itayalroy)
- `de2fec6` Graceful comm group destruction on scale-down (itayalroy)
- `833bbc8` Move elastic EP standby groups into dedicated module (itayalroy)
- `69363f0` Deduplicate local_all_ranks calculation (itayalroy)
- `9643d19` Fix vllm config in tests (itayalroy)
- `ac220cb` Require PP=1 for elastic EP (itayalroy)
- `35dad3f` Fix ZMQ port TOCTOU in MPClient (itayalroy)
- `8394d65` Revert pplx kernels installation changes (itayalroy)
- `369bbd3` Fix dp weight transfer send/recv mismatch (itayalroy)
- `1d8a0f4` Remove unused param max_concurrent_workers (itayalroy)
- `942b156` Update dp_size in vllm_config used in create_standby_groups (itayalroy)
- `bb00d97` Fixed staged barrier possible race (itayalroy)
- `9da3694` Fix test_elastic_ep path (itayalroy)
- `d87fe82` Merge branch 'main' into eep_m2_rebase (tlrmchlsmth)
New file `test_elastic_ep.py` (202 lines added):

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import os
import subprocess
import time

import pytest
import requests

from ..evals.gsm8k.gsm8k_eval import evaluate_gsm8k
from ..utils import RemoteOpenAIServer, multi_gpu_test


@pytest.fixture(autouse=True)
def cleanup_ray_between_tests():
    """Force-stop any lingering Ray processes between tests."""
    subprocess.run(["ray", "stop", "--force"], timeout=30, capture_output=True)
    time.sleep(5)
    yield


MODEL_NAME = "deepseek-ai/DeepSeek-V2-Lite-Chat"

NUM_GSM8K_QUESTIONS = 256
EXPECTED_ACCURACY = 0.58
ACCURACY_TOL = 0.08
MAX_NUM_SEQS = 32


def _send_scale_command(server: RemoteOpenAIServer, new_dp_size: int) -> bool:
    url = server.url_for("scale_elastic_ep")
    payload = {"new_data_parallel_size": new_dp_size}
    headers = {"Content-Type": "application/json"}

    try:
        response = requests.post(url, json=payload, headers=headers, timeout=300)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False


def _run_gsm8k_eval(server: RemoteOpenAIServer, stage: str) -> float:
    assert server.port is not None
    result = evaluate_gsm8k(
        num_questions=NUM_GSM8K_QUESTIONS,
        host=f"http://{server.host}",
        port=server.port,
    )
    accuracy = result["accuracy"]
    print(
        f"[{stage}] GSM8K accuracy: {accuracy:.3f} "
        f"({result['num_questions']} questions)"
    )
    assert accuracy >= EXPECTED_ACCURACY, (
        f"[{stage}] GSM8K accuracy {accuracy:.3f} is below "
        f"expected threshold {EXPECTED_ACCURACY}"
    )
    return accuracy


@multi_gpu_test(num_gpus=4)
def test_elastic_ep_scaling():
    vllm_serve_args = [
        "--trust-remote-code",
        "--tensor-parallel-size",
        "1",
        "--gpu-memory-utilization",
        "0.8",
        "--max-model-len",
        "4096",
        "--max-num-seqs",
        str(MAX_NUM_SEQS),
        "--enable-expert-parallel",
        "--all2all-backend",
        "allgather_reducescatter",
        "--enable-elastic-ep",
        "--enable-eplb",
        "--eplb-config.num_redundant_experts",
        "0",
        "--data-parallel-backend",
        "ray",
        "--data-parallel-size",
        "2",
        "--api-server-count",
        "1",
    ]

    leader_address = os.environ.get("LEADER_ADDRESS")
    if leader_address:
        vllm_serve_args.extend(["--data-parallel-address", leader_address])

    with RemoteOpenAIServer(
        MODEL_NAME, vllm_serve_args, env_dict={}, max_wait_seconds=1200
    ) as server:
        initial_accuracy = _run_gsm8k_eval(server, "Initial (2 GPUs)")

        assert _send_scale_command(server, 4)
        time.sleep(10)
        scale_up_accuracy = _run_gsm8k_eval(server, "After scale up (4 GPUs)")

        assert scale_up_accuracy >= initial_accuracy - ACCURACY_TOL, (
            f"Scale up accuracy {scale_up_accuracy:.3f} dropped more than "
            f"{ACCURACY_TOL} below initial accuracy {initial_accuracy:.3f}"
        )

        assert _send_scale_command(server, 2)
        time.sleep(5)
        scale_down_accuracy = _run_gsm8k_eval(server, "After scale down (2 GPUs)")

        assert scale_down_accuracy >= initial_accuracy - ACCURACY_TOL, (
            f"Scale down accuracy {scale_down_accuracy:.3f} dropped more than "
            f"{ACCURACY_TOL} below initial accuracy {initial_accuracy:.3f}"
        )

        print("\nAccuracy Summary:")
        print(f"  Initial:    {initial_accuracy:.3f}")
        print(
            f"  Scale up:   {scale_up_accuracy:.3f} "
            f"(diff: {scale_up_accuracy - initial_accuracy:+.3f})"
        )
        print(
            f"  Scale down: {scale_down_accuracy:.3f} "
            f"(diff: {scale_down_accuracy - initial_accuracy:+.3f})"
        )
        print(f"  Tolerance:  {ACCURACY_TOL:.3f}")


@multi_gpu_test(num_gpus=4)
def test_elastic_ep_scaling_uneven():
    """Test scale up with uneven worker distribution.

    This tests the case where num_new_workers % old_dp_size != 0,
    specifically 2 -> 3 where remainder = 1 % 2 = 1.
    This exercises the remainder handling in sender-receiver pairing.
    """
    vllm_serve_args = [
        "--trust-remote-code",
        "--tensor-parallel-size",
        "1",
        "--gpu-memory-utilization",
        "0.8",
        "--max-model-len",
        "4096",
        "--max-num-seqs",
        str(MAX_NUM_SEQS),
        "--enable-expert-parallel",
        "--all2all-backend",
        "allgather_reducescatter",
        "--enable-elastic-ep",
        "--enable-eplb",
        "--eplb-config.num_redundant_experts",
        "0",
        "--data-parallel-backend",
        "ray",
        "--data-parallel-size",
        "2",
        "--api-server-count",
        "1",
    ]

    leader_address = os.environ.get("LEADER_ADDRESS")
    if leader_address:
        vllm_serve_args.extend(["--data-parallel-address", leader_address])

    with RemoteOpenAIServer(
        MODEL_NAME, vllm_serve_args, env_dict={}, max_wait_seconds=1200
    ) as server:
        initial_accuracy = _run_gsm8k_eval(server, "Initial (2 GPUs)")

        # Scale 2 -> 3: This has remainder = 1 % 2 = 1
        # Tests uneven sender-receiver pairing
        assert _send_scale_command(server, 3)
        time.sleep(10)
        scale_up_accuracy = _run_gsm8k_eval(server, "After scale up (3 GPUs)")

        assert scale_up_accuracy >= initial_accuracy - ACCURACY_TOL, (
            f"Scale up accuracy {scale_up_accuracy:.3f} dropped more than "
            f"{ACCURACY_TOL} below initial accuracy {initial_accuracy:.3f}"
        )

        # Scale back down to 2
        assert _send_scale_command(server, 2)
        time.sleep(5)
        scale_down_accuracy = _run_gsm8k_eval(server, "After scale down (2 GPUs)")

        assert scale_down_accuracy >= initial_accuracy - ACCURACY_TOL, (
            f"Scale down accuracy {scale_down_accuracy:.3f} dropped more than "
            f"{ACCURACY_TOL} below initial accuracy {initial_accuracy:.3f}"
        )

        print("\nAccuracy Summary (Uneven Scaling):")
        print(f"  Initial:    {initial_accuracy:.3f}")
        print(
            f"  Scale up:   {scale_up_accuracy:.3f} "
            f"(diff: {scale_up_accuracy - initial_accuracy:+.3f})"
        )
        print(
            f"  Scale down: {scale_down_accuracy:.3f} "
            f"(diff: {scale_down_accuracy - initial_accuracy:+.3f})"
        )
        print(f"  Tolerance:  {ACCURACY_TOL:.3f}")
```
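The uneven-scaling test's docstring refers to "remainder handling in sender-receiver pairing" when `num_new_workers % old_dp_size != 0`. The actual pairing logic lives inside vLLM's elastic EP implementation; the following is only a minimal hypothetical sketch (the function name `pair_senders_receivers` and round-robin scheme are illustrative assumptions, not the real code) of why 2 -> 3 is the interesting case: with one new worker and two existing ranks, the receivers cannot be distributed evenly across senders.

```python
def pair_senders_receivers(old_dp_size: int, new_dp_size: int) -> dict[int, list[int]]:
    """Illustrative round-robin pairing: map each existing rank (sender)
    to the new ranks (receivers) it transfers weights to. Hypothetical
    stand-in for vLLM's actual elastic EP pairing logic."""
    num_new_workers = new_dp_size - old_dp_size
    pairing: dict[int, list[int]] = {rank: [] for rank in range(old_dp_size)}
    for i in range(num_new_workers):
        # New worker ranks start at old_dp_size; senders cycle round-robin.
        pairing[i % old_dp_size].append(old_dp_size + i)
    return pairing


# Scaling 2 -> 4: 2 % 2 == 0, every sender serves exactly one receiver.
print(pair_senders_receivers(2, 4))  # {0: [2], 1: [3]}

# Scaling 2 -> 3: 1 % 2 == 1, pairing is uneven (rank 1 sends to no one).
print(pair_senders_receivers(2, 3))  # {0: [2], 1: []}
```

Under this sketch the remainder simply means some senders serve one more receiver than others, which is the edge the `test_elastic_ep_scaling_uneven` case exercises.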
@itayalroy do you know why this is needed? Does something go wrong when the API server count is greater than 1?
The default is to set the API server count equal to the number of DP ranks, so it would be good to file an issue if this is broken.
The API server maintains state across scaling events (the core_engines list, scaling flags, etc.). To support multiple API servers, we would need to either sync this state between them or ensure all scaling requests are handled by the same API server; neither is implemented in this PR.
This issue comes from the original PR; see the comment in `parallel.py`, line 661.
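The state problem described above can be made concrete with a toy model. This is a hypothetical illustration only: `ApiServer`, `core_engines`, and `handle_scale` are simplified stand-ins, not the actual vLLM classes. Each API server process keeps its own view of the engine topology, so a scale request handled by one server leaves every other server's view stale:

```python
class ApiServer:
    """Toy stand-in for an API server process with per-process state."""

    def __init__(self, dp_size: int):
        # One handle per data-parallel engine rank (local to this process).
        self.core_engines = list(range(dp_size))
        self.scaling_in_progress = False

    def handle_scale(self, new_dp_size: int) -> None:
        # Only the server that receives the request updates its state.
        self.scaling_in_progress = True
        self.core_engines = list(range(new_dp_size))
        self.scaling_in_progress = False


server_a = ApiServer(dp_size=2)
server_b = ApiServer(dp_size=2)

# The scale request lands on server_a only.
server_a.handle_scale(new_dp_size=4)

print(len(server_a.core_engines))  # 4
print(len(server_b.core_engines))  # 2 -- stale view of the topology
```

With no state sync between processes, `server_b` still routes as if dp_size were 2, which is why the PR restricts elastic EP to `--api-server-count 1`.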