@liz-badada
Collaborator

@liz-badada liz-badada commented Apr 16, 2025

Motivation

Support torch profiling when SGLang PD Disaggregation is enabled. Prefill and decode profiles are generated separately.

Test based on #5435

Modifications

Add start_profile / stop_profile requests to mini_lb.py
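
As a rough illustration of the idea (not the code in this PR), a load balancer could fan a profile request out to every prefill and decode server. The sketch below assumes each backend exposes /start_profile and /stop_profile over HTTP POST; the helper names profile_urls and broadcast_profile are hypothetical, and the standard library is used here only to keep the example self-contained.

```python
import urllib.request
from typing import Dict, Iterable, List


def profile_urls(
    prefill_servers: Iterable[str],
    decode_servers: Iterable[str],
    action: str,
) -> List[str]:
    """Build one URL per backend server for the given profile action."""
    if action not in ("start_profile", "stop_profile"):
        raise ValueError(f"unknown profile action: {action}")
    servers = list(prefill_servers) + list(decode_servers)
    # Strip a trailing slash so "http://host:port/" and "http://host:port" behave the same.
    return [f"{server.rstrip('/')}/{action}" for server in servers]


def broadcast_profile(
    prefill_servers: Iterable[str],
    decode_servers: Iterable[str],
    action: str,
    timeout: float = 30.0,
) -> Dict[str, int]:
    """POST the action to every server and return {url: HTTP status}.

    Assumption: the backends accept an empty JSON body on these endpoints.
    """
    statuses: Dict[str, int] = {}
    for url in profile_urls(prefill_servers, decode_servers, action):
        request = urllib.request.Request(
            url,
            data=b"{}",
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            statuses[url] = response.status
    return statuses
```

With the proxy arguments from the examples in this description, broadcast_profile(["http://10.10.38.7:30000/"], ["http://10.10.38.10:30001/"], "start_profile") would contact one prefill and one decode endpoint, so each side can write its own trace.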

Examples (TP16 + DP16 Attn + DeepEP on H100)

Prepare ./pd_node*.json first.

Server

SGLANG_TORCH_PROFILER_DIR=./profile_log MOONCAKE_CONFIG_PATH=./pd_node0.json python -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--chwan--DeepSeek-V3-5layer/snapshots/38a0c0ee55158e7d2ac9a6af1de94c4dfe084872/ --disaggregation-mode prefill --host 10.10.38.7 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.7:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8

SGLANG_TORCH_PROFILER_DIR=./profile_log MOONCAKE_CONFIG_PATH=./pd_node1.json python -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--chwan--DeepSeek-V3-5layer/snapshots/38a0c0ee55158e7d2ac9a6af1de94c4dfe084872/ --disaggregation-mode prefill --host 10.10.38.8 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.7:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8

SGLANG_TORCH_PROFILER_DIR=./profile_log MOONCAKE_CONFIG_PATH=./pd_node2.json python -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--chwan--DeepSeek-V3-5layer/snapshots/38a0c0ee55158e7d2ac9a6af1de94c4dfe084872/ --disaggregation-mode decode --host 10.10.38.10 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128

SGLANG_TORCH_PROFILER_DIR=./profile_log MOONCAKE_CONFIG_PATH=./pd_node3.json python -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--chwan--DeepSeek-V3-5layer/snapshots/38a0c0ee55158e7d2ac9a6af1de94c4dfe084872/ --disaggregation-mode decode --host 10.10.38.4 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128

Proxy

SGLANG_TORCH_PROFILER_DIR=./profile_log python3 -m sglang.srt.disaggregation.mini_lb --prefill http://10.10.38.7:30000/ --decode http://10.10.38.10:30001/ --host 0.0.0.0 --port 8000

Bench with Profiling

SGLANG_TORCH_PROFILER_DIR=./profile_log python -m sglang.bench_serving --backend sglang --model ~/.cache/huggingface/hub/models--chwan--DeepSeek-V3-5layer/snapshots/38a0c0ee55158e7d2ac9a6af1de94c4dfe084872/ --host 127.0.0.1 --port 8000 --num-prompts 20 --dataset-name random --random-input-len 1000 --random-output-len 4 --profile

event_loop_normal_disagg_prefill
image

event_loop_normal_disagg_decode
image

Note

Known bug in the PyTorch Profiler (UTF-8 decode): pytorch/pytorch#64345

Checklist

@liz-badada liz-badada changed the title support torch profiling with pd disaggregation Torch Profiling with PD Disaggregation Apr 16, 2025
@liz-badada liz-badada changed the title Torch Profiling with PD Disaggregation Torch Profiling with DeepEP PD Disaggregation Apr 16, 2025
@whybeyoung
Collaborator

LGTM

@liz-badada liz-badada marked this pull request as ready for review April 19, 2025 11:03
@merrymercy
Contributor

please fix the lint

self.decode_servers = decode_servers
self.profiling = False

profile_dir = os.getenv("SGLANG_TORCH_PROFILER_DIR", "./tmp")
Collaborator

Is changing the output directory necessary?

self.profiling = False

profile_dir = os.getenv("SGLANG_TORCH_PROFILER_DIR", "./tmp")
os.makedirs(profile_dir, exist_ok=True)
Collaborator

Creating profile_dir is not the job of lb.
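
One possible reading of this suggestion, sketched under assumptions: the load balancer only resolves the directory from SGLANG_TORCH_PROFILER_DIR (with the "./tmp" fallback from the diff above) and leaves creating it to the servers that actually write traces. The helper name resolve_profile_dir is hypothetical and is not part of this PR.

```python
import os
from typing import Mapping, Optional


def resolve_profile_dir(
    env: Optional[Mapping[str, str]] = None,
    default: str = "./tmp",
) -> str:
    """Return the configured profile directory without creating it."""
    # Accept an explicit mapping so the lookup can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    return env.get("SGLANG_TORCH_PROFILER_DIR", default)
```

Because nothing is written to disk here, the load balancer stays free of filesystem side effects; whichever process exports the trace would call os.makedirs itself.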

5 participants