[Kernel] Enhance MoE benchmarking & tuning script #4921
Conversation
@pcmoritz The PR is not ready yet. I will ping you once it is ready.
Sounds great, thank you :)
@pcmoritz This PR is ready now. Sorry for the delay.
One small gotcha I ran into while trying this out: currently fp8 can't be benchmarked with an FP16 checkpoint; it errors out without a change like the following:

```diff
diff --git a/benchmarks/kernels/benchmark_moe.py b/benchmarks/kernels/benchmark_moe.py
index 6796ea401..3f3005e20 100644
--- a/benchmarks/kernels/benchmark_moe.py
+++ b/benchmarks/kernels/benchmark_moe.py
@@ -46,6 +46,8 @@ def benchmark_config(
     w2_scale = torch.randn(num_experts, dtype=torch.float32)
     a1_scale = torch.randn(1, dtype=torch.float32)
     a2_scale = torch.randn(1, dtype=torch.float32)
+    w1 = w1.to(torch.float8_e4m3fn)
+    w2 = w2.to(torch.float8_e4m3fn)
     input_gating = torch.empty(num_tokens, num_experts, dtype=torch.float32)
```

This matters since FP8 checkpoints are not widely available yet, and for vLLM FP8 we also support running FP16 checkpoints in FP8 :)
@pcmoritz I addressed your comments. PTAL.
Thanks! I've been using the new script to do some tuning for FP8 and it works like a charm, thanks a lot for improving it -- I'll open a PR with the new configs shortly, after I have tested them!
Btw, in order to get progress bars, I've been using this modification: `from ray.experimental.tqdm_ray import tqdm`, and then where we iterate over the configs: `for config in tqdm(search_space)`. This prints progress bars without messing up stdout, and it works like this: https://docs.ray.io/en/latest/ray-observability/user-guides/configure-logging.html#distributed-progress-bars-tqdm
Feel free to add it (don't worry that it is currently in the experimental namespace -- I think it is one of the APIs that should be stabilized, and I'll look into that).
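The suggestion above can be sketched end to end. A minimal example, where `search_space` mirrors the script's list of Triton configs but its contents here are hypothetical, and the fallback branch exists only so the sketch runs without Ray installed:

```python
try:
    # Ray-aware progress bar: prints bars without corrupting stdout
    # when multiple Ray workers report progress concurrently.
    from ray.experimental.tqdm_ray import tqdm
except ImportError:
    # Illustration-only fallback: identity wrapper when Ray is absent.
    def tqdm(iterable):
        return iterable

# Hypothetical stand-in for the script's kernel-config search space.
search_space = [{"BLOCK_SIZE_M": m} for m in (16, 32, 64)]

results = []
for config in tqdm(search_space):
    # The real script calls its benchmarking routine per config here;
    # we just record the config to keep the sketch self-contained.
    results.append(config["BLOCK_SIZE_M"])
```

Because the progress bar is just an iterable wrapper, dropping it in requires no other changes to the tuning loop.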
@pcmoritz Ray tqdm is really cool! I actually wanted exactly this feature. Happy to add it!
Tune Qwen2-57B-A14B configs based on #4921 (#5497)

Throughput performance command:
python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2

A100 GPU benchmark:

| | no config | w/ PR |
|---|---|---|
| tp=2 | 10.53 requests/s, 11058.17 tokens/s | 12.47 requests/s, 13088.57 tokens/s |
| tp=4 | 17.77 requests/s, 18662.95 tokens/s | 20.20 requests/s, 21212.32 tokens/s |
This PR enhances the MoE tuning & benchmarking script, which is a bit hacky at the moment. It also enables using multiple GPUs for benchmarking via Ray.