Turn off autotune for scaled mm for fp8 dynamic quant in torchao #2116
Summary:
Currently we autotune in torch.compile (mode max-autotune-no-cudagraphs), which triggers Triton kernel autotuning for the fp8 dynamic quant scaled mm. This increases compilation time significantly while providing little benefit for Llama models, so we turn it off for now.
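The four-line diff itself is not reproduced here, so the snippet below is only a hedged illustration of the general mechanism, not the actual change in this PR: it uses the real inductor knob torch._inductor.config.max_autotune_gemm_backends to drop Triton from the GEMM autotune candidate set, so max-autotune compilation skips the expensive per-shape Triton tuning sweep and the scaled mm falls back to the ATen kernel. The Linear layer and shapes are placeholders for illustration.

```python
# Sketch only (assumed mechanism, not the exact diff in this PR): keep
# torch.compile's max-autotune mode but skip Triton GEMM autotuning.
import torch
import torch.nn as nn

# Restrict inductor's GEMM autotune candidates to the ATen backend; with no
# Triton candidates to benchmark, the costly autotuning sweep is skipped.
torch._inductor.config.max_autotune_gemm_backends = "ATEN"

# Placeholder model/shapes for illustration; requires a CUDA device.
model = nn.Linear(4096, 4096, bias=False).cuda().half()
compiled = torch.compile(model, mode="max-autotune-no-cudagraphs")
out = compiled(torch.randn(1, 4096, device="cuda", dtype=torch.half))
```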
Test Plan:
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --json-model-override-args '{"architectures": ["TorchNativeLlamaForCausalLM"]}' --enable-torch-compile --torchao-config fp8dq-per_row
Reviewers:
Subscribers:
Tasks:
Tags: