
Turn off autotune for scaled mm for fp8 dynamic quant in torchao #2116

Merged · 2 commits · Nov 21, 2024

Conversation

jerryzh168 (Contributor) commented on Nov 21, 2024

Summary:
We currently run autotuning under torch.compile (max-autotune-no-cudagraphs), which triggers Triton kernel autotuning for fp8 dynamic quant. This increases compilation time significantly while giving little benefit for Llama models, so we turn it off for now.

Test Plan:
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --json-model-override-args '{"architectures": ["TorchNativeLlamaForCausalLM"]}' --enable-torch-compile --torchao-config fp8dq-per_row

Reviewers:

Subscribers:

Tasks:

Tags:
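For context, a minimal sketch of the kind of switch involved. It is an assumption that the change goes through an inductor GEMM-backend flag rather than a torchao-internal knob, but `max_autotune_gemm_backends` is a real inductor config option and illustrates how Triton GEMM autotuning can be skipped while keeping the max-autotune compile mode:

```python
# Minimal sketch, not necessarily the exact change in this PR: restrict
# inductor's GEMM autotuning to the ATen backend so the scaled-mm path does
# not go through Triton kernel autotuning, while still compiling with
# mode="max-autotune-no-cudagraphs".
import torch
import torch.nn as nn

# Real inductor option (the default includes TRITON); whether this PR flips
# this flag or a torchao-side switch is an assumption.
torch._inductor.config.max_autotune_gemm_backends = "ATEN"

model = nn.Linear(4096, 4096, bias=False).cuda().half()
compiled = torch.compile(model, mode="max-autotune-no-cudagraphs")

x = torch.randn(1, 4096, device="cuda", dtype=torch.half)
out = compiled(x)  # GEMM is compiled without Triton autotune candidates
```

With Triton candidates excluded, compilation avoids the per-shape autotuning sweep, which is where the extra compile time reported in the summary comes from.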

jerryzh168 (Contributor, Author) commented:

Note: this will actually turn off fp8 autotune everywhere, but that should be fine for now; we can revisit if it becomes an issue for other models.

merrymercy merged commit 7f8fcd3 into sgl-project:main on Nov 21, 2024
1 of 12 checks passed