[Bug] Enforce contiguous input for dynamic_scaled_fp8_quant and static_scaled_fp8_quant#21773
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request addresses a RuntimeError that occurs when non-contiguous tensors are passed to the dynamic_scaled_fp8_quant and static_scaled_fp8_quant custom operators. The fix correctly enforces input contiguity by calling .contiguous() on the input tensor before it is passed to the underlying C++ kernels.
The change is a direct and effective solution to the bug described, and it aligns with the existing pattern for other custom operator calls within the same function. The pull request description clearly communicates that this is an intentional, temporary measure to ensure stability, with a more performant, long-term solution tracked in a separate issue. The code is clean, localized, and I found no issues of high or critical severity. The pull request is ready to be merged.
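The fix described above is a one-line layout guard. The sketch below is a hypothetical wrapper (the name `scaled_fp8_quant_safe` and the commented-out kernel call are illustrative, not vLLM's actual API) showing why calling `.contiguous()` before the C++ kernel is both safe and cheap: PyTorch returns the tensor itself when it is already contiguous, so only genuinely strided inputs pay the copy.

```python
import torch

def scaled_fp8_quant_safe(input: torch.Tensor) -> torch.Tensor:
    # Hypothetical wrapper illustrating the fix: the custom FP8 quant
    # kernels assume a dense, contiguous layout, so we materialize one
    # if needed. .contiguous() is a no-op (returns self) when the tensor
    # is already contiguous, so the common path pays no copy cost.
    input = input.contiguous()
    # ... here the real code would call into the C++ kernel, e.g.
    # ops.dynamic_scaled_fp8_quant(output, input, scale)
    return input

x = torch.randn(4, 8)
view = x.t()                       # transposed view: non-contiguous
assert not view.is_contiguous()
out = scaled_fp8_quant_safe(view)  # copy happens only for this path
assert out.is_contiguous()
assert torch.equal(out, view)      # values are unchanged by the copy
```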
mgoin
left a comment
LGTM. I'm curious what the strides actually are. I think the kernels should be able to support non-contiguous inputs as-is if input.stride(-1) == 1
Yeah, I think I will look at #19630 later to see if we can support this case using a scalar fallback.
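The stride condition in the review comment above can be made concrete. This is a sketch of the check (the helper name `needs_copy_for_kernel` is an assumption for illustration): a kernel that only walks elements along the last dimension can tolerate non-contiguous tensors as long as the innermost stride is 1, i.e. each row is dense even if rows are not adjacent in memory.

```python
import torch

def needs_copy_for_kernel(t: torch.Tensor) -> bool:
    # Sketch of the reviewer's observation: if the innermost stride is 1,
    # each row is dense and a row-wise kernel could process the tensor
    # as-is; only tensors with a strided last dimension would need the
    # .contiguous() copy.
    return t.stride(-1) != 1

x = torch.randn(8, 16)
sliced = x[:, :8]                    # column slice: rows dense, not adjacent
assert not sliced.is_contiguous()    # overall layout is non-contiguous...
assert not needs_copy_for_kernel(sliced)  # ...but stride(-1) == 1, kernel-safe
assert needs_copy_for_kernel(x.t())  # transpose: stride(-1) == 16, needs copy
```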
…atic_scaled_fp8_quant` (vllm-project#21773) Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Similar to #19452, we temporarily fix this by enforcing input contiguity. We will work on #19630 later to see whether we can support some non-contiguous input cases.
Test
lm_eval --model vllm --model_args "pretrained=nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8,max_model_len=32768,enable_expert_parallel=True,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto