fix(gguf): Disable bfloat16 for GGUF on Blackwell device #30408
Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Code Review
This pull request aims to fix precision issues with GGUF models on Blackwell GPUs by defaulting to float16. The change correctly disables bfloat16 for GGUF on Blackwell devices. However, I've found a critical issue where the device capability check uses 120 instead of 100 for Blackwell, which would prevent the fix from being applied. I've provided a suggestion to correct this.
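For context on the 100-vs-120 point: Blackwell spans two compute capabilities, SM 10.x for datacenter parts (B100/B200) and SM 12.x for consumer cards like the RTX 5090. A minimal sketch of a check covering both, using only the standard `torch.cuda` API (the helper name here is hypothetical, not the PR's actual code):

```python
import torch

def is_blackwell() -> bool:
    """Heuristic: treat SM 10.x (B100/B200) and SM 12.x (RTX 50-series) as Blackwell."""
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major in (10, 12)
```

The review comment above is concerned that gating on a single major would leave the other Blackwell family unpatched.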
yewentao256
left a comment
Thanks for the work! One minor update before landing.
Hi @kitaekatt, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
yewentao256
left a comment
It would be a little bit unclear for "SM (10.0+)"; let's just remove them since we all know Blackwell.
GGUF dequantization kernels use half precision (fp16) internally via the `dfloat` typedef. On Blackwell GPUs (sm_120), using bfloat16 causes garbage output due to dtype mismatch.

Approach taken (middle ground):
- arg_utils.py: Auto-set dtype to float16 when dtype="auto" for GGUF
- gguf.py: Keep bfloat16 in supported_act_dtypes for explicit override

This defaults to safe behavior while preserving user control. Users on hardware where bfloat16 works can still use --dtype bfloat16 explicitly.

Options considered:
1. Blanket removal of bfloat16 from GGUF - rejected (breaks working configs)
2. Blackwell-specific detection - rejected (maintenance burden, edge cases)
3. Default fp16 + allow explicit bf16 - chosen (simple, safe, preserves choice)

Tested on RTX 5090 (sm_120) with Qwen3-4B-GGUF: 583.8 tok/s

Signed-off-by: Christina <truffle@gmail.com>
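As a rough illustration of the middle-ground approach described above, a minimal sketch with hypothetical names (the real change was in vLLM's arg_utils.py, and a later commit below reverts this part because Gemma2 GGUF cannot run in FP16):

```python
import torch

def resolve_dtype(dtype: str, quantization: str | None) -> "torch.dtype | str":
    """Pick a safe default dtype; explicit user choices pass through untouched."""
    if dtype == "auto" and quantization == "gguf":
        # GGUF dequant kernels compute in fp16 via the `dfloat` typedef,
        # so default to float16 rather than letting bf16 win on newer GPUs.
        return torch.float16
    return dtype  # e.g. an explicit --dtype bfloat16 is still honored
```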
…ity check

Instead of removing bfloat16 support globally, use device capability detection to disable bfloat16 only on SM 120+ devices (Blackwell). This preserves bfloat16 support on older architectures where tests show it works correctly, while preventing precision issues on Blackwell.

Co-Authored-By: Isotr0py <isotr0py@users.noreply.github.com>
Signed-off-by: Christina <truffle@gmail.com>
Per review feedback: the arg_utils.py dtype override breaks Gemma2 GGUF, which doesn't support FP16. The Blackwell-specific bfloat16 restriction in gguf.py's get_supported_act_dtypes() is sufficient; let _resolve_auto_dtype handle dtype selection automatically.

Signed-off-by: Christina <truffle@gmail.com>
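For illustration, a standalone sketch of the shape this lands in; the simplified class below stands in for vLLM's real GGUFConfig, and the exact SM threshold (100 vs 120) was the point raised in review above:

```python
import logging

import torch

logger = logging.getLogger(__name__)

class GGUFConfig:
    def get_supported_act_dtypes(self) -> list[torch.dtype]:
        supported = [torch.float16, torch.bfloat16, torch.float32]
        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability()
            if (major, minor) >= (12, 0):  # Blackwell consumer parts, e.g. RTX 5090
                logger.warning(
                    "GGUF has precision issues with bfloat16 on Blackwell.")
                supported.remove(torch.bfloat16)
        return supported
```

With bfloat16 no longer advertised, _resolve_auto_dtype picks float16 for dtype="auto" without any override in arg_utils.py.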
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Head branch was pushed to by a user without write access
Force-pushed from 8493901 to cb5a036
Hi @kitaekatt, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
I know this PR is already approved, but I just did some isolated testing of this PR, so sharing results.

Tested on Blackwell hardware
GPU: RTX 5090 (SM 12.0, compute capability 120)
Test: Direct validation of `GGUFConfig().get_supported_act_dtypes()`

Before (upstream/main):
>>> GGUFConfig().get_supported_act_dtypes()
[torch.float16, torch.bfloat16, torch.float32]  # bfloat16 included ❌

After (this PR):
>>> GGUFConfig().get_supported_act_dtypes()
WARNING: GGUF has precision issues with bfloat16 on Blackwell.
[torch.float16, torch.float32]  # bfloat16 excluded ✓

Validated that bfloat16 is correctly excluded on Blackwell devices.
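If anyone wants to repeat this check, here is the validation above as a standalone script; the import path is my assumption of where GGUFConfig lives and should be verified against the tree:

```python
import torch
# Path assumed from the PR's reference to gguf.py; verify against your vLLM checkout.
from vllm.model_executor.layers.quantization.gguf import GGUFConfig

dtypes = GGUFConfig().get_supported_act_dtypes()
print(dtypes)
# On a Blackwell GPU, this PR should leave bfloat16 out of the list.
assert torch.bfloat16 not in dtypes, "expected bfloat16 to be excluded on Blackwell"
```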
Remove '(SM 10.0+)' from comment and warning message per reviewer feedback.

yewentao256: "It would be a little bit unclear for SM (10.0+), let's just remove them"

Signed-off-by: Christina Norman <christina@example.com>
Signed-off-by: Christina <truffle@gmail.com>
Force-pushed from cb5a036 to 157a9fe
Fixed DCO (added sign-off) and pre-commit (single-line format).
Cherry-pick the Blackwell dtype fix to ensure GGUF models work correctly on RTX 5090 and other Blackwell GPUs during metadata extraction testing. This fix excludes bfloat16 from supported dtypes on Blackwell devices to avoid precision issues with GGUF dequantization kernels.

Signed-off-by: Christina Norman <christina@example.com>
…t#30408)

Signed-off-by: Christina <truffle@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Christina Norman <christina@example.com>
Co-authored-by: Isotr0py <isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…t#30408)

Signed-off-by: Christina <truffle@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Christina Norman <christina@example.com>
Co-authored-by: Isotr0py <isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
Summary
Fixes incorrect output from GGUF models on Blackwell (SM 120+) GPUs by excluding bfloat16 from GGUF's supported activation dtypes, so that dtype="auto" resolves to float16.
Changes
- gguf.py: exclude torch.bfloat16 from GGUFConfig.get_supported_act_dtypes() on Blackwell (SM 120+) devices and log a warning, letting _resolve_auto_dtype fall back to float16 automatically.
Root Cause
GGUF dequantization kernels compute in half precision (fp16) internally via the `dfloat` typedef; feeding them bfloat16 activations on Blackwell therefore produces garbage output from the dtype mismatch.
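As an illustration of why a fp16/bf16 mismatch corrupts values (both dtypes are two bytes wide, so bits reinterpret silently); this is illustrative only, not the actual kernel path:

```python
import torch

# Reinterpreting fp16 bits as bf16 (same width, different bit layout) scrambles values.
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
print(x.view(torch.bfloat16))  # prints small nonsense values, not 1.0/2.0/3.0
```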
Testing
Tested with multiple GGUF models on an RTX 5090 (SM 12.0), including Qwen3-4B-GGUF at 583.8 tok/s.