Feat/blockwise fp8 quant #1668
base: main
Conversation
- First implementation of the DeepSeek blockwise quantizer (not fully functional yet)
- amax has been updated
- Two more quantization recipes have been added
- A couple of changes here and there to keep things consistent
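For context on what "blockwise" means here, below is a minimal sketch of 128x128 fp8 weight quantization in plain PyTorch. The function name, scale convention, and shapes are illustrative assumptions, not this PR's actual API.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Quantize a 2D weight to fp8 with one scale per (block x block) tile.

    Sketch only: assumes both dimensions are divisible by `block`.
    """
    M, N = w.shape
    assert M % block == 0 and N % block == 0
    # View the weight as a grid of (block x block) tiles.
    tiles = w.reshape(M // block, block, N // block, block)
    # One amax (and therefore one scale) per tile.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX  # dequantization scale per tile
    w_fp8 = (tiles / scale).to(FP8_DTYPE).reshape(M, N)
    return w_fp8, scale.reshape(M // block, N // block)
```

Dequantization multiplies each tile back by its scale before (or fused into) the matmul.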
Could you share what gemm kernel you plan to use in this PR? I think a good first step here is to have a fast gemm. We have an issue tracking this here: #1594
This might be a good place to start: https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/1d044fd82b15f1cedb197a288e50cc96a2c27205/inference/kernel.py#L63
Overall it would be great to be able to support this recipe in torchao. I think having a gemm with compelling performance that supports 128x1 and 128x128 scaling is something we need first, with benchmarks comparing to other recipes such as rowwise scaled, etc.
Relevant PR in SGLang that adds the triton kernels - sgl-project/sglang#2575 (thanks to @HandH1998). I think it makes sense to add this as a starting point to torchao.
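Alongside a fast Triton kernel, a slow reference gemm is useful for numerics checks. Below is a sketch that dequantizes and then runs a plain matmul; the 128x1 activation / 128x128 weight scale layout and the function name are assumptions, not the actual interface of this PR or of the SGLang kernels.

```python
import torch

def blockwise_fp8_gemm_ref(
    x_fp8: torch.Tensor,    # (M, K) fp8 activation, one scale per 1x128 group
    x_scale: torch.Tensor,  # (M, K // block) fp32 dequant scales
    w_fp8: torch.Tensor,    # (N, K) fp8 weight, one scale per 128x128 tile
    w_scale: torch.Tensor,  # (N // block, K // block) fp32 dequant scales
    block: int = 128,
) -> torch.Tensor:
    """Reference blockwise-scaled fp8 gemm: dequantize, then matmul."""
    # Broadcast each activation scale over its 128-wide group along K.
    x = x_fp8.to(torch.float32) * x_scale.repeat_interleave(block, dim=1)
    # Broadcast each weight scale over its 128x128 tile.
    w = w_fp8.to(torch.float32) * w_scale.repeat_interleave(
        block, dim=0
    ).repeat_interleave(block, dim=1)
    # y = x @ w^T, i.e. a standard linear layer without bias.
    return (x @ w.t()).to(torch.bfloat16)
```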
    scale_a = torch.ones(M, 1, device=device)
    scale_b = torch.ones(1, N, device=device)
else:
    assert scaling_granularity == ScalingGranularity.BLOCKWISE, "unsupported"
This file is benchmarking torch._scaled_mm, which does not support blockwise scaling. Is this change intended?
Unintended, but I will rework this PR. There were some details that I had missed when I initially worked on it.
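For context on the benchmark above: torch._scaled_mm takes per-tensor or per-row/per-column scales, as in the snippet, but not blockwise ones. A rough sketch of a rowwise-scaled call on a recent PyTorch build with an fp8-capable GPU; the exact version requirements and argument defaults here are assumptions:

```python
import torch

M, K, N = 1024, 2048, 4096
device = "cuda"

a = torch.randn(M, K, device=device).to(torch.float8_e4m3fn)
# The second operand is expected in column-major layout.
b = torch.randn(N, K, device=device).to(torch.float8_e4m3fn).t()

# Rowwise scaling: one fp32 scale per row of `a` and per column of `b`.
scale_a = torch.rand(M, 1, device=device) + 0.5
scale_b = torch.rand(1, N, device=device) + 0.5

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                       out_dtype=torch.bfloat16)
```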
@Degnel, thanks for adding the gemm, this is great to see. How do you feel about splitting the gemm + gemm benchmarks into a separate PR which we can quickly land in the prototype folder to get some performance data? We are happy to run some benchmarks for you on H100s. I would like to see how the gemm performance stacks up to help us understand whether the overall workflow should go in the prototype folder or whether we should add it to float8.
I also have a high-level question on the type of scaling this PR implements, given how the block size is specified. In https://arxiv.org/html/2412.19437v1, Section 3.3.2, the report specifies that activations are tiled 128x1 and weights are blocked 128x128, so I just wanted to check whether this PR is trying to implement the gemm from the paper as written or making a modification.
Hi @vkuzo, I was just getting back to work on this issue. I also felt like it would be quicker to open a new PR with the gemm + gemm benchmarks, and add the integrations later. For now, I have rented an A100, and it seems like W4A8 is both faster and more precise. That is why I don't think it is relevant to put the code into float8 (I've put it in prototype on my local repository).
At the time, I thought about making it simpler for the first version, but the current gemm that I have does support 128x1 scaling (for activations) and 128x128 scaling (for weights).
I have memory leak issues for now. Once those are resolved, I will make a new PR. |
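The 128x1 activation scaling mentioned above is the counterpart of the 128x128 weight scaling. A minimal sketch of per-(1x128)-group activation quantization, with assumed names and shapes:

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def quantize_activation_1x128(x: torch.Tensor, block: int = 128):
    """Quantize a (M, K) activation with one scale per 1x128 group along K.

    Sketch only: assumes K is divisible by `block`.
    """
    M, K = x.shape
    groups = x.reshape(M, K // block, block)
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX                  # (M, K // block, 1)
    x_fp8 = (groups / scale).to(FP8_DTYPE).reshape(M, K)
    return x_fp8, scale.squeeze(-1)         # scales: (M, K // block)
```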
Are you interested in training, inference, or both? W4A8 is more for inference.
Only inference for now, but I agree it would be interesting to add a training benchmark.
The new PR containing only the gemm and the benchmark is available at #1763 |
Feat: Implementation of the DeepSeek blockwise quantization for fp8 tensors
WARNING: The code has been tested on the following files:
pytest test/float8/test_base.py
pytest test/float8/test_compile.py
pytest test/float8/test_numerics_integration.py
However, tests have not been performed on the following files due to limitations (Triton is unavailable on Windows and I don't own an NVIDIA GPU):
./test/float8/test_fsdp.sh
./test/float8/test_dtensor.sh
python test/float8/test_fsdp2/test_fsdp2.py