
Feat: Implementation of the DeepSeek blockwise quantization for fp8 tensors #1763

Open · wants to merge 6 commits into main

Conversation

@Degnel commented Feb 22, 2025

This PR is the first step towards addressing issue #1594. It includes the following implementations:

  • fp8 triton gemm for blockwise quantization
  • quant, dequant and linear utilities
  • time & precision benchmarks
  • basic tests

Once the code is validated, it would be great to benchmark it on an H100.
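For readers skimming the thread, a minimal pure-PyTorch sketch of the blockwise scheme (one fp32 scale per 128×128 weight block, payload stored in fp8) may help. It is a reference illustration only; the function names and scale layout below are assumptions, not this PR's actual API.

```python
# Reference sketch of blockwise fp8 weight quantization, assuming 128x128 blocks
# with one fp32 scale each. Not the PR's Triton kernels; names are illustrative.
import torch

def blockwise_fp8_weight_quant_ref(w, block_size=128, dtype=torch.float8_e4m3fn):
    M, N = w.shape
    assert M % block_size == 0 and N % block_size == 0
    fp8_max = torch.finfo(dtype).max
    # Split both dims into blocks: (M//B, B, N//B, B).
    blocks = w.view(M // block_size, block_size, N // block_size, block_size)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / fp8_max                        # one fp32 scale per block
    w_q = (blocks / scale).to(dtype).view(M, N)   # fp8 payload
    return w_q, scale.reshape(M // block_size, N // block_size)

def blockwise_fp8_weight_dequant_ref(w_q, scale, block_size=128):
    M, N = w_q.shape
    blocks = w_q.to(torch.float32).view(M // block_size, block_size,
                                        N // block_size, block_size)
    return (blocks * scale[:, None, :, None]).view(M, N)
```

A blockwise gemm then rescales each output tile by the product of the corresponding activation and weight scales; the sketch above only covers the quant/dequant side.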

Degnel and others added 2 commits February 22, 2025 14:13
- fp8 triton gemm
- quant, dequant and linear utils
- time & precision benchmarks
- basic tests

pytorch-bot bot commented Feb 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1763

Note: Links to docs will display an error until the docs builds have been completed.

❌ 11 New Failures

As of commit 8d68d45 with merge base 25ddb77:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Feb 22, 2025
@danielvegamyhre (Contributor) commented:

Thanks for your work on this! I'll take a closer look next week.

cc @vkuzo @drisspg

@Degnel (Author) commented Feb 25, 2025

Thanks for running the tests. I have two questions regarding the errors:

  • Where should I add Triton to allow the tests to run successfully without introducing unnecessary dependencies in dev-requirements.txt?
  • Does torchao provide any utility to check the available FP8 types for each GPU architecture?

@danielvegamyhre (Contributor) commented Feb 27, 2025

> Thanks for running the tests. I have two questions regarding the errors:
>
>   • Where should I add Triton to allow the tests to run successfully without introducing unnecessary dependencies in dev-requirements.txt?

Can you clarify what you mean? Are tests failing in CI due to a missing triton installation? That shouldn't be happening, please share the link/logs if so.

>   • Does torchao provide any utility to check the available FP8 types for each GPU architecture?

We just use helpers which skip tests if GPU architecture is not at least SM 89:

def is_sm_at_least_89():

You can find examples in the float8 tests (example).
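For reference, the skip pattern suggested here usually looks like the sketch below; the torchao.utils import path and the test name are assumptions based on how the float8 tests are organized.

```python
# Sketch of the suggested skip pattern; the import path and test name are assumptions.
import pytest
import torch

from torchao.utils import is_sm_at_least_89  # helper referenced above

@pytest.mark.skipif(
    not torch.cuda.is_available() or not is_sm_at_least_89(),
    reason="Requires CUDA and an fp8-capable GPU (SM 8.9+)",
)
def test_blockwise_fp8_gemm():
    ...  # build fp8 inputs and compare against a high-precision reference
```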

@danielvegamyhre self-assigned this Feb 27, 2025
@danielvegamyhre self-requested a review February 27, 2025 17:45
@Degnel (Author) commented Feb 28, 2025

> Can you clarify what you mean? Are tests failing in CI due to a missing triton installation? That shouldn't be happening, please share the link/logs if so.

Indeed, they are. It looks like only the CPU runs are failing. I presume that bitsandbytes might not install triton when no GPU is available (I might be missing something there). Here is an instance of a failing log:

https://github.com/pytorch/ao/actions/runs/13484452669/job/37730985419?pr=1763#step:14:1276

> We just use helpers which skip tests if GPU architecture is not at least SM 89:
>
> def is_sm_at_least_89():
>
> You can find examples in the float8 tests (example).

Thank you for the hint, I've locally updated the code accordingly 👍

- removing triton dependency
- cleaning adaptive dtype
Inline review on the following snippet:

W_q, W_s = fp8_blockwise_weight_quant(W, block_size, dtype)
output_blockwise = blockwise_fp8_gemm(A_q, A_s, W_q, W_s)

quantize_(lin, int8_dynamic_activation_int4_weight())
A reviewer (Contributor) commented:

why is int8_dynamic_activation_int4_weight being used here?

@Degnel (Author) replied:

Thanks for noticing it. I was aiming for a static W4A8 quantization and I overlooked that it was dynamic. I will try to address this within the week.
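For context, a conceptual sketch of the distinction being discussed (plain PyTorch, not the torchao API): dynamic activation quantization derives the scale from each incoming batch at runtime, while static quantization reuses a scale fixed during calibration.

```python
# Conceptual sketch (not the torchao API) of dynamic vs. static int8 activation quantization.
import torch

def quantize_int8_dynamic(x):
    # Dynamic: the scale is recomputed from the live activation on every forward pass.
    scale = x.abs().amax().clamp(min=1e-12) / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale

def quantize_int8_static(x, scale):
    # Static: the scale was computed offline on calibration data and is reused at inference.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
```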

@danielvegamyhre (Contributor) commented:

> Can you clarify what you mean? Are tests failing in CI due to a missing triton installation? That shouldn't be happening, please share the link/logs if so.

Also @Degnel you should skip tests requiring triton if CUDA is not available.

@danielvegamyhre (Contributor) commented:

@Degnel thanks for your work on this. I ran the tests, and it looks like your blockwise fp8 gemm test is failing due to quantization error.
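For anyone reproducing this, a precision check of this kind typically compares the blockwise gemm output against a high-precision matmul. A sketch follows; blockwise_fp8_gemm and the quant helpers follow the snippet quoted earlier in the thread, and the tolerance value is an assumption.

```python
# Sketch of a typical precision check for the quantized gemm; the tolerance value
# is an assumption, and the gemm/quant names follow the snippet quoted above.
import torch

def relative_error(out, ref):
    # L2 relative error between the quantized result and the reference matmul.
    return ((out.float() - ref.float()).norm() / ref.float().norm()).item()

# Typical usage inside the failing test:
#   ref = A.float() @ W.float().t()
#   out = blockwise_fp8_gemm(A_q, A_s, W_q, W_s)
#   assert relative_error(out, ref) < 5e-2   # tolerance is an assumption
```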

Labels: CLA Signed
3 participants