
Support INT4 Dequant onto GPU for Seq INT TBE look up #3584

Open
wants to merge 1 commit into base: main

Conversation

faran928

Summary:
Seq INT4 -> INT4 STBE lookup is supported in the diff stack: https://www.internalfb.com/diff/D61305978.

This diff:

  1. Adds dequantization of INT4 -> INT4 STBE lookup on CUDA for all float types
  2. Extends dequantization of INT4 -> INT4 STBE lookup on CPU to BF16

The main gap is handling the dequantization when the scale and bias for the INT4 quantized tensor are stored at the front of the row, as sketched below. For CPU, we only need to add BF16 dequantization based on the output dtype.

This will enable us to reduce the network overhead to the remote embedding server as well as the D2H data transfer onto the GPU host.

Differential Revision: D68187234

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D68187234


netlify bot commented Jan 17, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: cd85b52
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/678a987b0976080008d5a7c4
😎 Deploy Preview: https://deploy-preview-3584--pytorch-fbgemm-docs.netlify.app

faran928 added a commit to faran928/FBGEMM that referenced this pull request Jan 17, 2025
