Move quant API to quantization README #142
Conversation
README.md (Outdated)
4. Integration with other PyTorch native libraries like torchtune and ExecuTorch
2. [Quantization algorithms](./torchao/quantization) such as dynamic quant, smoothquant, GPTQ that run on CPU/GPU and Mobile.
3. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
4. Integration with other PyTorch native libraries like [torchtune](https://github.com/pytorch/torchtune) and [ExecuTorch](https://github.com/pytorch/executorch)

## Key Features
* Native PyTorch techniques, composable with torch.compile
can you link to the quantization readme from the main page?
This technique works best when the torch._inductor.config.use_mixed_mm option is enabled. This avoids dequantizing the weight tensor before the matmul, instead fusing the dequantization into the matmul, thereby avoiding materialization of a large floating point weight tensor.

## A16W4 WeightOnly Quantization
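For reference, a minimal sketch of the setup described above, assuming the int8 weight-only entry point `change_linear_weights_to_int8_woqtensors` from torchao's quant_api (the exact import path is an assumption):

```python
import torch
import torch.nn as nn
from torchao.quantization import quant_api  # import path is an assumption

# Fuse dequantization into the matmul rather than materializing
# a large floating point weight tensor first (CUDA + inductor path).
torch._inductor.config.use_mixed_mm = True

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda").eval()

# int8 weight-only quantization: weights are stored as int8 and
# dequantized inside the (now fused) matmul at runtime.
quant_api.change_linear_weights_to_int8_woqtensors(model)

model = torch.compile(model, mode="max-autotune")
out = model(torch.randn(16, 1024, device="cuda"))
```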
is it also possible to add an example on how to use GPTQ?
yeah sure
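For context, a rough sketch of what such a GPTQ example might look like; the quantizer name, its arguments, and the calibration-input format below are all assumptions rather than a confirmed torchao signature:

```python
import torch
import torch.nn as nn
# The quantizer name and module path are assumptions, not a confirmed API.
from torchao.quantization.GPTQ import Int4WeightOnlyGPTQQuantizer

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128)).eval()

# GPTQ calibrates against representative inputs; the expected format
# (a list of example batches) is an assumption here.
calibration_inputs = [torch.randn(4, 128) for _ in range(8)]

quantizer = Int4WeightOnlyGPTQQuantizer(groupsize=128)  # arguments assumed
model = quantizer.quantize(model, calibration_inputs)
```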
2. Quantization [algorithms](./torchao/quantization) such as dynamic quant, smoothquant, GPTQ that run on CPU/GPU and Mobile.
3. Sparsity [algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
4. Integration with other PyTorch native libraries like torchtune and ExecuTorch
2. [Quantization algorithms](./torchao/quantization) such as dynamic quant, smoothquant, GPTQ that run on CPU/GPU and Mobile.
@supriyar I was hoping that people could just find the quantization README from here, or do you feel we want to make it more explicit?
I do agree that a getting started section on the main page would be important to keep. So if GPTQ is the algorithm we feel most people would be interested in, let's show only that and then link to a broader set of algorithms as well
we got some comments from torchtune that the community has moved on to other techniques now, so I feel it's fine to keep it on the separate quantization page
Force-pushed from 9346a94 to 64e0035
2. Quantization [algorithms](./torchao/quantization) such as dynamic quant, smoothquant, GPTQ that run on CPU/GPU and Mobile.
3. Sparsity [algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
4. Integration with other PyTorch native libraries like torchtune and ExecuTorch
## Our Goals
@supriyar I also restructured the README a bit, please take a look
2. While these techniques are designed to improve model performance, in some cases the opposite can occur. This is because quantization adds overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization) or faster weight loading (weight-only quantization). If your matmuls are small enough, or your non-quantized perf isn't bottlenecked by weight load time, these techniques may reduce performance.
3. Use the PyTorch nightlies so you can leverage [tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor), which are preferred over older module-swap based methods because they don't modify the graph and are generally more composable and flexible.

## Get Started
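To make the tradeoff in item 2 above concrete, a minimal sketch assuming the `change_linear_weights_to_*` entry points from torchao's quant_api (the import path is an assumption):

```python
import torch
import torch.nn as nn
from torchao.quantization import quant_api  # import path is an assumption

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).to("cuda").eval()

# Compute-bound (large matmuls): dynamic quantization quantizes activations
# at runtime, trading that overhead for faster int8 matmuls.
quant_api.change_linear_weights_to_int8_dqtensors(model)

# Memory-bound (weight loads dominate, e.g. small-batch inference):
# weight-only quantization shrinks weights so they load faster.
# quant_api.change_linear_weights_to_int8_woqtensors(model)

# Both entry points swap Linear weights for tensor subclasses (item 3), so
# the module structure is untouched and the result composes with torch.compile.
model = torch.compile(model)
```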
@msaroufim @supriyar I added a get started section here and linked to the API READMEs
lgtm
* fast start instructions
* Update GETTING-STARTED.md
* Update GETTING-STARTED.md
* Update GETTING-STARTED.md

Co-authored-by: Nikita Shulga <[email protected]>
Summary:
att
Test Plan:
/
Reviewers:
Subscribers:
Tasks:
Tags: