add README.md #155
Conversation
torchao/sparsity/README.md (Outdated)
# Design

Pruning, like quantization, is an accuracy/performance trade-off, where we care not only about the speedup but also about the accuracy degradation of our architecture optimization technique.
Should we rename the folder to pruning?
I think sparsity is the more widely used term, so let's keep it called that. But I'll change Pruning -> Sparsity where it makes sense in the README
torchao/sparsity/README.md (Outdated)
@@ -0,0 +1,664 @@
# torchao sparsity

Sparsity is the technique of removing parameters from a neural network in order to reduce its memory overhead or latency. By carefully choosing the elements that are removed, one can achieve significant reduction in memory overhead and latency, while paying a reasonably low or no price in terms of model quality (accuracy / f1).
I'd call pruning the "technique of removing parameters from a neural network in order to reduce its memory overhead or latency"
Sparsity, like quantization, is an accuracy/performance trade-off, where we care not only about the speedup but also about the accuracy degradation of our architecture optimization technique.

In quantization, the theoretical performance gain is generally determined by the data type that we are quantizing to - quantizing from float32 to float16 yields a theoretical 2x speedup. For pruning/sparsity, the analogous variable would be the sparsity level / sparsity pattern. For semi-structured sparsity, the sparsity level is fixed at 50%, so we expect a theoretical 2x improvement. For block-sparse matrices and unstructured sparsity, the speedup is variable and depends on the sparsity level of the tensor.
nit: It's roughly a theoretical speedup of 2x. Or put differently, 2x is a very basic estimate just because of the reduced amount of memory that needs to be processed. In practice it can vary quite a bit. It could even be a lot more, because it allows you to use faster caches, etc.
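To make the 50% figure concrete, here is a small sketch of the 2:4 ("semi-structured") pattern; the `semi_structured_mask` helper is hypothetical, not a library function, and simply keeps the two largest-magnitude values in every group of four:

```python
import torch

def semi_structured_mask(weight: torch.Tensor) -> torch.Tensor:
    # Keep the 2 largest-magnitude elements in every contiguous group of 4.
    groups = weight.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return mask.reshape(weight.shape)

w = torch.randn(128, 128)
mask = semi_structured_mask(w)
print(mask.mean().item())  # 0.5 -> half the elements kept, hence the ~2x theoretical estimate
```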
torchao/sparsity/README.md (Outdated)
One key difference between sparsity and quantization is in how the accuracy degradation is determined: the accuracy degradation of quantization is determined by the scale and zero_point chosen, whereas in pruning the accuracy degradation is determined by the mask. By carefully choosing the specified elements and retraining the network, pruning can achieve negligible accuracy degradation and in some cases even provide a slight accuracy gain. This is an active area of research with no agreed-upon consensus. We expect users to have a target sparsity pattern in mind and to prune to that pattern.
This is a bit biased towards affine quantization and sparsity-aware training specifically for matrix multiplication. There are many other variables that influence accuracy degradation, for example the operation used and the distribution of input values.
The measure, i.e. model quality, is the same between sparsity and quantization. Some of the mitigation techniques are the same too (e.g. quantization- or sparsity-aware training). Where it differs, I'd say, is that sparsity explicitly relies on approximating a sum of numbers (hence the focus on zero), whereas in quantization you avoid allocating bits for unused numerical ranges / unnecessary numerical fidelity.
I'll add some more context to the end of this section, but for this and the comment above, I want to keep this as newbie-friendly as possible, so I think it's okay to have a relatively flawed / forceful analogy to make a point.
I think explaining things in the most faithful way introduces a lot of jargon, which is kind of overwhelming.
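For what it's worth, the analogy can be made concrete with a toy comparison. A rough sketch (arbitrary scale/zero_point and a random mask, purely illustrative, not the torchao API):

```python
import torch

x = torch.randn(4, 4)

# Quantization: the error is determined by the scale / zero_point mapping.
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)
quant_err = (x - xq.dequantize()).abs().mean()

# Pruning: the error is determined by which elements the mask zeroes out.
mask = (torch.rand_like(x) > 0.5).float()
prune_err = (x - x * mask).abs().mean()

print(quant_err.item(), prune_err.item())
```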
torchao/sparsity/README.md (Outdated)
Given a target sparsity pattern, pruning a model can then be thought of as two separate subproblems:

* How can I find a set of sparse weights that satisfies my target sparsity pattern while minimizing the accuracy degradation of my model?
Right, so this first part is what I'd call pruning.
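For readers following along, that first subproblem can be as simple as magnitude pruning. A minimal sketch using `torch.nn.utils.prune` (one option among many, not necessarily what this README settles on):

```python
import torch
from torch.nn.utils import prune

linear = torch.nn.Linear(128, 128)

# Zero out the 50% smallest-magnitude weights. This registers weight_orig and a
# weight_mask buffer on the module and recomputes `weight` as their product.
prune.l1_unstructured(linear, name="weight", amount=0.5)
print((linear.weight == 0).float().mean().item())  # ~0.5

# Typically you would fine-tune here to recover accuracy, then make the pruning permanent.
prune.remove(linear, "weight")
```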
torchao/sparsity/README.md (Outdated)
* How can I find a set of sparse weights that satisfies my target sparsity pattern while minimizing the accuracy degradation of my model?
* How can I accelerate my sparse weights for inference and reduced memory overhead?
And then sparsity can be the task of accelerating pruned weights. It's not always necessary to use a sparse layout or sparse kernel. Sometimes you can prune in ways that obviate these specialized techniques. For example, you can just skip an entire layer.
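And for the second subproblem, a rough sketch of accelerating 2:4-pruned weights with PyTorch's semi-structured sparse support (assumes PyTorch >= 2.1 and an Ampere-or-newer GPU; the pruning step here is the same illustrative 2-of-4 selection as above, not a torchao API):

```python
import torch
from torch.sparse import to_sparse_semi_structured

linear = torch.nn.Linear(128, 128).half().cuda()

# Prune to the 2:4 pattern: keep the 2 largest-magnitude values in every group of 4.
with torch.no_grad():
    groups = linear.weight.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0).reshape(linear.weight.shape)
    linear.weight.mul_(mask)

# Swap in the compressed representation; the linear layer now hits the sparse kernel.
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))

x = torch.randn(32, 128, dtype=torch.float16, device="cuda")
out = linear(x)
```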
Really enjoyed reading this. It is missing code samples, but I believe what you intended to write was closer to a survey of sparsity and the parameter space a library should be in, in which case I believe this does the job well.
torchao/sparsity/README.md (Outdated)
FakeSparsity is a parameterization which simulates unstructured sparsity, where each element has a mask. Because of this, we can use it to simulate any sparsity pattern we want.

The user will then train the prepared model using their own custom code, calling .step() to update the mask if necessary. Once they’ve found a suitable mask, they call `squash_mask()` to fuse the mask into the weights, creating a dense tensor with 0s in the right spots.
Not sure I follow this line. What's `.step()`? Also, this seems to indicate that people need to change their training code, and if so, how?
Is this line also necessary for people only interested in accelerated inference?
I updated with a code sample; that should make this a bit easier to follow.
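For anyone landing here before the docs page, roughly what that flow looks like with `torch.ao.pruning`'s `WeightNormSparsifier` (a sketch; exact module paths, config keys, and defaults may differ across PyTorch versions, and the 10-step loop is illustrative):

```python
import torch
from torch.ao.pruning import WeightNormSparsifier

model = torch.nn.Sequential(torch.nn.Linear(128, 128))

# prepare() attaches a FakeSparsity parametrization (a mask) to each configured weight.
sparsifier = WeightNormSparsifier(sparsity_level=0.5,
                                  sparse_block_shape=(1, 4),
                                  zeros_per_block=2)
sparsifier.prepare(model, config=[{"tensor_fqn": "0.weight"}])

# The user trains with their own loop, calling step() to recompute the masks.
for _ in range(10):
    # ... forward / backward / optimizer.step() would go here ...
    sparsifier.step()

# Once a suitable mask is found, fold it into the weights: the result is a
# dense tensor with zeros in the pruned positions.
sparsifier.squash_mask()
```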
So we have a docs page now: https://github.com/pytorch/ao/blob/jcaip/sparsity-readme/docs/source/sparsity.rst
I think this is conceptually the right long-term home for most of the stuff in the README, but I feel like this information will get lost right now vs. if we put it in the README.
add README.md to sparsity folder