This repository was archived by the owner on Oct 11, 2024. It is now read-only.
[2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config (#188)
Merged
varun-sundar-rabindranath merged 17 commits into rs/vllm-quantization on Apr 16, 2024
Conversation
```python
return param[shard_id], loaded_weight
```

```python
def get_layer_format(self, layer_name: str) -> SmoothQuantFormat:
```
Collaborator (Author)
This should be moved to SmoothQuantConfig I think
Collaborator (Author)
Or at least the core logic should. This API probably still needs to exist, but the logic fits better conceptually with the spirit of the QuantizationConfig.
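A minimal sketch of what moving that dispatch logic onto the config might look like. This is not the actual vLLM code: the `qkv`-suffix rule comes from the PR description, but which format a given suffix maps to, and all class bodies, are assumptions for illustration.

```python
# Hypothetical sketch: the suffix-matching dispatch lives on the config,
# and the linear method's get_layer_format just delegates to it.

class SmoothQuantFormat:
    """Base class for quant/dequant strategies (illustrative stub)."""

class SmoothQuantDynamicPerToken(SmoothQuantFormat):
    pass

class SmoothQuantStaticPerTensor(SmoothQuantFormat):
    pass

class SmoothQuantConfig:
    def get_layer_format(self, layer_name: str) -> SmoothQuantFormat:
        # Core dispatch logic on the config. Which format the "qkv"
        # suffix selects is an assumption made for this sketch.
        if layer_name.endswith("qkv"):
            return SmoothQuantDynamicPerToken()
        return SmoothQuantStaticPerTensor()

class SmoothQuantLinearMethod:
    def __init__(self, config: SmoothQuantConfig):
        self.config = config

    def get_layer_format(self, layer_name: str) -> SmoothQuantFormat:
        # The API survives here, but now just forwards to the config.
        return self.config.get_layer_format(layer_name)
```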
LGTM 👍
varun-sundar-rabindranath pushed a commit that referenced this pull request on Apr 16, 2024:
This merge is a combination of two PRs, #186 and #188: #188 is based on #186 and is squash-merged onto #186.

- #186: [1/N] Rs/vllm quantization - Refactor to minimize llama.py changes
- #188: [2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config

The PR descriptions from both PRs are included here for context; #188's description should be the most relevant, as it is the most recent.

[2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config

Refactored to support non-uniform quantization by adding a new layer of abstraction. Now, `SmoothQuantLinearMethod` can hold a `SmoothQuantFormat`, which implements the details of how to do the quant and dequant operations. There are two `SmoothQuantFormat` classes:

- `SmoothQuantDynamicPerToken`
- `SmoothQuantStaticPerTensor`

We have the following lifecycle:

1. `LinearMethod` is created during `get_model` and has access to the `QuantizationConfig`.
2. `Layer` is initialized and passed a `LinearMethod`.
3. `Layer` calls `LinearMethod.create_weights`, which creates a dictionary of weights and metadata.
4. `Layer` calls `LinearMethod.apply_weights` during inference, passing the dictionary created during `create_weights`.

This PR modifies the `LinearMethod.create_weights` API to receive a `layer_name` argument. The `LinearMethod` then looks in the config to determine which `SmoothQuantFormat` to use for the layer with that `layer_name`. As a result, the `LinearMethod` is responsible for parsing the config from disk and making decisions about what the inference format should look like.
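The lifecycle above can be sketched in plain Python. This is a toy model, not the real vLLM interfaces: only the class and method names come from the description; `ToyConfig` and the list-based "weights" are stand-ins.

```python
class SmoothQuantFormat:
    """Implements the quant/dequant details for one scheme (stub)."""
    def apply_weights(self, weights, x):
        raise NotImplementedError

class SmoothQuantDynamicPerToken(SmoothQuantFormat):
    def apply_weights(self, weights, x):
        # Stand-in for the real quantized matmul kernel.
        return [w * xi for w, xi in zip(weights["weight"], x)]

class ToyConfig:
    """Illustrative QuantizationConfig stand-in."""
    def get_layer_format(self, layer_name):
        return SmoothQuantDynamicPerToken()

class SmoothQuantLinearMethod:
    def __init__(self, config):
        # Created during get_model; holds the QuantizationConfig.
        self.config = config

    def create_weights(self, layer_name, size):
        # New in this PR: layer_name lets the method ask the config
        # which SmoothQuantFormat this particular layer should use.
        fmt = self.config.get_layer_format(layer_name)
        return {"weight": [1.0] * size, "format": fmt}

    def apply_weights(self, weights, x):
        # At inference, the format stored in the dict does the work.
        return weights["format"].apply_weights(weights, x)
```

A layer would call `create_weights` once at construction time and `apply_weights` on every forward pass, with the format riding along in the weights dictionary.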
In this specific case, since the `SmoothQuantConfig` is not very good, we just match on the suffix `qkv` to determine what each layer should use; for a `SparseMLConfig`, we could use a similar structure. In this PR, the `SmoothQuantFormat` is passed in the dictionary returned by `create_weights` and is then used by `apply_weights`.

In summary

I think this is a good overall structure because it:

- (a) allows us to make minimal changes to the existing models
- (b) allows us to make no changes to the model-loading lifecycle (i.e. config / constructor / linear method); critically, this requires having one `LinearMethod` that propagates through the whole model
- (c) encapsulates the non-uniform logic in the `LinearMethod`, allowing us to have a clean interface

For SparseML models

We could imagine the following architecture:

Config

The config is responsible for:

- loading the config from disk
- mapping layer names to a `SparseMLFormat`

```python
class SparseMLConfig:
    def from_dict(): ...

    def get_layer_format(layer_name):
        return SparseMLFormat
```

LinearMethod

The linear method is responsible for:

- the interface between layers and kernels (so the `LinearMethod` is what is used by the model)

```python
class SparseMLLinearMethod:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def create_weights(layer_name, ...):
        # this, e.g., is where non-uniform might be supported
        format = self.sparseml_config.get_layer_format(layer_name)
        weights = format.get_weights()
        # wrapper around the SparseML format
        weights["format"] = format
        return weights

    def apply_weights(x, weights, ...):
        format = weights["format"]
        weights = weights["weights"]
        return format.apply_weights(x, weights)
```

SparseMLFormat

The format is responsible for:

- actual weight creation and the forward pass

```python
class SparseMLFormat:
    def get_weights(sizes):
        # returns a dictionary, e.g.
        return {
            "weights": x,
            "scales": y,
        }

    def apply_weights(weights, x):
        # calls the CUDA kernel
        return output
```

Sample formats:

- `W8A8DynamicPerToken`
- `SparseW8A8StaticPerTensorAsymmetric`
- `W4A8DynamicPerToken`
- ...

[1/N] Rs/vllm quantization - Refactor to minimize llama.py changes (#186)

Paired with @dsikka to refactor `SmoothQuantLinearMethod` to avoid making changes to `llama.py`:

- Removed all the "layer specific" `SmoothQuantLinearMethod`s by making the indexing (splitting QKV into logical shards) generic and explicitly handling the state_dict conversion
- Successfully whittled it down to adding only one LOC to `llama.py`

Many TODOs are left, including:

- We currently have a hardcoded `use_per_token`; we need to use the quant config for this
- We need a way to pass different quant configs to each layer to support non-uniform quantization
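The three-class SparseML sketch above is pseudocode; a self-contained, runnable version of the same architecture might look like the following. Everything here is illustrative: list arithmetic stands in for the CUDA kernel, and config loading from disk is elided.

```python
class SparseMLFormat:
    """Owns actual weight creation and the forward computation."""
    def get_weights(self, size):
        # Returns a dictionary of tensors; lists stand in for tensors.
        return {"weights": [1.0] * size, "scales": [1.0] * size}

    def apply_weights(self, weights, x):
        # Stand-in for the real CUDA kernel.
        return [w * s * xi
                for w, s, xi in zip(weights["weights"], weights["scales"], x)]

class SparseMLConfig:
    """Maps layer names to formats; from_dict/disk loading elided."""
    def get_layer_format(self, layer_name):
        return SparseMLFormat()

class SparseMLLinearMethod:
    """Interface between layers and kernels; used by the model."""
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def create_weights(self, layer_name, size):
        # Per-layer (non-uniform) support hooks in here via the format.
        fmt = self.sparseml_config.get_layer_format(layer_name)
        return {"weights": fmt.get_weights(size), "format": fmt}

    def apply_weights(self, x, weights):
        fmt = weights["format"]
        return fmt.apply_weights(weights["weights"], x)

# Usage: one linear method propagates through the whole model.
method = SparseMLLinearMethod(SparseMLConfig())
w = method.create_weights("model.layers.0.mlp.down_proj", 2)
out = method.apply_weights([3.0, 5.0], w)
```

With unit weights and scales the toy kernel is an identity, which makes the dict plumbing (format stored by `create_weights`, retrieved by `apply_weights`) easy to verify in isolation.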