This repository was archived by the owner on Oct 11, 2024. It is now read-only.

[2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config#188

Merged
varun-sundar-rabindranath merged 17 commits into rs/vllm-quantization from rs/vllm-quantization-config-refactor
Apr 16, 2024

Conversation


@robertgshaw2-redhat commented Apr 13, 2024

Refactored to support nonuniform quantization by adding a new layer of abstraction.

Now, SmoothQuantLinearMethod can hold a SmoothQuantFormat, which implements the details of how to do quant and dequant operations. There are two SmoothQuantFormat classes:

  • SmoothQuantDynamicPerToken
  • SmoothQuantStaticPerTensor

We have the following lifecycle:

  • LinearMethod is created during get_model, has access to QuantizationConfig
  • Layer is initialized and passed a LinearMethod
  • Layer calls LinearMethod.create_weights, which creates a dictionary of weights and metadata
  • Layer calls LinearMethod.apply_weights during inference, passing the dictionary created during create_weights
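
The lifecycle above can be sketched as follows; everything besides the `create_weights` / `apply_weights` method names is a hypothetical simplification, with plain Python lists standing in for real tensors:

```python
# Sketch of the layer/LinearMethod lifecycle described above.
# All names besides create_weights/apply_weights are illustrative.

class LinearMethod:
    def create_weights(self, input_size, output_size):
        # returns a dictionary of weights and metadata
        return {"weight": [[0.0] * input_size for _ in range(output_size)]}

    def apply_weights(self, weights, x):
        # inference-time forward using the dict from create_weights
        w = weights["weight"]
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

class LinearLayer:
    def __init__(self, input_size, output_size, linear_method):
        # layer is initialized and passed a LinearMethod,
        # then calls create_weights once at construction time
        self.linear_method = linear_method
        self.weights = linear_method.create_weights(input_size, output_size)

    def forward(self, x):
        # during inference the layer calls apply_weights,
        # passing the dictionary created by create_weights
        return self.linear_method.apply_weights(self.weights, x)

def get_model():
    # the LinearMethod is created during get_model and shared by all layers
    method = LinearMethod()
    return LinearLayer(4, 2, method)

layer = get_model()
print(layer.forward([1.0, 2.0, 3.0, 4.0]))  # zero-initialized weights -> [0.0, 0.0]
```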

This PR modifies the LinearMethod.create_weights API to receive a layer_name argument. The LinearMethod then looks in the config to determine which SmoothQuantFormat to use for the layer named layer_name.

  • As a result, the LinearMethod is responsible for parsing the config from disk and making decisions about what the inference format should look like. In this specific case, since the SmoothQuantConfig is not very good, we just match on the suffix qkv to determine what each layer should use --> but for SparseMLConfig, we could use a similar structure

In this PR, the SmoothQuantFormat is passed in the dictionary returned by create_weights and then is used by apply_weights
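
A toy sketch of this dispatch, showing both the qkv-suffix heuristic and the format riding along in the weights dict (class bodies are hypothetical, and which suffix maps to which format is an assumption here):

```python
class SmoothQuantDynamicPerToken:
    pass

class SmoothQuantStaticPerTensor:
    pass

class SmoothQuantLinearMethod:
    def get_layer_format(self, layer_name):
        # the PR's crude heuristic: match on the qkv suffix
        # (which suffix gets which format is an assumption in this sketch)
        if layer_name.endswith("qkv"):
            return SmoothQuantDynamicPerToken()
        return SmoothQuantStaticPerTensor()

    def create_weights(self, layer_name):
        fmt = self.get_layer_format(layer_name)
        weights = {"weight": None}  # placeholder for real tensors/metadata
        weights["format"] = fmt     # format is stashed in the weights dict
        return weights

    def apply_weights(self, weights, x):
        # dispatch to whichever format create_weights stashed
        fmt = weights["format"]
        return type(fmt).__name__  # a real impl would run fmt's kernel on x
```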

In Summary

I think this is a good overall structure because it:

  • (a) allows us to make minimal changes to the existing models
  • (b) allows us to make no changes to the model loading lifecycle (i.e. config / constructor / linear method); critically, this requires having one LinearMethod that propagates through the whole model
  • (c) encapsulates the nonuniform logic in the LinearMethod, giving us a clean interface into the kernels

For SparseML Models

We could imagine the following architecture:

Config

Config is responsible for:

  • loading config from disk
  • mapping layer_names --> SparseMLFormat
class SparseMLConfig:
    @staticmethod
    def from_dict(config_dict):
        # build the config from the dict loaded off disk
        ...

    def get_layer_format(self, layer_name):
        # map a layer_name to the SparseMLFormat it should use
        return SparseMLFormat()

LinearMethod

LinearMethod is responsible for:

  • interface between layers and kernels (so LinearMethod is what is used by the model)
class SparseMLLinearMethod:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def create_weights(self, layer_name, **kwargs):
        # this, e.g., is where nonuniform would be supported
        format = self.sparseml_config.get_layer_format(layer_name)

        weights = format.get_weights(**kwargs)
        weights["format"] = format

        return weights

    # wrapper around the SparseML format
    def apply_weights(self, x, weights, **kwargs):
        format = weights["format"]
        return format.apply_weights(weights, x)

SparseMLFormat

Format is responsible for:

  • actual weight creation and forward
class SparseMLFormat:
    def get_weights(self, sizes):
        # returns a dictionary of weights/metadata, e.g.
        return {
            "weights": ...,
            "scales": ...,
        }

    def apply_weights(self, weights, x):
        # calls the cuda kernel
        ...

Sample Formats:
- W8A8DynamicPerToken
- SparseW8A8StaticPerTensorAsymmetric
- W4A8DynamicPerToken
- ...
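
As an illustration, here is a hypothetical sketch of one such format, W8A8DynamicPerToken, with the CUDA kernel replaced by plain-Python quantize/matmul/dequantize steps; all of the numerics here are illustrative, not the real kernel:

```python
class W8A8DynamicPerToken:
    """Hypothetical format: int8 weights, int8 activations,
    per-token activation scales computed at runtime."""

    def get_weights(self, weight_rows):
        # per-tensor weight scale; quantize float weights to int8
        scale = max(abs(v) for row in weight_rows for v in row) / 127 or 1.0
        q = [[round(v / scale) for v in row] for row in weight_rows]
        return {"weights": q, "scales": scale}

    def apply_weights(self, weights, x_rows):
        w, w_scale = weights["weights"], weights["scales"]
        out = []
        for x in x_rows:  # one "token" per row
            a_scale = max(abs(v) for v in x) / 127 or 1.0  # dynamic per-token scale
            qx = [round(v / a_scale) for v in x]
            # int8 matmul, then dequantize with both scales
            out.append([sum(qw * qa for qw, qa in zip(row, qx)) * w_scale * a_scale
                        for row in w])
        return out
```

The per-token scale is recomputed from each activation row at runtime, which is what makes the format "dynamic" rather than "static".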

@robertgshaw2-redhat marked this pull request as ready for review April 14, 2024 01:20
@robertgshaw2-redhat changed the title from "Rs/vllm quantization Refactor for Nonuniform Quantization" to "[2/N] Rs/vllm quantization - Refactor To Support Non-Uniform via Config" Apr 14, 2024
@robertgshaw2-redhat changed the title from "[2/N] Rs/vllm quantization - Refactor To Support Non-Uniform via Config" to "[2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config" Apr 14, 2024

return param[shard_id], loaded_weight

def get_layer_format(self, layer_name: str) -> SmoothQuantFormat:
Collaborator Author

This should be moved to SmoothQuantConfig I think

Collaborator Author

Or at least the core logic should. This API probably still needs to exist, but the logic fits better conceptually with the spirit of the QuantizationConfig

@varun-sundar-rabindranath

LGTM 👍

@varun-sundar-rabindranath merged commit 1afab71 into rs/vllm-quantization Apr 16, 2024
@varun-sundar-rabindranath deleted the rs/vllm-quantization-config-refactor branch April 16, 2024 17:32
varun-sundar-rabindranath pushed a commit that referenced this pull request Apr 16, 2024
This merge is a combination of 2 PRs - #186 and #188:
- #188 is based on #186, and #188 is squash-merged onto #186.

#186 : [1/N] Rs/vllm quantization - Refactor to minimize llama.py changes 
#188 : [2/N] Rs/vllm quantization - Refactor refactor to support non-uniform via config

The PR descriptions from both PRs provide context; #188's description (reproduced at the top of this page) is the most relevant, as it is the most recent.

[1/N] Rs/vllm quantization - Refactor to minimize llama.py changes #186 

Paired with @dsikka to refactor `SmoothQuantLinearMethod` to avoid
making changes to `llama.py`:
- Removed all the "layer specific" `SmoothQuantLinearMethod` variants by
making the indexing (splitting QKV into logical shards) generic and
explicitly handling state_dict conversion
- Successfully whittled this down to adding only one LOC to `llama.py`

Many todos left, including:
- We currently have `use_per_token` hardcoded; we need to use the quant
config for this
- We need a way to pass a different quant config to each layer to support
nonuniform quantization