Llama 3.1 405B fp4 changes upstreaming from 355_wip #25135
mgoin merged 17 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 798e475 to f9626ee.
This pull request has merge conflicts that must be resolved before it can be merged.
SageMoore left a comment:
Can we get some unit tests for the batched_rotary_embedding kernel?
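For reference, a minimal sketch of what such a unit test could check, in pure PyTorch: apply NeoX-style rotation to a whole batch of tokens at once and compare against a per-token loop. `apply_rope_ref` and the cos/sin cache construction below are illustrative stand-ins, not the actual `batched_rotary_embedding` kernel API.

```python
import torch


def apply_rope_ref(x: torch.Tensor, cos: torch.Tensor,
                   sin: torch.Tensor) -> torch.Tensor:
    # NeoX-style rotation: split the head dim in half and rotate.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)


def test_batched_rope_matches_per_token_reference():
    torch.manual_seed(0)
    num_tokens, num_heads, head_dim, max_pos = 16, 4, 64, 1024
    positions = torch.randint(0, max_pos, (num_tokens,))
    q = torch.randn(num_tokens, num_heads, head_dim)

    # Precompute a cos/sin cache, as rope implementations usually do.
    inv_freq = 1.0 / (10000.0**(torch.arange(0, head_dim, 2).float() /
                                head_dim))
    freqs = torch.outer(torch.arange(max_pos).float(), inv_freq)
    cos_cache, sin_cache = freqs.cos(), freqs.sin()

    # "Batched" path: gather cos/sin for all tokens at once.
    cos = cos_cache[positions].unsqueeze(1)  # [num_tokens, 1, head_dim // 2]
    sin = sin_cache[positions].unsqueeze(1)
    out_batched = apply_rope_ref(q, cos, sin)

    # Reference path: one token at a time.
    out_ref = torch.stack([
        apply_rope_ref(q[i], cos_cache[positions[i]], sin_cache[positions[i]])
        for i in range(num_tokens)
    ])
    torch.testing.assert_close(out_batched, out_ref)
```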
(Resolved review thread on vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py.)
Removed batched rope for now to speed up landing this PR.
fxmarty-amd left a comment:
Great that the CDNA4 MXFP4 GEMM gets upstreamed!
(Two resolved review threads on vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py.)
CC @mgoin
The CI test model-executor-test also fails on main (f552d5e578077574276aa9d83139b91e1d5ae163), which this branch is based on. Please force-merge this PR. Thanks.
mgoin left a comment:
Let's remove the x_quant_scales change to the linear layer
Removed, please look again.
The CI test that failed passes locally:
```python
if self.emulate:
    layer.weight_scale = torch.nn.Parameter(layer.weight_scale.data,
                                            requires_grad=False)
    try:
        from quark.torch.export.nn.modules import realquantizer
        from quark.torch.quantization.config.config import (
            QuantizationSpec)
    except ImportError as err:
        raise ImportError(
            "The package `amd-quark` is required to use AMD Quark "
            "MX-FP4 models. Please install it with `pip install "
            "amd-quark`.") from err

    weight_quant_spec = QuantizationSpec.from_dict(
        self.weight_quant_spec)

    weight_quantizer = realquantizer.get_real_quantizer(
        qspec=weight_quant_spec,
        quantizer=None,
        real_quantized=True,
        reorder=False,
        float_dtype=self.out_dtype,
        scale_shape=layer.weight_scale.shape,
        zero_point_shape=None,
    )
    weight_quantizer.scale.data = layer.weight_scale.data

    layer.weight = torch.nn.Parameter(
        weight_quantizer(layer.weight.data).to(self.out_dtype),
        requires_grad=False,
    )
    layer.weight_scale = None

    # This call is necessary to release the scales memory.
    torch.cuda.empty_cache()
```
I insist that this is unnecessary (https://github.com/vllm-project/vllm/pull/25135/files#r2378191214); unfortunately, I was not able to reopen the thread that was closed.
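For context, a minimal pure-PyTorch sketch of what this emulation path effectively computes: MXFP4 groups every 32 weights into a block that shares one scale, so dequantization expands each block by its scale into a higher-precision dtype before the GEMM. The function name and tensor layout here are illustrative assumptions, not Quark's actual API.

```python
import torch

BLOCK = 32  # MX format block size: 32 elements share one scale


def dequant_mxfp4_blocks(values: torch.Tensor,
                         scales: torch.Tensor,
                         out_dtype: torch.dtype = torch.bfloat16
                         ) -> torch.Tensor:
    """Illustrative block-wise dequantization.

    values: [rows, cols] FP4 codes already decoded to float
    scales: [rows, cols // BLOCK], one shared scale per block
    """
    rows, cols = values.shape
    blocks = values.view(rows, cols // BLOCK, BLOCK)
    dequant = blocks * scales.unsqueeze(-1)  # broadcast scale over block
    return dequant.view(rows, cols).to(out_dtype)


# Example: a 4x64 weight with two 32-element blocks per row.
w = torch.randn(4, 64)
s = torch.rand(4, 2) + 0.5
print(dequant_mxfp4_blocks(w, s).shape)  # torch.Size([4, 64])
```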
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Doug Lehr <douglehr@amd.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Perf is the same for upstream tp1 and 355_wip tp1.

Command: run the client benchmark.

Correctness: shows reasonable answers for the command.