Support mixtral moe AWQ quantization. #2725

Merged 6 commits into InternLM:main on Nov 13, 2024

Conversation

AllentDan
Collaborator

No description provided.

@lvhan028 added the enhancement (New feature or request) label on Nov 7, 2024
@lvhan028
Collaborator

lvhan028 commented Nov 8, 2024

@zhulinJulia24 please add mistralai/Mixtral-8x7B-Instruct-v0.1 AWQ quantization to the test cases

@lvhan028 requested review from lvhan028 and pppppM on November 8, 2024 at 03:10
@@ -244,6 +254,9 @@ def quant_weights(model,
if skip_if_contains and skip_if_contains in child_name:
    q_linear = fc
    pack_or_skip = 'skipped'
elif 'block_sparse_moe.gate' in name:  # moe
Collaborator

This adds another skip branch even though we already have the skip_if_contains mechanism. The concern is that as more MoE models are integrated, these skip branches may become increasingly difficult to maintain.
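As a purely illustrative sketch (not code from this PR; SKIP_PATTERNS and should_skip are hypothetical names), the model-specific branches could be folded into a single pattern list next to skip_if_contains, so supporting a new MoE model would only mean extending the list:

# Illustrative sketch only, not part of this PR.
SKIP_PATTERNS = [
    'block_sparse_moe.gate',  # Mixtral router gate
    # gate/router module names of future MoE models would be added here
]

def should_skip(name, child_name, skip_if_contains=None):
    """Return True if this linear layer should be left unquantized."""
    if skip_if_contains and skip_if_contains in child_name:
        return True
    return any(pattern in name for pattern in SKIP_PATTERNS)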

@deepindeed2022
Contributor

While quantizing the Plap-8x13B model with
lmdeploy lite auto_awq models/Plap-8x13B --work-dir models/Plap-8x13B/awq
an IndexError is raised at https://github.com/InternLM/lmdeploy/pull/2725/files#diff-f5acea3cc09c0c379b3f5df99146564676fcf925849a161c480c9289834b0023L108:
File "/opt/lmdeploy/lmdeploy/lite/quantization/activation/observer.py", line 108, in observe
cur_max = cur_val.max(0)[0].cpu()
where cur_val is tensor([], device='cuda:0', dtype=torch.bfloat16).
[Screenshot 2024-11-10 12:15:19]

What could be the cause? Is this a bug in this PR or an issue with the AWQ quantization algorithm itself? Any pointers would be appreciated.
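For context, the failure mode can be reproduced in isolation; the snippet below is only an illustration of the error message above, not code from lmdeploy:

import torch

# An expert that receives no tokens yields an empty activation tensor;
# reducing over dim 0 then fails exactly as reported above.
cur_val = torch.empty(0, 5120)
cur_max = cur_val.max(0)[0].cpu()
# IndexError: max(): Expected reduction dim 0 to have non-zero size.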

@deepindeed2022
Contributor

Testing the mistralai/Mixtral-8x7B-Instruct-v0.1 model works fine. The hidden_size of 8x13B is 5120, while the hidden_size of 8x7B is 4096.

@AllentDan
Collaborator Author

I can't access that model. You can set a breakpoint at line 108 of observer.py; most likely some layer in the model's forward pass produces a tensor with an incorrect shape.

@deepindeed2022
Contributor

I can't access that model. You can set a breakpoint at line 108 of observer.py; most likely some layer in the model's forward pass produces a tensor with an incorrect shape.

  1. The model can be downloaded directly from HF. There is a pop-up dialog; just click through it.
  2. The input of one of the experts is the problem, as shown below:
Layer:model.layers.0.self_attn.q_proj, group:inputs, shape:torch.Size([1, 2048, 5120]), weight:torch.Size([5120, 5120])
Layer:model.layers.0.self_attn.k_proj, group:inputs, shape:torch.Size([1, 2048, 5120]), weight:torch.Size([5120, 5120])
Layer:model.layers.0.self_attn.v_proj, group:inputs, shape:torch.Size([1, 2048, 5120]), weight:torch.Size([5120, 5120])
Layer:model.layers.0.self_attn.o_proj, group:inputs, shape:torch.Size([1, 2048, 5120]), weight:torch.Size([5120, 5120])
Layer:model.layers.0.block_sparse_moe.gate, group:inputs, shape:torch.Size([2048, 5120]), weight:torch.Size([8, 5120])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(605)forward()
-> current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
(Pdb) ll
604  	    def forward(self, hidden_states):
605 B->	        current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
606  	        current_hidden_states = self.w2(current_hidden_states)
607  	        return current_hidden_states
(Pdb) p hidden_states.shape
torch.Size([12, 5120])
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.0.w1, group:inputs, shape:torch.Size([12, 5120]), weight:torch.Size([13824, 5120])
Layer:model.layers.0.block_sparse_moe.experts.0.w3, group:inputs, shape:torch.Size([12, 5120]), weight:torch.Size([13824, 5120])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(606)forward()
-> current_hidden_states = self.w2(current_hidden_states)
(Pdb) p current_hidden_states.shape
torch.Size([12, 13824])
(Pdb) p self.w2.weight.shape
torch.Size([5120, 13824])
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.0.w2, group:inputs, shape:torch.Size([12, 13824]), weight:torch.Size([5120, 13824])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(607)forward()
-> return current_hidden_states
(Pdb) n
--Return--
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(607)forward()->tensor([[ 0.0...torch.float16)
-> return current_hidden_states
(Pdb) n
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(673)forward()
-> final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
(Pdb) c
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(605)forward()
-> current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
(Pdb) p hidden_states.shape
torch.Size([1, 5120])
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.1.w1, group:inputs, shape:torch.Size([1, 5120]), weight:torch.Size([13824, 5120])
Layer:model.layers.0.block_sparse_moe.experts.1.w3, group:inputs, shape:torch.Size([1, 5120]), weight:torch.Size([13824, 5120])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(606)forward()
-> current_hidden_states = self.w2(current_hidden_states)
(Pdb) p current_hidden_states.shape
torch.Size([1, 13824])
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.1.w2, group:inputs, shape:torch.Size([1, 13824]), weight:torch.Size([5120, 13824])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(607)forward()
-> return current_hidden_states
(Pdb) c
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(605)forward()
-> current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.2.w1, group:inputs, shape:torch.Size([0, 5120]), weight:torch.Size([13824, 5120])
IndexError: max(): Expected reduction dim 0 to have non-zero size.
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(605)forward()
-> current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
(Pdb) p hidden_states.shape
torch.Size([0, 5120])

This is probably caused by the model itself: when the expert_mask of some expert in an MoE layer is all zeros, that expert gets an empty input and this error occurs.
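If that diagnosis is right, one possible direction (a sketch only, assuming a dict-based stats store; this is not the actual lmdeploy observer code) is to have the observer skip experts that received no tokens in the current batch:

import torch

def observe_activation(stats, name, x):
    """Record the per-channel activation max, skipping empty expert inputs.
    Illustrative sketch; not the actual observer implementation."""
    cur_val = x.reshape(-1, x.shape[-1])
    if cur_val.numel() == 0:
        # expert_mask was all zeros for this expert in this batch: no tokens
        # were routed to it, so there is nothing to record.
        return
    cur_max = cur_val.max(0)[0].cpu()
    prev = stats.get(name)
    stats[name] = cur_max if prev is None else torch.maximum(prev, cur_max)

With such a guard, an input of shape torch.Size([0, 5120]) would simply be ignored instead of raising the IndexError.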

@anaivebird

Error when AWQ-quantizing the mistralai/Mixtral-8x7B-Instruct-v0.1 model. @AllentDan

Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 11, in <module>
    load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')()
  File "/workspace/xingwu/lmdeploy/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/workspace/xingwu/lmdeploy/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/apis/auto_awq.py", line 91, in auto_awq
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/apis/calibrate.py", line 319, in calibrate
    calib_ctx.calibrate(all_data)
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 238, in calibrate
    _ = model(data.to(self.device))
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1002, in forward
    layer_outputs = decoder_layer(
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 457, in _forward
    auto_scale_block(mod, batch_kwargs[i], self.w_bits,
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 349, in auto_scale_block
    _auto_get_scale(
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 331, in _auto_get_scale
    best_ratio = _search_module_scale(module2inspect, layers, inp.value,
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 278, in _search_module_scale
    org_out = block(x, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'output_router_logits'
export CUDA_VISIBLE_DEVICES=2,3
export HF_MODEL=/workspace/models/mixtral-moe
export WORK_DIR=/workspace/models/mixtral-moe-4bit

lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --search-scale False \
  --work-dir $WORK_DIR

@AllentDan
Collaborator Author

@anaivebird Please remove --search-scale False.
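For reference, that is the invocation from the earlier comment with the --search-scale False flag dropped:

lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --work-dir $WORK_DIR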

@anaivebird

anaivebird commented Nov 12, 2024

Thanks, it works. But removing --search-scale False should be equivalent to adding --search-scale False; in both cases search-scale is False. Why does the program take a different path? @AllentDan

@AllentDan
Collaborator Author

It was a bug.
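The thread does not say what the bug actually was. Purely as an illustration of how this kind of asymmetry can arise (an assumption, not a statement about lmdeploy's actual CLI code): parsing a boolean option with argparse's type=bool makes any non-empty string truthy, so passing 'False' explicitly flips the value while omitting the flag keeps the default:

import argparse

# Illustration only: bool("False") is True, so the explicit flag enables the
# option even though the user wrote False, while omitting it keeps the default.
parser = argparse.ArgumentParser()
parser.add_argument('--search-scale', type=bool, default=False)

print(parser.parse_args([]).search_scale)                           # False
print(parser.parse_args(['--search-scale', 'False']).search_scale)  # True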

@lvhan028
Collaborator

lvhan028 commented Nov 12, 2024

@AllentDan autotest/utils/quantization_utils.py should be updated due to the change to --search-scale

@lvhan028 merged commit adf7c36 into InternLM:main on Nov 13, 2024
5 checks passed
Labels: enhancement (New feature or request)
4 participants