Support mixtral moe AWQ quantization. #2725

Merged 6 commits into InternLM:main on Nov 13, 2024

Conversation

AllentDan
Collaborator

No description provided.

@lvhan028 added the enhancement (New feature or request) label on Nov 7, 2024
@lvhan028
Collaborator

lvhan028 commented Nov 8, 2024

@zhulinJulia24 please add mistralai/Mixtral-8x7B-Instruct-v0.1 AWQ quantization to the test cases

@lvhan028 requested review from lvhan028 and pppppM on November 8, 2024 at 03:10
@@ -244,6 +254,9 @@ def quant_weights(model,
if skip_if_contains and skip_if_contains in child_name:
    q_linear = fc
    pack_or_skip = 'skipped'
elif 'block_sparse_moe.gate' in name:  # moe
Collaborator

This adds another skip branch even though we already have the skip_if_contains mechanism. The concern is that as more MoE models are integrated, these skip branches may become increasingly difficult to maintain.
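As a purely illustrative sketch (not code from this PR; SKIP_PATTERNS and should_skip are hypothetical names), the model-specific branches could be folded into a single pattern list next to skip_if_contains, so supporting a new MoE model would only mean extending the list:

# Illustrative sketch only, not part of this PR.
SKIP_PATTERNS = [
    'block_sparse_moe.gate',  # Mixtral router gate
    # gate/router module names of future MoE models would be added here
]

def should_skip(name, child_name, skip_if_contains=None):
    """Return True if this linear layer should be left unquantized."""
    if skip_if_contains and skip_if_contains in child_name:
        return True
    return any(pattern in name for pattern in SKIP_PATTERNS)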

@deepindeed2022
Contributor

While quantizing the Plap-8x13B model with
lmdeploy lite auto_awq models/Plap-8x13B --work-dir models/Plap-8x13B/awq
an IndexError is raised at https://github.com/InternLM/lmdeploy/pull/2725/files#diff-f5acea3cc09c0c379b3f5df99146564676fcf925849a161c480c9289834b0023L108:
File "/opt/lmdeploy/lmdeploy/lite/quantization/activation/observer.py", line 108, in observe
cur_max = cur_val.max(0)[0].cpu()
where cur_val is tensor([], device='cuda:0', dtype=torch.bfloat16).
[Screenshot 2024-11-10 12:15:19]

What could be the cause? Is this a bug in this PR or an issue with the AWQ quantization algorithm itself? Any pointers would be appreciated.
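For context, the failure mode can be reproduced in isolation; the snippet below is only an illustration of the error message above, not code from lmdeploy:

import torch

# An expert that receives no tokens yields an empty activation tensor;
# reducing over dim 0 then fails exactly as reported above.
cur_val = torch.empty(0, 5120)
cur_max = cur_val.max(0)[0].cpu()
# IndexError: max(): Expected reduction dim 0 to have non-zero size.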

@deepindeed2022
Contributor

Testing the mistralai/Mixtral-8x7B-Instruct-v0.1 model works fine. The hidden_size of 8x13B is 5120, while the hidden_size of 8x7B is 4096.

@AllentDan
Collaborator Author

I can't access that model. You can set a breakpoint at line 108 of observer.py; most likely some layer in the model's forward pass produces a tensor with an incorrect shape.

@deepindeed2022
Contributor

I can't access that model. You can set a breakpoint at line 108 of observer.py; most likely some layer in the model's forward pass produces a tensor with an incorrect shape.

  1. The model can be downloaded directly from HF. There is a pop-up dialog; just click through it.
  2. The input of one of the experts is the problem, as shown below:
Layer:model.layers.0.self_attn.q_proj, group:inputs, shape:torch.Size([1, 2048, 5120]), weight:torch.Size([5120, 5120])
Layer:model.layers.0.self_attn.k_proj, group:inputs, shape:torch.Size([1, 2048, 5120]), weight:torch.Size([5120, 5120])
Layer:model.layers.0.self_attn.v_proj, group:inputs, shape:torch.Size([1, 2048, 5120]), weight:torch.Size([5120, 5120])
Layer:model.layers.0.self_attn.o_proj, group:inputs, shape:torch.Size([1, 2048, 5120]), weight:torch.Size([5120, 5120])
Layer:model.layers.0.block_sparse_moe.gate, group:inputs, shape:torch.Size([2048, 5120]), weight:torch.Size([8, 5120])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(605)forward()
-> current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
(Pdb) ll
604  	    def forward(self, hidden_states):
605 B->	        current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
606  	        current_hidden_states = self.w2(current_hidden_states)
607  	        return current_hidden_states
(Pdb) p hidden_states.shape
torch.Size([12, 5120])
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.0.w1, group:inputs, shape:torch.Size([12, 5120]), weight:torch.Size([13824, 5120])
Layer:model.layers.0.block_sparse_moe.experts.0.w3, group:inputs, shape:torch.Size([12, 5120]), weight:torch.Size([13824, 5120])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(606)forward()
-> current_hidden_states = self.w2(current_hidden_states)
(Pdb) p current_hidden_states.shape
torch.Size([12, 13824])
(Pdb) p self.w2.weight.shape
torch.Size([5120, 13824])
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.0.w2, group:inputs, shape:torch.Size([12, 13824]), weight:torch.Size([5120, 13824])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(607)forward()
-> return current_hidden_states
(Pdb) n
--Return--
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(607)forward()->tensor([[ 0.0...torch.float16)
-> return current_hidden_states
(Pdb) n
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(673)forward()
-> final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
(Pdb) c
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(605)forward()
-> current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
(Pdb) p hidden_states.shape
torch.Size([1, 5120])
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.1.w1, group:inputs, shape:torch.Size([1, 5120]), weight:torch.Size([13824, 5120])
Layer:model.layers.0.block_sparse_moe.experts.1.w3, group:inputs, shape:torch.Size([1, 5120]), weight:torch.Size([13824, 5120])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(606)forward()
-> current_hidden_states = self.w2(current_hidden_states)
(Pdb) p current_hidden_states.shape
torch.Size([1, 13824])
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.1.w2, group:inputs, shape:torch.Size([1, 13824]), weight:torch.Size([5120, 13824])
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(607)forward()
-> return current_hidden_states
(Pdb) c
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(605)forward()
-> current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
(Pdb) n
Layer:model.layers.0.block_sparse_moe.experts.2.w1, group:inputs, shape:torch.Size([0, 5120]), weight:torch.Size([13824, 5120])
IndexError: max(): Expected reduction dim 0 to have non-zero size.
> /opt/py3/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py(605)forward()
-> current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
(Pdb) p hidden_states.shape
torch.Size([0, 5120])

This is probably caused by the model itself: when the expert_mask of some expert in an MoE layer is all zeros, that expert gets an empty input and this error occurs.
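If that diagnosis is right, one possible direction (a sketch only, assuming a dict-based stats store; this is not the actual lmdeploy observer code) is to have the observer skip experts that received no tokens in the current batch:

import torch

def observe_activation(stats, name, x):
    """Record the per-channel activation max, skipping empty expert inputs.
    Illustrative sketch; not the actual observer implementation."""
    cur_val = x.reshape(-1, x.shape[-1])
    if cur_val.numel() == 0:
        # expert_mask was all zeros for this expert in this batch: no tokens
        # were routed to it, so there is nothing to record.
        return
    cur_max = cur_val.max(0)[0].cpu()
    prev = stats.get(name)
    stats[name] = cur_max if prev is None else torch.maximum(prev, cur_max)

With such a guard, an input of shape torch.Size([0, 5120]) would simply be ignored instead of raising the IndexError.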

@anaivebird

Error when AWQ-quantizing the mistralai/Mixtral-8x7B-Instruct-v0.1 model. @AllentDan

Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 11, in <module>
    load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')()
  File "/workspace/xingwu/lmdeploy/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/workspace/xingwu/lmdeploy/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/apis/auto_awq.py", line 91, in auto_awq
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/apis/calibrate.py", line 319, in calibrate
    calib_ctx.calibrate(all_data)
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 238, in calibrate
    _ = model(data.to(self.device))
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1002, in forward
    layer_outputs = decoder_layer(
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 457, in _forward
    auto_scale_block(mod, batch_kwargs[i], self.w_bits,
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 349, in auto_scale_block
    _auto_get_scale(
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 331, in _auto_get_scale
    best_ratio = _search_module_scale(module2inspect, layers, inp.value,
  File "/workspace/xingwu/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 278, in _search_module_scale
    org_out = block(x, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'output_router_logits'
export CUDA_VISIBLE_DEVICES=2,3
export HF_MODEL=/workspace/models/mixtral-moe
export WORK_DIR=/workspace/models/mixtral-moe-4bit

lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --search-scale False \
  --work-dir $WORK_DIR

@AllentDan
Collaborator Author

@anaivebird Please remove --search-scale False.
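For reference, that is the invocation from the earlier comment with the --search-scale False flag dropped:

lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --work-dir $WORK_DIR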

@anaivebird

anaivebird commented Nov 12, 2024

Thanks, it works. But removing --search-scale False should be equivalent to adding --search-scale False; in both cases search-scale is False. Why does the program take a different path? @AllentDan

@AllentDan
Collaborator Author

It was a bug.
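The thread does not say what the bug actually was. Purely as an illustration of how this kind of asymmetry can arise (an assumption, not a statement about lmdeploy's actual CLI code): parsing a boolean option with argparse's type=bool makes any non-empty string truthy, so passing 'False' explicitly flips the value while omitting the flag keeps the default:

import argparse

# Illustration only: bool("False") is True, so the explicit flag enables the
# option even though the user wrote False, while omitting it keeps the default.
parser = argparse.ArgumentParser()
parser.add_argument('--search-scale', type=bool, default=False)

print(parser.parse_args([]).search_scale)                           # False
print(parser.parse_args(['--search-scale', 'False']).search_scale)  # True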

@lvhan028
Collaborator

lvhan028 commented Nov 12, 2024

@AllentDan autotest/utils/quantization_utils.py should be updated due to the change to --search-scale

@lvhan028 merged commit adf7c36 into InternLM:main on Nov 13, 2024
5 checks passed
Labels: enhancement (New feature or request)
4 participants