
MoE support for turbomind #2621

Merged · 52 commits into InternLM:main · Oct 25, 2024

Conversation

lzhangzz (Collaborator)

No description provided.

@lvhan028 added the `enhancement` (New feature or request) label on Oct 18, 2024
@zhulinJulia24 (Collaborator) commented Oct 23, 2024

@lzhangzz

  1. The convert function fails on some models with `unsupported operand type(s)` (a sketch of the failing call follows the log below).
    Reproduce step:
    lmdeploy convert internvl-internlm2 /nvme/qa_test_models/OpenGVLab/InternVL2-2B --dst-path /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B --tp 1
2024-10-22 18:50:28,942 - lmdeploy - WARNING - converter.py:318 - The argument `<model_name>` is deprecated and unused now. It will be removed on 2024.12.31. It was originally used to specify the name of the built-in chat template, but now it is substituted with a clearer parameter `--chat-template`
create workspace in directory /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B
*** splitting layers.0.attention.w_qkv.weight, shape=torch.Size([2048, 4096]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.weight, shape=torch.Size([2048, 2048]), split_dim=0, tp=1
reproduce command convert: CUDA_VISIBLE_DEVICES=5 lmdeploy convert internvl-internlm2 /nvme/qa_test_models/OpenGVLab/InternVL2-2B --dst-path /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B --tp 1

Convert to turbomind format:   0%|          | 0/24 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
    main(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
    tm_model.export()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 192, in export
    if self.model(i, reader):
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 258, in __call__
    m(i, r)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 135, in apply
    e(partial(self._export, self._ffn), partial(r.ffn, i), i)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/parameter.py", line 58, in __call__
    f(i, g('weight'), 'weight', identity)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 113, in _export
    w1 = pad_out_dims(w1, self.inter_size)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 48, in pad_out_dims
    pad = dims - x.size(-1)
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

Convert to turbomind format:   0%|          | 0/24 [00:01<?, ?it/s]

The affected model list is shown in the screenshot below:
[screenshot: model list]
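For reference, the traceback above ends in `pad_out_dims` (module.py:48) being called with `dims=None`, i.e. `inter_size` was never resolved for this model. A minimal sketch of the failing pattern, with a purely illustrative guard (the padding behaviour and the guard are assumptions, not the fix that landed in this PR):

```python
import torch
import torch.nn.functional as F

def pad_out_dims(x: torch.Tensor, dims):
    # Mirrors the call site in module.py:48. When `dims` is None
    # (inter_size missing from the config), `None - int` raises
    # "unsupported operand type(s) for -: 'NoneType' and 'int'".
    if dims is None:
        return x  # hypothetical guard: skip padding instead of crashing
    pad = dims - x.size(-1)
    assert pad >= 0
    return F.pad(x, (0, pad), value=0)  # pad the last (output) dimension
```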

  2. The convert function fails on some models with `Failed to find valid loader for {model_path}` (an illustrative sketch follows the log below).
    Reproduce step:
    lmdeploy convert internlm2 /nvme/qa_test_models/internlm/internlm2_5-7b-chat-inner-gptq --dst-path /nvme/qa_test_models/autotest_model/workspace_internlm/internlm2_5-7b-chat-inner-gptq --model-format gptq --group-size 128 --tp 1
2024-10-22 18:57:21,147 - lmdeploy - WARNING - converter.py:318 - The argument `<model_name>` is deprecated and unused now. It will be removed on 2024.12.31. It was originally used to specify the name of the built-in chat template, but now it is substituted with a clearer parameter `--chat-template`
create workspace in directory /nvme/qa_test_models/autotest_model/workspace_internlm/internlm2_5-7b-chat-inner-gptq
reproduce command convert: CUDA_VISIBLE_DEVICES=5 lmdeploy convert internlm2 /nvme/qa_test_models/internlm/internlm2_5-7b-chat-inner-gptq --dst-path /nvme/qa_test_models/autotest_model/workspace_internlm/internlm2_5-7b-chat-inner-gptq --model-format gptq --group-size 128 --tp 1

Convert to turbomind format:   0%|          | 0/32 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
    main(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
    tm_model.export()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 191, in export
    for i, reader in self.input_model.readers():
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 117, in readers
    loader = create_loader(self.model_path, self.Reader.attn_layer_patten)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/loader.py", line 134, in create_loader
    assert cls is not None, f'Failed to find valid loader for {model_path}'
AssertionError: Failed to find valid loader for /nvme/qa_test_models/internlm/internlm2_5-7b-chat-inner-gptq

Convert to turbomind format:   0%|          | 0/32 [00:00<?, ?it/s]
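As a rough illustration of why the assertion in `loader.py` fires: loader selection appears to be keyed off the checkpoint's file names, so a GPTQ directory saved with non-standard weight filenames may not match any registered loader. The pattern table below is hypothetical and only meant to show the shape of the check:

```python
import os
import re

# Hypothetical filename patterns; the real loader registry in lmdeploy may
# differ. The point is that `create_loader` asserts when no pattern matches
# any file in the checkpoint directory.
LOADER_PATTERNS = {
    'SafetensorsLoader': r'.*\.safetensors$',
    'PytorchLoader': r'pytorch_model.*\.bin$',
}

def guess_loader(model_path: str):
    files = os.listdir(model_path)
    for loader, pattern in LOADER_PATTERNS.items():
        if any(re.match(pattern, f) for f in files):
            return loader
    return None  # -> AssertionError: Failed to find valid loader for {model_path}
```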

@zhulinJulia24 (Collaborator)

@lzhangzz it seems the conversation cannot be stopped on 4-bit models during OC (OpenCompass) evaluation, for example internlm2_5-7b-chat-4bits.
[screenshot]

@zhulinJulia24 (Collaborator)

@lzhangzz convert error (a note on the missing `file_pattern` argument follows the traceback below):

lmdeploy convert internlm /nvme/qa_test_models/internlm/internlm-chat-20b --dst-path /nvme/qa_test_models/autotest_model/workspace_internlm/internlm-chat-20b --tp 2

E       AssertionError: 
E         Convert to turbomind format:   0%|          | 0/60 [00:00<?, ?it/s]Traceback (most recent call last):
E           File "/opt/py3/bin/lmdeploy", line 8, in <module>
E             sys.exit(run())
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
E             args.run(args)
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
E             main(**kwargs)
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
E             tm_model.export()
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 191, in export
E             for i, reader in self.input_model.readers():
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 117, in readers
E             loader = create_loader(self.model_path, self.Reader.attn_layer_patten)
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/loader.py", line 141, in create_loader
E             return PytorchLoader(*args, index_name=WEIGHT_INDEX_NAME)
E         TypeError: PytorchLoader.__init__() missing 1 required positional argument: 'file_pattern'
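From the traceback, the fallback branch at `loader.py:141` constructs `PytorchLoader` without the required `file_pattern` argument. A hedged reconstruction of the mismatch (the signature is inferred from the error message, not copied from the source):

```python
# Reconstructed from the TypeError only; the real signature may differ.
class PytorchLoader:
    def __init__(self, model_path, pattern, file_pattern, index_name=None):
        self.model_path = model_path
        self.pattern = pattern
        self.file_pattern = file_pattern
        self.index_name = index_name

# Failing call shape (loader.py:141):
#   PytorchLoader(*args, index_name=WEIGHT_INDEX_NAME)
# i.e. `file_pattern` is never supplied. A plausible fix is to pass it
# explicitly, e.g. file_pattern=r'pytorch_model.*\.bin' (pattern assumed).
```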

@zhulinJulia24 (Collaborator) commented Oct 23, 2024

@lzhangzz

  1. convert error (a sketch of the failing vocab-size padding follows the log below)
    Reproduce command: CUDA_VISIBLE_DEVICES=5 lmdeploy convert internvl-internlm2 /nvme/qa_test_models/OpenGVLab/InternVL2-2B --dst-path /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B --tp 1

2024-10-23 20:12:02,683 - lmdeploy - WARNING - converter.py:318 - The argument `<model_name>` is deprecated and unused now. It will be removed on 2024.12.31. It was originally used to specify the name of the built-in chat template, but now it is substituted with a clearer parameter `--chat-template`
create workspace in directory /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B
*** splitting layers.0.attention.w_qkv.weight, shape=torch.Size([2048, 4096]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.weight, shape=torch.Size([2048, 2048]), split_dim=0, tp=1
*** splitting layers.0.feed_forward.w1.weight, shape=torch.Size([2048, 8192]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.weight, shape=torch.Size([2048, 8192]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.weight, shape=torch.Size([8192, 2048]), split_dim=0, tp=1
*** 
*** splitting layers.9.feed_forward.w2.weight, shape=torch.Size([8192, 2048]), split_dim=0, tp=1


Convert to turbomind format: 100%|██████████| 24/24 [00:06<00:00,  4.28it/s]
Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
    main(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
    tm_model.export()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 192, in export
    if self.model(i, reader):
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 261, in __call__
    self.misc(i, r)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 237, in apply
    emb = pad_weight(emb)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 228, in pad_weight
    if vocab_size % tp != 0:
TypeError: unsupported operand type(s) for %: 'NoneType' and 'int'

Convert to turbomind format: 100%|██████████| 24/24 [00:06<00:00,  3.77it/s]
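The first traceback above ends in `pad_weight` (module.py:228), where `vocab_size` is `None` for this model, so `vocab_size % tp` fails. A minimal sketch of the failing check with an illustrative fallback (the fallback is an assumption, not the fix that landed):

```python
import torch
import torch.nn.functional as F

def pad_weight(emb: torch.Tensor, vocab_size, tp: int):
    # module.py:228 does `if vocab_size % tp != 0:`; with vocab_size=None this
    # raises "unsupported operand type(s) for %: 'NoneType' and 'int'".
    if vocab_size is None:
        vocab_size = emb.size(0)  # hypothetical fallback: infer from the tensor
    if vocab_size % tp != 0:
        pad = tp - vocab_size % tp
        emb = F.pad(emb, (0, 0, 0, pad))  # pad rows so vocab splits evenly over tp
    return emb
```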
  2. convert error (a note on the missing AWQ weight key follows the log below)
    Reproduce command: CUDA_VISIBLE_DEVICES=0 lmdeploy convert internlm-xcomposer2d5 /nvme/qa_test_models/internlm/internlm-xcomposer2d5-7b-inner-4bits --dst-path /nvme/qa_test_models/autotest_model/workspace_internlm/internlm-xcomposer2d5-7b-inner-4bits --model-format awq --group-size 128 --tp 1

2024-10-23 20:17:34,812 - lmdeploy - WARNING - converter.py:318 - The argument `<model_name>` is deprecated and unused now. It will be removed on 2024.12.31. It was originally used to specify the name of the built-in chat template, but now it is substituted with a clearer parameter `--chat-template`
create workspace in directory /nvme/qa_test_models/autotest_model/workspace_internlm/internlm-xcomposer2d5-7b-inner-4bits
*** splitting tok_embeddings.weight, shape=torch.Size([92544, 4096]), split_dim=1, tp=1
### copying layers.0.attention.w_qkv.lora_a.weight, shape=torch.Size([4096, 256])
*** splitting layers.0.attention.wo.lora_a.weight, shape=torch.Size([4096, 256]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.lora_b.weight, shape=torch.Size([256, 6144]), split_dim=-1, tp=1
### copying layers.0.attention.wo.lora_b.weight, shape=torch.Size([256, 4096])
*** splitting layers.0.attention.w_qkv.qweight, shape=torch.Size([4096, 768]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.qweight, shape=torch.Size([4096, 512]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.scales, shape=torch.Size([32, 6144]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.scales, shape=torch.Size([32, 4096]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.zeros, shape=torch.Size([32, 6144]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.zeros, shape=torch.Size([32, 4096]), split_dim=0, tp=1
### copying layers.0.feed_forward.w1.lora_a.weight, shape=torch.Size([4096, 256])
### copying layers.0.feed_forward.w3.lora_a.weight, shape=torch.Size([4096, 256])
*** splitting layers.0.feed_forward.w2.lora_a.weight, shape=torch.Size([14336, 256]), split_dim=0, tp=1
*** splitting layers.0.feed_forward.w1.lora_b.weight, shape=torch.Size([256, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.lora_b.weight, shape=torch.Size([256, 14336]), split_dim=-1, tp=1
### copying layers.0.feed_forward.w2.lora_b.weight, shape=torch.Size([256, 4096])
*** splitting layers.0.feed_forward.w1.qweight, shape=torch.Size([4096, 1792]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.qweight, shape=torch.Size([4096, 1792]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.qweight, shape=torch.Size([14336, 512]), split_dim=0, tp=1
*** splitting layers.0.feed_forward.w1.scales, shape=torch.Size([32, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.scales, shape=torch.Size([32, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.scales, shape=torch.Size([112, 4096]), split_dim=0, tp=1
*** splitting layers.0.feed_forward.w1.zeros, shape=torch.Size([32, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.zeros, shape=torch.Size([32, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.zeros, shape=torch.Size([112, 4096]), split_dim=0, tp=1


Convert to turbomind format:   0%|          | 0/32 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
    main(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
    tm_model.export()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 192, in export
    if self.model(i, reader):
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 258, in __call__
    m(i, r)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 135, in apply
    e(partial(self._export, self._ffn), partial(r.ffn, i), i)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/parameter.py", line 58, in __call__
    f(i, g('weight'), 'weight', identity)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 96, in ffn
    return self._ffn(i, kind)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/source_model/internlm2.py", line 50, in _ffn
    tensor = self.params[
KeyError: 'model.layers.0.feed_forward.w1.weight'

Convert to turbomind format:   0%|          | 0/32 [00:07<?, ?it/s]
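The KeyError in the second traceback comes from the exporter asking for the fp16 FFN weight (`...feed_forward.w1.weight`) while the AWQ checkpoint only contains the quantized tensors (`qweight`/`scales`/`zeros`, visible in the splitting log above). A purely illustrative reproduction of the lookup mismatch:

```python
# Illustrative only: keys taken from the splitting log above.
params = {
    'model.layers.0.feed_forward.w1.qweight': '<tensor>',
    'model.layers.0.feed_forward.w1.scales': '<tensor>',
    'model.layers.0.feed_forward.w1.zeros': '<tensor>',
}

# internlm2.py:_ffn builds the fp16 key, which an AWQ checkpoint never has:
key = 'model.layers.0.feed_forward.w1.weight'
tensor = params[key]  # KeyError: 'model.layers.0.feed_forward.w1.weight'
```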

@zhulinJulia24 (Collaborator) commented Oct 24, 2024

[screenshot: evaluation results]
The evaluation result of turbomind_internlm2_5_20b_chat_4bits is incorrect.
[screenshot]
The predictions cannot be stopped.
[screenshot]

The reproduce OpenCompass config is:

from copy import deepcopy

from mmengine.config import read_base
from opencompass.models import TurboMindModel, TurboMindModelwithChatTemplate

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
        gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu.mmlu_gen_4d595a import \
        mmlu_datasets  # noqa: F401, E501
    # read models
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import \
        models as lmdeploy_internlm2_5_20b_chat  # noqa: F401, E501
    from opencompass.configs.summarizers.medium import \
        summarizer  # noqa: F401, E501

MAX_SESSION_LEN = 2048
MAX_NEW_TOKENS = 1024

# ===== Configs for internlm/internlm2_5_20b_chat =====
turbomind_internlm2_5_20b_chat = deepcopy(*lmdeploy_internlm2_5_20b_chat)
turbomind_internlm2_5_20b_chat_4bits = deepcopy(*lmdeploy_internlm2_5_20b_chat)

for model in [v for k, v in locals().items() if k.startswith('turbomind_')]:
    model['engine_config']['max_batch_size'] = 128
    model['gen_config']['do_sample'] = False
    model['batch_size'] = 128

for model in [v for k, v in locals().items() if k.endswith('_4bits')]:
    model['engine_config']['model_format'] = 'awq'
    model['abbr'] = model['abbr'] + '_4bits'
    model['path'] = model['path'] + '-inner-4bits'

models = [turbomind_internlm2_5_20b_chat, turbomind_internlm2_5_20b_chat_4bits]
datasets = [*gsm8k_datasets, *mmlu_datasets]
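Assuming the standard OpenCompass runner, a config like the one above is typically launched with `python run.py <path/to/this_config>.py`; the exact runner flags used in the QA pipeline are not shown here.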

@zhulinJulia24 (Collaborator)

@lzhangzz

InternVL models raise an error on V100:

CUDA_VISIBLE_DEVICES=3 lmdeploy chat /nvme/qa_test_models/OpenGVLab/InternVL2-2B --backend turbomind --session-len 4096 --tp 1 --dtype float16


end

简要介绍乌鲁木齐的景点#A brief introduction to Urumqi’s attractions

介绍它的相应美食#please introduce some delicious foods

end

介绍相应美食#please introduce some delicious foods

exit


Error:
Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 279, in chat
    run_chat(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/chat.py", line 116, in main
    tm_model = tm.TurboMind.from_pretrained(model_path,
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 112, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 219, in _from_hf
    model_comm = _tm.AbstractTransformerModel.create_llama_model(
RuntimeError: yaml-cpp: error at line 59, column 15: bad conversion

@zhulinJulia24 (Collaborator) commented Oct 25, 2024

@lzhangzz

  1. Chat with MiniCPM-V-2_6 raises an error:

lmdeploy chat /nvme/qa_test_models/openbmb/MiniCPM-V-2_6 --backend turbomind --session-len 4096 --tp 2

chat_template_config:
ChatTemplateConfig(model_name='minicpmv-2d6', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format=None, tp=2, session_len=4096, max_batch_size=1, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
Convert to turbomind format:   0%|                                                                                                                                                         | 0/28 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/zhulin1/miniconda3/envs/v6/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 279, in chat
    run_chat(**kwargs)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/chat.py", line 116, in main
    tm_model = tm.TurboMind.from_pretrained(model_path,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 112, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 232, in _from_hf
    tm_model.export()
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 207, in export
    if self.model(i, reader):
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 258, in __call__
    m(i, r)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 207, in apply
    e(self._export, partial(r.attn, i), i)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/parameter.py", line 65, in __call__
    f(i, g('bias'), 'bias', identity)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 196, in _export
    self.model.save_split(pack_fn(qkv),
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 187, in save_split
    self.export_weight(split, f'{prefix}.{i}{ext}')
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 150, in export_weight
    for tm_tensor in tm_params[name]:
KeyError: 'layers.0.attention.w_qkv.0.bias'
  2. Chat with the newly quantized GPTQ model raises an error (a note on the missing bias key follows the traceback below):

lmdeploy chat /nvme/qa_test_models/internlm/internlm2_5-7b-chat-inner-gptq --backend turbomind --session-len 4096 --tp 1 --model-format gptq

chat_template_config:
ChatTemplateConfig(model_name='internlm2', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format='gptq', tp=1, session_len=4096, max_batch_size=1, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
Convert to turbomind format:   0%|                                                                                                                                                         | 0/32 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/zhulin1/miniconda3/envs/v6/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 279, in chat
    run_chat(**kwargs)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/chat.py", line 116, in main
    tm_model = tm.TurboMind.from_pretrained(model_path,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 112, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 232, in _from_hf
    tm_model.export()
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 207, in export
    if self.model(i, reader):
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 258, in __call__
    m(i, r)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 207, in apply
    e(self._export, partial(r.attn, i), i)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/parameter.py", line 65, in __call__
    f(i, g('bias'), 'bias', identity)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 196, in _export
    self.model.save_split(pack_fn(qkv),
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 187, in save_split
    self.export_weight(split, f'{prefix}.{i}{ext}')
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 150, in export_weight
    for tm_tensor in tm_params[name]:
KeyError: 'layers.0.attention.w_qkv.0.bias'
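Both failures end in `export_weight` with `KeyError: 'layers.0.attention.w_qkv.0.bias'`: the source reader emits a QKV bias split, but the target parameter table has no entry registered for it. A rough sketch of the mismatch, with a skip-if-unregistered guard that merely illustrates the direction later taken by the "fix minicpm attn bias & ignore un-needed bias" commit listed below:

```python
# Illustrative only; names follow the traceback, behaviour is assumed.
tm_params = {
    'layers.0.attention.w_qkv.0.weight': ['<tm tensor>'],
    # 'layers.0.attention.w_qkv.0.bias' was never registered (attn_bias off)
}

def export_weight(tm_params, split, name):
    if name not in tm_params:
        # Hypothetical guard: ignore tensors the engine did not register
        # instead of raising KeyError.
        return
    for tm_tensor in tm_params[name]:
        tm_tensor.copy_from(split)
```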

@lvhan028 mentioned this pull request on Oct 25, 2024
@lvhan028 requested a review from irexyc on October 25, 2024 07:43
@lvhan028 self-requested a review on October 25, 2024 08:36
@zhulinJulia24 (Collaborator)

@lzhangzz

  1. The response from MiniCPM-Llama3-V-2_5 makes no sense:
    lmdeploy chat /nvme/qa_test_models/openbmb/MiniCPM-Llama3-V-2_5 --backend turbomind --session-len 4096 --tp 1
chat_template_config:
ChatTemplateConfig(model_name='llama3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format=None, tp=1, session_len=4096, max_batch_size=1, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                                                                                                    

double enter to end input >>> 你好 请介绍乌鲁木齐的景点 (Hello, please introduce the attractions of Urumqi)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

你好 请介绍乌鲁木齐的景点<|eot_id|><|start_header_id|>assistant<|end_header_id|>

� a both two of being� a and a have. and� and of bel and two both and bel and both of both only of.. two. believe both both. and have have and believe a and of bel a only being and only and bel believe an and both both both and have. [... the response continues as incoherent filler tokens of the same kind for several hundred more words ...]

@lvhan028 merged commit 962e760 into InternLM:main on Oct 25, 2024
9 checks passed
@zzf2grx commented Nov 4, 2024

Now that MoE is supported, could Multi-LoRA also be supported in turbomind? Many thanks 🙏

AllentDan pushed a commit to AllentDan/lmdeploy that referenced this pull request Nov 13, 2024
* initial moe support

* dynamic grouped gemm

* benchmark

* moe benchmark

* moe sampling

* split-k

* refactor tuning

* simplify

* n-major weight

* add `num` for `MatrixLayout`

* packed rows

* packed cols

* dispatch for packed rows

* w4a16 moe

* refactor model loading

* fix pytorch loader

* refactor

* dispatch w4a16 moe

* fix loader

* add comment

* fix msvc build

* fix msvc build

* fix msvc build

* fix ut

* fix ut

* fix p-lora

* add all support arches

* minor

* fix lint

* fix lint

* fix lint

* fix ut

* bf16 support

* minor

* refactor

* fix lint

* fix ut

* minor

* minor

* minor

* fix inter_size config

* load with non-standard filenames

* fix loader

* fix missing default param

* defer the loading of misc weights for safetensors

* fix conversion

* fix deepseek-vl

* verify model config

* pad inter size by group size and tp

* fix minicpm attn bias & ignore un-needed bias

* set `attn_bias` based on minicpm version