
MoE support for turbomind #2621

Merged · 52 commits into InternLM:main · Oct 25, 2024

Conversation

lzhangzz (Collaborator)

No description provided.

@lvhan028 added the `enhancement` (New feature or request) label on Oct 18, 2024
@zhulinJulia24 (Collaborator) commented Oct 23, 2024

@lzhangzz

  1. The convert function fails on some models with `unsupported operand type(s)` (a sketch of the failing call follows the log below).
    Reproduce step:
    lmdeploy convert internvl-internlm2 /nvme/qa_test_models/OpenGVLab/InternVL2-2B --dst-path /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B --tp 1
2024-10-22 18:50:28,942 - lmdeploy - WARNING - converter.py:318 - The argument `<model_name>` is deprecated and unused now. It will be removed on 2024.12.31. It was originally used to specify the name of the built-in chat template, but now it is substituted with a clearer parameter `--chat-template`
create workspace in directory /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B
*** splitting layers.0.attention.w_qkv.weight, shape=torch.Size([2048, 4096]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.weight, shape=torch.Size([2048, 2048]), split_dim=0, tp=1
reproduce command convert: CUDA_VISIBLE_DEVICES=5 lmdeploy convert internvl-internlm2 /nvme/qa_test_models/OpenGVLab/InternVL2-2B --dst-path /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B --tp 1

Convert to turbomind format:   0%|          | 0/24 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
    main(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
    tm_model.export()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 192, in export
    if self.model(i, reader):
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 258, in __call__
    m(i, r)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 135, in apply
    e(partial(self._export, self._ffn), partial(r.ffn, i), i)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/parameter.py", line 58, in __call__
    f(i, g('weight'), 'weight', identity)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 113, in _export
    w1 = pad_out_dims(w1, self.inter_size)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 48, in pad_out_dims
    pad = dims - x.size(-1)
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

Convert to turbomind format:   0%|          | 0/24 [00:01<?, ?it/s]

The affected model list is shown in the screenshot below:
[screenshot: model list]
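For reference, the traceback above ends in `pad_out_dims` (module.py:48) being called with `dims=None`, i.e. `inter_size` was never resolved for this model. A minimal sketch of the failing pattern, with a purely illustrative guard (the padding behaviour and the guard are assumptions, not the fix that landed in this PR):

```python
import torch
import torch.nn.functional as F

def pad_out_dims(x: torch.Tensor, dims):
    # Mirrors the call site in module.py:48. When `dims` is None
    # (inter_size missing from the config), `None - int` raises
    # "unsupported operand type(s) for -: 'NoneType' and 'int'".
    if dims is None:
        return x  # hypothetical guard: skip padding instead of crashing
    pad = dims - x.size(-1)
    assert pad >= 0
    return F.pad(x, (0, pad), value=0)  # pad the last (output) dimension
```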

  2. The convert function fails on some models with `Failed to find valid loader for {model_path}` (an illustrative sketch follows the log below).
    Reproduce step:
    lmdeploy convert internlm2 /nvme/qa_test_models/internlm/internlm2_5-7b-chat-inner-gptq --dst-path /nvme/qa_test_models/autotest_model/workspace_internlm/internlm2_5-7b-chat-inner-gptq --model-format gptq --group-size 128 --tp 1
2024-10-22 18:57:21,147 - lmdeploy - WARNING - converter.py:318 - The argument `<model_name>` is deprecated and unused now. It will be removed on 2024.12.31. It was originally used to specify the name of the built-in chat template, but now it is substituted with a clearer parameter `--chat-template`
create workspace in directory /nvme/qa_test_models/autotest_model/workspace_internlm/internlm2_5-7b-chat-inner-gptq
reproduce command convert: CUDA_VISIBLE_DEVICES=5 lmdeploy convert internlm2 /nvme/qa_test_models/internlm/internlm2_5-7b-chat-inner-gptq --dst-path /nvme/qa_test_models/autotest_model/workspace_internlm/internlm2_5-7b-chat-inner-gptq --model-format gptq --group-size 128 --tp 1

Convert to turbomind format:   0%|          | 0/32 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
    main(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
    tm_model.export()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 191, in export
    for i, reader in self.input_model.readers():
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 117, in readers
    loader = create_loader(self.model_path, self.Reader.attn_layer_patten)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/loader.py", line 134, in create_loader
    assert cls is not None, f'Failed to find valid loader for {model_path}'
AssertionError: Failed to find valid loader for /nvme/qa_test_models/internlm/internlm2_5-7b-chat-inner-gptq

Convert to turbomind format:   0%|          | 0/32 [00:00<?, ?it/s]
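As a rough illustration of why the assertion in `loader.py` fires: loader selection appears to be keyed off the checkpoint's file names, so a GPTQ directory saved with non-standard weight filenames may not match any registered loader. The pattern table below is hypothetical and only meant to show the shape of the check:

```python
import os
import re

# Hypothetical filename patterns; the real loader registry in lmdeploy may
# differ. The point is that `create_loader` asserts when no pattern matches
# any file in the checkpoint directory.
LOADER_PATTERNS = {
    'SafetensorsLoader': r'.*\.safetensors$',
    'PytorchLoader': r'pytorch_model.*\.bin$',
}

def guess_loader(model_path: str):
    files = os.listdir(model_path)
    for loader, pattern in LOADER_PATTERNS.items():
        if any(re.match(pattern, f) for f in files):
            return loader
    return None  # -> AssertionError: Failed to find valid loader for {model_path}
```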

@zhulinJulia24 (Collaborator)

@lzhangzz it seems the conversation cannot be stopped on 4-bit models during OC (OpenCompass) evaluation, for example internlm2_5-7b-chat-4bits.
[screenshot]

@zhulinJulia24 (Collaborator)

@lzhangzz convert error (a note on the missing `file_pattern` argument follows the traceback below):

lmdeploy convert internlm /nvme/qa_test_models/internlm/internlm-chat-20b --dst-path /nvme/qa_test_models/autotest_model/workspace_internlm/internlm-chat-20b --tp 2

E       AssertionError: 
E         Convert to turbomind format:   0%|          | 0/60 [00:00<?, ?it/s]Traceback (most recent call last):
E           File "/opt/py3/bin/lmdeploy", line 8, in <module>
E             sys.exit(run())
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
E             args.run(args)
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
E             main(**kwargs)
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
E             tm_model.export()
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 191, in export
E             for i, reader in self.input_model.readers():
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 117, in readers
E             loader = create_loader(self.model_path, self.Reader.attn_layer_patten)
E           File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/loader.py", line 141, in create_loader
E             return PytorchLoader(*args, index_name=WEIGHT_INDEX_NAME)
E         TypeError: PytorchLoader.__init__() missing 1 required positional argument: 'file_pattern'
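From the traceback, the fallback branch at `loader.py:141` constructs `PytorchLoader` without the required `file_pattern` argument. A hedged reconstruction of the mismatch (the signature is inferred from the error message, not copied from the source):

```python
# Reconstructed from the TypeError only; the real signature may differ.
class PytorchLoader:
    def __init__(self, model_path, pattern, file_pattern, index_name=None):
        self.model_path = model_path
        self.pattern = pattern
        self.file_pattern = file_pattern
        self.index_name = index_name

# Failing call shape (loader.py:141):
#   PytorchLoader(*args, index_name=WEIGHT_INDEX_NAME)
# i.e. `file_pattern` is never supplied. A plausible fix is to pass it
# explicitly, e.g. file_pattern=r'pytorch_model.*\.bin' (pattern assumed).
```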

@zhulinJulia24 (Collaborator) commented Oct 23, 2024

@lzhangzz

  1. convert error (a sketch of the failing vocab-size padding follows the log below)
    Reproduce command: CUDA_VISIBLE_DEVICES=5 lmdeploy convert internvl-internlm2 /nvme/qa_test_models/OpenGVLab/InternVL2-2B --dst-path /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B --tp 1

2024-10-23 20:12:02,683 - lmdeploy - WARNING - converter.py:318 - The argument `<model_name>` is deprecated and unused now. It will be removed on 2024.12.31. It was originally used to specify the name of the built-in chat template, but now it is substituted with a clearer parameter `--chat-template`
create workspace in directory /nvme/qa_test_models/autotest_model/workspace_OpenGVLab/InternVL2-2B
*** splitting layers.0.attention.w_qkv.weight, shape=torch.Size([2048, 4096]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.weight, shape=torch.Size([2048, 2048]), split_dim=0, tp=1
*** splitting layers.0.feed_forward.w1.weight, shape=torch.Size([2048, 8192]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.weight, shape=torch.Size([2048, 8192]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.weight, shape=torch.Size([8192, 2048]), split_dim=0, tp=1
*** 
*** splitting layers.9.feed_forward.w2.weight, shape=torch.Size([8192, 2048]), split_dim=0, tp=1


Convert to turbomind format: 100%|██████████| 24/24 [00:06<00:00,  4.28it/s]
Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
    main(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
    tm_model.export()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 192, in export
    if self.model(i, reader):
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 261, in __call__
    self.misc(i, r)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 237, in apply
    emb = pad_weight(emb)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 228, in pad_weight
    if vocab_size % tp != 0:
TypeError: unsupported operand type(s) for %: 'NoneType' and 'int'

Convert to turbomind format: 100%|██████████| 24/24 [00:06<00:00,  3.77it/s]
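The first traceback above ends in `pad_weight` (module.py:228), where `vocab_size` is `None` for this model, so `vocab_size % tp` fails. A minimal sketch of the failing check with an illustrative fallback (the fallback is an assumption, not the fix that landed):

```python
import torch
import torch.nn.functional as F

def pad_weight(emb: torch.Tensor, vocab_size, tp: int):
    # module.py:228 does `if vocab_size % tp != 0:`; with vocab_size=None this
    # raises "unsupported operand type(s) for %: 'NoneType' and 'int'".
    if vocab_size is None:
        vocab_size = emb.size(0)  # hypothetical fallback: infer from the tensor
    if vocab_size % tp != 0:
        pad = tp - vocab_size % tp
        emb = F.pad(emb, (0, 0, 0, pad))  # pad rows so vocab splits evenly over tp
    return emb
```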
  2. convert error (a note on the missing AWQ weight key follows the log below)
    Reproduce command: CUDA_VISIBLE_DEVICES=0 lmdeploy convert internlm-xcomposer2d5 /nvme/qa_test_models/internlm/internlm-xcomposer2d5-7b-inner-4bits --dst-path /nvme/qa_test_models/autotest_model/workspace_internlm/internlm-xcomposer2d5-7b-inner-4bits --model-format awq --group-size 128 --tp 1

2024-10-23 20:17:34,812 - lmdeploy - WARNING - converter.py:318 - The argument `<model_name>` is deprecated and unused now. It will be removed on 2024.12.31. It was originally used to specify the name of the built-in chat template, but now it is substituted with a clearer parameter `--chat-template`
create workspace in directory /nvme/qa_test_models/autotest_model/workspace_internlm/internlm-xcomposer2d5-7b-inner-4bits
*** splitting tok_embeddings.weight, shape=torch.Size([92544, 4096]), split_dim=1, tp=1
### copying layers.0.attention.w_qkv.lora_a.weight, shape=torch.Size([4096, 256])
*** splitting layers.0.attention.wo.lora_a.weight, shape=torch.Size([4096, 256]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.lora_b.weight, shape=torch.Size([256, 6144]), split_dim=-1, tp=1
### copying layers.0.attention.wo.lora_b.weight, shape=torch.Size([256, 4096])
*** splitting layers.0.attention.w_qkv.qweight, shape=torch.Size([4096, 768]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.qweight, shape=torch.Size([4096, 512]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.scales, shape=torch.Size([32, 6144]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.scales, shape=torch.Size([32, 4096]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.zeros, shape=torch.Size([32, 6144]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.zeros, shape=torch.Size([32, 4096]), split_dim=0, tp=1
### copying layers.0.feed_forward.w1.lora_a.weight, shape=torch.Size([4096, 256])
### copying layers.0.feed_forward.w3.lora_a.weight, shape=torch.Size([4096, 256])
*** splitting layers.0.feed_forward.w2.lora_a.weight, shape=torch.Size([14336, 256]), split_dim=0, tp=1
*** splitting layers.0.feed_forward.w1.lora_b.weight, shape=torch.Size([256, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.lora_b.weight, shape=torch.Size([256, 14336]), split_dim=-1, tp=1
### copying layers.0.feed_forward.w2.lora_b.weight, shape=torch.Size([256, 4096])
*** splitting layers.0.feed_forward.w1.qweight, shape=torch.Size([4096, 1792]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.qweight, shape=torch.Size([4096, 1792]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.qweight, shape=torch.Size([14336, 512]), split_dim=0, tp=1
*** splitting layers.0.feed_forward.w1.scales, shape=torch.Size([32, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.scales, shape=torch.Size([32, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.scales, shape=torch.Size([112, 4096]), split_dim=0, tp=1
*** splitting layers.0.feed_forward.w1.zeros, shape=torch.Size([32, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.zeros, shape=torch.Size([32, 14336]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.zeros, shape=torch.Size([112, 4096]), split_dim=0, tp=1


Convert to turbomind format:   0%|          | 0/32 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 166, in convert
    main(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/converter.py", line 354, in main
    tm_model.export()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 192, in export
    if self.model(i, reader):
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 258, in __call__
    m(i, r)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 135, in apply
    e(partial(self._export, self._ffn), partial(r.ffn, i), i)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/parameter.py", line 58, in __call__
    f(i, g('weight'), 'weight', identity)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 96, in ffn
    return self._ffn(i, kind)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/source_model/internlm2.py", line 50, in _ffn
    tensor = self.params[
KeyError: 'model.layers.0.feed_forward.w1.weight'

Convert to turbomind format:   0%|          | 0/32 [00:07<?, ?it/s]
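The KeyError in the second traceback comes from the exporter asking for the fp16 FFN weight (`...feed_forward.w1.weight`) while the AWQ checkpoint only contains the quantized tensors (`qweight`/`scales`/`zeros`, visible in the splitting log above). A purely illustrative reproduction of the lookup mismatch:

```python
# Illustrative only: keys taken from the splitting log above.
params = {
    'model.layers.0.feed_forward.w1.qweight': '<tensor>',
    'model.layers.0.feed_forward.w1.scales': '<tensor>',
    'model.layers.0.feed_forward.w1.zeros': '<tensor>',
}

# internlm2.py:_ffn builds the fp16 key, which an AWQ checkpoint never has:
key = 'model.layers.0.feed_forward.w1.weight'
tensor = params[key]  # KeyError: 'model.layers.0.feed_forward.w1.weight'
```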

@zhulinJulia24 (Collaborator) commented Oct 24, 2024

[screenshot: evaluation results]
The evaluation result of turbomind_internlm2_5_20b_chat_4bits is incorrect.
[screenshot]
The predictions cannot be stopped.
[screenshot]

The reproduce OpenCompass config is:

from copy import deepcopy

from mmengine.config import read_base
from opencompass.models import TurboMindModel, TurboMindModelwithChatTemplate

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
        gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu.mmlu_gen_4d595a import \
        mmlu_datasets  # noqa: F401, E501
    # read models
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import \
        models as lmdeploy_internlm2_5_20b_chat  # noqa: F401, E501
    from opencompass.configs.summarizers.medium import \
        summarizer  # noqa: F401, E501

MAX_SESSION_LEN = 2048
MAX_NEW_TOKENS = 1024

# ===== Configs for internlm/internlm2_5_20b_chat =====
turbomind_internlm2_5_20b_chat = deepcopy(*lmdeploy_internlm2_5_20b_chat)
turbomind_internlm2_5_20b_chat_4bits = deepcopy(*lmdeploy_internlm2_5_20b_chat)

for model in [v for k, v in locals().items() if k.startswith('turbomind_')]:
    model['engine_config']['max_batch_size'] = 128
    model['gen_config']['do_sample'] = False
    model['batch_size'] = 128

for model in [v for k, v in locals().items() if k.endswith('_4bits')]:
    model['engine_config']['model_format'] = 'awq'
    model['abbr'] = model['abbr'] + '_4bits'
    model['path'] = model['path'] + '-inner-4bits'

models = [turbomind_internlm2_5_20b_chat, turbomind_internlm2_5_20b_chat_4bits]
datasets = [*gsm8k_datasets, *mmlu_datasets]
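Assuming the standard OpenCompass runner, a config like the one above is typically launched with `python run.py <path/to/this_config>.py`; the exact runner flags used in the QA pipeline are not shown here.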

@zhulinJulia24 (Collaborator)

@lzhangzz

InternVL models raise an error on V100:

CUDA_VISIBLE_DEVICES=3 lmdeploy chat /nvme/qa_test_models/OpenGVLab/InternVL2-2B --backend turbomind --session-len 4096 --tp 1 --dtype float16


end

简要介绍乌鲁木齐的景点#A brief introduction to Urumqi’s attractions

介绍它的相应美食#please introduce some delicious foods

end

介绍相应美食#please introduce some delicious foods

exit


Error:
Traceback (most recent call last):
  File "/opt/py3/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 279, in chat
    run_chat(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/chat.py", line 116, in main
    tm_model = tm.TurboMind.from_pretrained(model_path,
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 112, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 219, in _from_hf
    model_comm = _tm.AbstractTransformerModel.create_llama_model(
RuntimeError: yaml-cpp: error at line 59, column 15: bad conversion

@zhulinJulia24 (Collaborator) commented Oct 25, 2024

@lzhangzz

  1. Chat with MiniCPM-V-2_6 raises an error:

lmdeploy chat /nvme/qa_test_models/openbmb/MiniCPM-V-2_6 --backend turbomind --session-len 4096 --tp 2

chat_template_config:
ChatTemplateConfig(model_name='minicpmv-2d6', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format=None, tp=2, session_len=4096, max_batch_size=1, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
Convert to turbomind format:   0%|                                                                                                                                                         | 0/28 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/zhulin1/miniconda3/envs/v6/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 279, in chat
    run_chat(**kwargs)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/chat.py", line 116, in main
    tm_model = tm.TurboMind.from_pretrained(model_path,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 112, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 232, in _from_hf
    tm_model.export()
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 207, in export
    if self.model(i, reader):
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 258, in __call__
    m(i, r)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 207, in apply
    e(self._export, partial(r.attn, i), i)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/parameter.py", line 65, in __call__
    f(i, g('bias'), 'bias', identity)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 196, in _export
    self.model.save_split(pack_fn(qkv),
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 187, in save_split
    self.export_weight(split, f'{prefix}.{i}{ext}')
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 150, in export_weight
    for tm_tensor in tm_params[name]:
KeyError: 'layers.0.attention.w_qkv.0.bias'
  2. Chat with the newly quantized GPTQ model raises an error (a note on the missing bias key follows the traceback below):

lmdeploy chat /nvme/qa_test_models/internlm/internlm2_5-7b-chat-inner-gptq --backend turbomind --session-len 4096 --tp 1 --model-format gptq

chat_template_config:
ChatTemplateConfig(model_name='internlm2', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format='gptq', tp=1, session_len=4096, max_batch_size=1, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
Convert to turbomind format:   0%|                                                                                                                                                         | 0/32 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/zhulin1/miniconda3/envs/v6/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 279, in chat
    run_chat(**kwargs)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/chat.py", line 116, in main
    tm_model = tm.TurboMind.from_pretrained(model_path,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 302, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 112, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 232, in _from_hf
    tm_model.export()
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 207, in export
    if self.model(i, reader):
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 258, in __call__
    m(i, r)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 72, in __call__
    return self.apply(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 207, in apply
    e(self._export, partial(r.attn, i), i)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/parameter.py", line 65, in __call__
    f(i, g('bias'), 'bias', identity)
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/module.py", line 196, in _export
    self.model.save_split(pack_fn(qkv),
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 187, in save_split
    self.export_weight(split, f'{prefix}.{i}{ext}')
  File "/home/zhulin1/miniconda3/envs/v6/lib/python3.10/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 150, in export_weight
    for tm_tensor in tm_params[name]:
KeyError: 'layers.0.attention.w_qkv.0.bias'
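Both failures end in `export_weight` with `KeyError: 'layers.0.attention.w_qkv.0.bias'`: the source reader emits a QKV bias split, but the target parameter table has no entry registered for it. A rough sketch of the mismatch, with a skip-if-unregistered guard that merely illustrates the direction later taken by the "fix minicpm attn bias & ignore un-needed bias" commit listed below:

```python
# Illustrative only; names follow the traceback, behaviour is assumed.
tm_params = {
    'layers.0.attention.w_qkv.0.weight': ['<tm tensor>'],
    # 'layers.0.attention.w_qkv.0.bias' was never registered (attn_bias off)
}

def export_weight(tm_params, split, name):
    if name not in tm_params:
        # Hypothetical guard: ignore tensors the engine did not register
        # instead of raising KeyError.
        return
    for tm_tensor in tm_params[name]:
        tm_tensor.copy_from(split)
```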

@lvhan028 mentioned this pull request on Oct 25, 2024
@lvhan028 requested a review from irexyc on October 25, 2024 07:43
@lvhan028 self-requested a review on October 25, 2024 08:36
@zhulinJulia24 (Collaborator)

@lzhangzz

  1. The response from MiniCPM-Llama3-V-2_5 makes no sense:
    lmdeploy chat /nvme/qa_test_models/openbmb/MiniCPM-Llama3-V-2_5 --backend turbomind --session-len 4096 --tp 1
chat_template_config:
ChatTemplateConfig(model_name='llama3', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
engine_cfg:
TurbomindEngineConfig(dtype='auto', model_format=None, tp=1, session_len=4096, max_batch_size=1, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                                                                                                    

double enter to end input >>> 你好 请介绍乌鲁木齐的景点 (Hello, please introduce the attractions of Urumqi)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

你好 请介绍乌鲁木齐的景点<|eot_id|><|start_header_id|>assistant<|end_header_id|>

� a both two of being� a and a have. and� and of bel and two both and bel and both of both only of.. two. believe both both. and have have and believe a and of bel a only being and only and bel believe an and both both both and have. [... the response continues as incoherent filler tokens of the same kind for several hundred more words ...]

@lvhan028 merged commit 962e760 into InternLM:main on Oct 25, 2024
9 checks passed
@zzf2grx commented Nov 4, 2024

Now that MoE is supported, could Multi-LoRA also be supported in turbomind? Many thanks 🙏

AllentDan pushed a commit to AllentDan/lmdeploy that referenced this pull request Nov 13, 2024
* initial moe support

* dynamic grouped gemm

* benchmark

* moe benchmark

* moe sampling

* split-k

* refactor tuning

* simplify

* n-major weight

* add `num` for `MatrixLayout`

* packed rows

* packed cols

* dispatch for packed rows

* w4a16 moe

* refactor model loading

* fix pytorch loader

* refactor

* dispatch w4a16 moe

* fix loader

* add comment

* fix msvc build

* fix msvc build

* fix msvc build

* fix ut

* fix ut

* fix p-lora

* add all support arches

* minor

* fix lint

* fix lint

* fix lint

* fix ut

* bf16 support

* minor

* refactor

* fix lint

* fix ut

* minor

* minor

* minor

* fix inter_size config

* load with non-standard filenames

* fix loader

* fix missing default param

* defer the loading of misc weights for safetensors

* fix conversion

* fix deepseek-vl

* verify model config

* pad inter size by group size and tp

* fix minicpm attn bias & ignore un-needed bias

* set `attn_bias` based on minicpm version