You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
{'torch_dtype': torch.float16, 'revision': 'main'}
YuanForCausalLM(
  (model): YuanModel(
    (embed_tokens): Embedding(135040, 2048, padding_idx=77185)
    (layers): ModuleList(
      (0-23): 24 x YuanDecoderLayer(
        (self_attn): YuanAttention(
          (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
          (lf_gate): LocalizedFiltering(
            (conv1): Conv2d(2048, 1024, kernel_size=(2, 1), stride=(1, 1), padding=(1, 0))
            (conv2): Conv2d(1024, 2048, kernel_size=(2, 1), stride=(1, 1), padding=(1, 0))
            (output_layernorm): LlamaRMSNorm()
          )
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): YuanMLP(
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=135040, bias=False)
)
user: yuan2.0是谁开发的?
assistant: Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/github/FastChat/fastchat/serve/cli.py", line 304, in<module>
main(args)
File "/github/FastChat/fastchat/serve/cli.py", line 227, in main
chat_loop(
File "/github/FastChat/fastchat/serve/inference.py", line 532, in chat_loop
outputs = chatio.stream_output(output_stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/github/FastChat/fastchat/serve/cli.py", line 63, in stream_output
for outputs in output_stream:
File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
response = gen.send(request)
^^^^^^^^^^^^^^^^^
File "/github/FastChat/fastchat/serve/inference.py", line 160, in generate_stream
out = model(
^^^^^^
File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 938, in forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 768, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 426, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 358, in forward
raise ValueError(
ValueError: Attention mask should be of size (1, 1, 1, 10), but is torch.Size([1, 1, 1, 1])
Is this related to how the relevant modules in the yuan_hf_model.py script handle things?
The inference script I'm using above is fairly common, so if possible, could this issue be fixed?
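For context on the shapes in the ValueError (a sketch only, not code from yuan_hf_model.py): Llama-style decoder attention masks are expanded to 4D with shape (batch, 1, query_len, key_len). During streaming generation with a KV cache only the newest token is fed as the query, but the mask still has to cover every cached position, which matches the (1, 1, 1, 10) vs (1, 1, 1, 1) gap the traceback reports.

# Shape sketch mirroring the numbers in the traceback above.
import torch

batch, query_len, key_len = 1, 1, 10                    # 1 new token attending over 10 cached positions
expected = torch.zeros(batch, 1, query_len, key_len)    # (1, 1, 1, 10): what the shape check expects
received = torch.zeros(batch, 1, query_len, 1)          # (1, 1, 1, 1): what the model actually got
print(expected.shape, received.shape)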
if self.training or self.reset_position_ids and attention_mask is not None:
    attention_mask, _ = self._prepare_decoder_attention_mask_training(input_ids1, inputs_embeds, self.eod_token, reset_mask_flag, self.reset_attention_mask, self.reset_position_ids)
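The snippet above appears to be the mask-preparation branch in yuan_hf_model.py (the method name _prepare_decoder_attention_mask_training suggests as much; this is an assumption). For reference, Python groups the condition with "and" binding tighter than "or", as the sketch below shows with stand-in booleans; this is only an observation about how the expression parses, not a claim that it is the root cause of the error.

# Sketch only: how Python parses the condition quoted above.
training = False              # model.eval() during inference
reset_position_ids = True     # stand-in value; the real one comes from the Yuan config
mask_is_set = True            # stand-in for `attention_mask is not None`
taken = training or (reset_position_ids and mask_is_set)
assert taken == (training or reset_position_ids and mask_is_set)  # same grouping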
When I run streaming inference with the code described above, following the generate_stream part of fastchat/serve/inference.py:
with use_flash_attention=True, inference works fine;
with use_flash_attention=False, it fails, and the error message is the one shown above.
Is this related to how the relevant modules in the yuan_hf_model.py script handle things?
The inference script I'm using above is fairly common, so if possible, could this issue be fixed?
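Below is a minimal sketch of the kind of incremental decoding FastChat's generate_stream performs, in case it helps with triage. The checkpoint name, the use_flash_attention config field, and the prompt are assumptions taken from this report, and the snippet is a sketch rather than the exact FastChat code path:

# Hypothetical reproduction sketch: prefill + one cached decode step.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "IEITYuan/Yuan2-2B-hf"  # assumption: any Yuan2.0 HF checkpoint
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.use_flash_attention = False   # flag name taken from the report; assumed to live on the config
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True
).eval()

inputs = tokenizer("yuan2.0是谁开发的?", return_tensors="pt")
with torch.no_grad():
    # Prefill: run the whole prompt once and keep the KV cache.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    # Decode step: feed only the new token. The attention mask must still
    # cover all cached positions (prompt length + 1), which is where a
    # (1, 1, 1, key_len) vs (1, 1, 1, 1) mismatch can surface.
    mask = torch.ones(1, inputs["input_ids"].shape[1] + 1, dtype=torch.long)
    out = model(input_ids=next_id, attention_mask=mask,
                past_key_values=past, use_cache=True)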