
[BUG] Qwen-7B-Chat 4-bit quantization following the Auto-GPTQ example fails with: ValueError: Pointer argument (at 2) cannot be accessed from Triton (cpu tensor?) #646

Closed
sunyclj opened this issue Nov 17, 2023 · 16 comments

@sunyclj

sunyclj commented Nov 17, 2023

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched FAQ

Current Behavior

Execution reaches model.quantize(examples) and then fails with: ValueError: Pointer argument (at 2) cannot be accessed from Triton (cpu tensor?). What could be causing this?

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

@lonngxiang

Same error here; the message is:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

@jklj077
Contributor

jklj077 commented Nov 21, 2023

Your model parameters may have been loaded into CPU memory. Two options to try (see the sketch below):

  1. Force all parameters onto the GPU: don't use device_map='auto'; pass something like 'cuda:0' directly, or leave it unset and move the model to the GPU after loading.
  2. Uninstall flash-attn.
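
A minimal sketch of option 1, assuming the standard auto_gptq API used elsewhere in this thread; the model path is a placeholder:

```python
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# Load without device_map='auto' so nothing is sharded or offloaded ...
model = AutoGPTQForCausalLM.from_pretrained(
    "/path/to/Qwen-7B-Chat",  # placeholder path
    quantize_config,
    trust_remote_code=True,
)

# ... then move every parameter onto a single GPU before calling quantize(),
# so no tensor is left behind on the CPU.
if torch.cuda.is_available():
    model.cuda()
```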

@lonngxiang
> Your model parameters may have been loaded into CPU memory. Two options to try:
>
>   1. Force all parameters onto the GPU: don't use device_map='auto'; pass something like 'cuda:0' directly, or leave it unset and move the model to the GPU after loading.
>   2. Uninstall flash-attn.

I tried a single card (cuda:0) but gave up because I don't have enough GPU memory. Can this be run across multiple GPUs?

@lonngxiang
If I quantize on CPU only, I also get an error:

AttributeError: 'QWenLMHeadModel' object has no attribute 'quantize'

@jklj077
Contributor

jklj077 commented Nov 21, 2023

@lonngxiang You still need a GPU; we haven't tried multi-GPU quantization, though. After loading across multiple GPUs (device_map='auto'), if there is enough VRAM there should be no parameters left in CPU memory (if there are, you can print model.hf_device_map to see which modules ended up on the CPU).
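
A quick check along those lines, assuming hf_device_map is exposed on the loaded model as suggested above (it is attached when the model is loaded with device_map='auto'):

```python
# Print where each module ended up; any value of "cpu" or "disk" means that
# module was offloaded and will cause device-mismatch errors during quantization.
print(model.hf_device_map)

offloaded = {name: dev for name, dev in model.hf_device_map.items()
             if dev in ("cpu", "disk")}
if offloaded:
    print("Offloaded modules:", offloaded)
```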

@lonngxiang
> @lonngxiang You still need a GPU; we haven't tried multi-GPU quantization, though. After loading across multiple GPUs (device_map='auto'), if there is enough VRAM there should be no parameters left in CPU memory (if there are, you can print model.hf_device_map to see which modules ended up on the CPU).

Still getting the error:
[screenshot of the error]

Code:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

import torch
device=torch.device("cpu")
#device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "/mnt/data/loong/Qwen-7B-Chat"
quantized_model_dir = "/mnt/data/loong/Qwen-7B-Chat-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, trust_remote_code=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.", return_tensors="pt").to(device)
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False can significantly speed up inference but the perplexity may slightly bad
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, trust_remote_code=True, low_cpu_mem_usage=True, device_map='auto')

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

@sunyclj
Author

sunyclj commented Nov 23, 2023

> quantize_config

Following both of the suggested solutions, I still get the same error: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"

@sunyclj
Author

sunyclj commented Nov 23, 2023

> Your model parameters may have been loaded into CPU memory. Two options to try:
>
>   1. Force all parameters onto the GPU: don't use device_map='auto'; pass something like 'cuda:0' directly, or leave it unset and move the model to the GPU after loading.
>   2. Uninstall flash-attn.

> model.hf_device_map

When loading the model I set device_map="cuda:1"; why does the error still refer to card 0 and the CPU? "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"

@MiyazonoKaori

MiyazonoKaori commented Nov 26, 2023

Modify modeling_qwen.py so that cos/sin are moved to the same device as t:

```python
def apply_rotary_pos_emb(t, freqs):
    cos, sin = freqs
    cos = cos.to(t.device)
    sin = sin.to(t.device)
    if apply_rotary_emb_func is not None and t.is_cuda:
        t_ = t.float()
        cos = cos.squeeze(0).squeeze(1)[:, : cos.shape[-1] // 2]
        sin = sin.squeeze(0).squeeze(1)[:, : sin.shape[-1] // 2]
        output = apply_rotary_emb_func(t_, cos, sin).type_as(t)
        return output
    else:
        rot_dim = freqs[0].shape[-1]
        cos, sin = freqs
        cos = cos.to(t.device)
        sin = sin.to(t.device)
        t_, t_pass_ = t[..., :rot_dim], t[..., rot_dim:]
        t_ = t_.float()
        t_pass_ = t_pass_.float()
        t_ = (t_ * cos) + (rotate_half(t_) * sin)
        return torch.cat((t_, t_pass_), dim=-1).type_as(t)
```

@WingsLong

Same problem here; quantizing a model I fine-tuned myself fails with:
Traceback (most recent call last):
File "/data/aigc/train_models/autogpt_quantize.py", line 36, in
model.quantize(examples)
File "/data/programs/python310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/programs/python310/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 359, in quantize
layer(layer_input, **additional_layer_inputs)
File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-Chat_novel1207/modeling_qwen.py", line 610, in forward
attn_outputs = self.attn(
File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-Chat_novel1207/modeling_qwen.py", line 432, in forward
query = apply_rotary_pos_emb(query, q_pos_emb)
File "/root/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-Chat_novel1207/modeling_qwen.py", line 1345, in apply_rotary_pos_emb
t_rot = (t_rot * cos) + (_rotate_half(t_rot) * sin)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

@lvheyang

Building on the official AutoGPTQ example:

...
model = AutoGPTQForCausalLM.from_pretrained(...)

# Add the following to force the model onto the GPU
if torch.cuda.is_available():
    model.cuda()

# the rest omitted

I tested this approach and it works for me.

@sunyclj
Author

sunyclj commented Dec 14, 2023

> apply_rotary_pos_emb

With that change quantization works, but loading the quantized weights for inference now fails: FileNotFoundError: Could not find model in ./lora_finetune/qwen_7b_chat_q. It looks like some files are missing.

@sunyclj
Author

sunyclj commented Dec 14, 2023

> Building on the official AutoGPTQ example:
>
> ...
> model = AutoGPTQForCausalLM.from_pretrained(...)
>
> # Add the following to force the model onto the GPU
> if torch.cuda.is_available():
>     model.cuda()
>
> # the rest omitted
>
> I tested this approach and it works for me.

After quantizing, does loading the model for inference work for you? In my case some model files seem to be missing; I get: FileNotFoundError: Could not find model in ./lora_finetune/qwen_7b_chat_q
The directory contents are:
[screenshot of the directory listing]

@sunyclj
Author

sunyclj commented Dec 19, 2023

> Building on the official AutoGPTQ example:
>
> ...
> model = AutoGPTQForCausalLM.from_pretrained(...)
>
> # Add the following to force the model onto the GPU
> if torch.cuda.is_available():
>     model.cuda()
>
> # the rest omitted
>
> I tested this approach and it works for me.

> After quantizing, does loading the model for inference work for you? In my case some model files seem to be missing; I get: FileNotFoundError: Could not find model in ./lora_finetune/qwen_7b_chat_q The directory contents are: [screenshot of the directory listing]

Solved: I can now quantize and run inference, but the output quality is worse than the officially released Int4 quantized weights; I haven't figured out why yet.

@jklj077
Contributor

jklj077 commented Dec 21, 2023

The error in apply_rotary_emb most likely means that AutoGPTQ moves tensors between devices itself, but the coverage is incomplete, so some tensors never get moved. See AutoGPTQ/AutoGPTQ#370 (comment).

> but the output quality is worse than the officially released Int4 quantized weights

Please refer to the following:

The effect of the calibration data cannot be ignored: it needs to follow the same distribution as your application scenario, because GPTQ minimizes the quantization error with respect to the calibration set.
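
For illustration, a minimal sketch of building a calibration set from in-domain text rather than a single generic sentence; the corpus path, field name, and sample count are placeholders, and tokenizer is the one loaded in the code above:

```python
import json

# Collect in-domain texts that match the deployment scenario.
calib_texts = []
with open("/path/to/domain_corpus.jsonl") as f:  # placeholder corpus
    for i, line in enumerate(f):
        if i >= 128:  # a modest number of samples; tune as needed
            break
        calib_texts.append(json.loads(line)["text"])  # placeholder field name

# Tokenize into the list-of-dicts format that model.quantize(examples) expects.
examples = [
    tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    for text in calib_texts
]

# model.quantize(examples) then minimizes the layer-wise quantization error
# against these in-domain samples.
```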

@jklj077 jklj077 closed this as completed Dec 21, 2023
@qazzombie

> Building on the official AutoGPTQ example:
>
> ...
> model = AutoGPTQForCausalLM.from_pretrained(...)
>
> # Add the following to force the model onto the GPU
> if torch.cuda.is_available():
>     model.cuda()
>
> # the rest omitted
>
> I tested this approach and it works for me.

> After quantizing, does loading the model for inference work for you? In my case some model files seem to be missing; I get: FileNotFoundError: Could not find model in ./lora_finetune/qwen_7b_chat_q The directory contents are: [screenshot of the directory listing]

Hi, how did you solve the missing-file problem? After quantization I also end up without a model file.
