
Qwen2VL + LoRA: files missing after merging the fine-tuned model (bug) #5749

Closed
1 task done
gxlover0625 opened this issue Oct 19, 2024 · 14 comments · Fixed by #5857
Labels
solved This problem has been already solved

Comments

@gxlover0625

Reminder

  • I have read the README and searched the existing issues.

System Info

[2024-10-19 11:31:58,444] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.10.134-010.ali5000.al8.x86_64-x86_64-with-glibc2.32
  • Python version: 3.10.15
  • PyTorch version: 2.3.1+cu118 (GPU)
  • Transformers version: 4.45.2
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H20
  • DeepSpeed version: 0.15.2

Reproduction

Following the official docs, I ran the Qwen2-VL + LoRA fine-tuning and merge commands:

llamafactory-cli train examples/train_lora/qwen2vl_lora_sft.yaml
llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft.yaml

The training config examples/train_lora/qwen2vl_lora_sft.yaml is as follows:

### model
model_name_or_path: /home/admin/workspace/aop_lab/llm/qwen/Qwen2-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: paco_all  # video: mllm_video_demo
template: qwen2_vl
cutoff_len: 1024
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/admin/workspace/aop_lab/sft_results/paco_all/qwen2_vl-7b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

The merge config examples/merge_lora/qwen2vl_lora_sft.yaml is as follows:

### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
model_name_or_path: /home/admin/workspace/aop_lab/llm/qwen/Qwen2-VL-7B-Instruct
adapter_name_or_path: /home/admin/workspace/aop_lab/sft_results/paco_all/qwen2_vl-7b/lora/sft
template: qwen2_vl
finetuning_type: lora

### export
export_dir: /home/admin/workspace/aop_lab/save_models/paco_all/qwen2_vl_lora_sft
export_size: 2
export_device: cpu
export_legacy_format: false

Expected behavior

Bug

Fine-tuning ran without errors, and merging the LoRA adapter reported no errors either, but comparing the original model files with the merged model files shows that chat_template.json is missing. Details:

  • The original model folder:
    [screenshot of the original model directory listing]
  • The merged model folder:
    [screenshot of the merged model directory listing]
    Comparing the two screenshots, the merged folder is missing the following files:
  • chat_template.json
  • configuration.json
    and has gained the following files:
  • added_tokens.json
  • special_tokens_map.json
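
A quick way to reproduce this comparison programmatically; a minimal sketch, where both directory paths are hypothetical placeholders:

import os

# hypothetical paths: the original checkpoint and the merged export dir
base = set(os.listdir("/path/to/Qwen2-VL-7B-Instruct"))
merged = set(os.listdir("/path/to/merged_model"))

print("missing after merge:", sorted(base - merged))  # expect chat_template.json, configuration.json
print("added after merge:", sorted(merged - base))    # expect added_tokens.json, special_tokens_map.json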

chat_template.json is a critical file; without it the model cannot run inference and raises an error.
The official inference code:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download  # weights are downloaded from ModelScope (see "Others" below)

model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

The error:

ValueError: No chat template is set for this processor. Please either set the `chat_template` attribute, or provide a chat template as an argument. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information

Others

Additional information

  1. Qwen2-VL is the 7B model, downloaded from ModelScope using the following code:
from modelscope import snapshot_download
model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")
@github-actions github-actions bot added the pending This problem is yet to be addressed label Oct 19, 2024
@gxlover0625
Author

Even after copying chat_template.json into the merged model folder, it still errors. I suspect the whole Processor is saved incorrectly; the only thing that works is forcibly setting processor.chat_template.
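
A minimal sketch of this forced assignment, assuming the base model's chat_template.json is available locally (both paths are hypothetical placeholders):

import json
from transformers import AutoProcessor

# processor saved with the merged checkpoint (its chat template is missing)
processor = AutoProcessor.from_pretrained("/path/to/merged_model")

# chat_template.json stores a single "chat_template" key holding the Jinja template
with open("/path/to/Qwen2-VL-7B-Instruct/chat_template.json") as f:
    processor.chat_template = json.load(f)["chat_template"]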

@aliencaocao
Contributor

Same problem here. For now I can only use the processor from the pretrained qwen/Qwen2-VL-7B-Instruct and load the weights from my own local files.
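
A minimal sketch of this workaround (the merged-model path is a hypothetical placeholder):

from modelscope import snapshot_download
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# processor (and its chat template) from the original pretrained checkpoint
base_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")
processor = AutoProcessor.from_pretrained(base_dir)

# weights from the locally merged LoRA checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/path/to/merged_model", torch_dtype="auto", device_map="auto"
)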

@gxlover0625
Author

Same problem here. For now I can only use the processor from the pretrained qwen/Qwen2-VL-7B-Instruct and load the weights from my own local files.

Here is what I do now; it appears to run fine, but I'm not sure whether it is correct, i.e. whether the fine-tuned weights are actually loaded.

processor = AutoProcessor.from_pretrained("sft_dir")
processor.chat_template = "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"

@aliencaocao
Contributor

That works too. Your snippet has nothing to do with the weights; the weights are loaded with AutoModel.

@gxlover0625
Author

That works too. Your snippet has nothing to do with the weights; the weights are loaded with AutoModel.

Got it, thanks.

@aliencaocao
Contributor

I think you'd better reopen this issue; it's clearly a bug.

@gxlover0625 gxlover0625 reopened this Oct 21, 2024
@Wiselnn570

same issue

hiyouga added a commit that referenced this issue Oct 29, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Oct 29, 2024
@Syazvinski

same issue

@gxlover0625
Author

same issue

It has been fixed, and I have tried the new version of LLaMA-Factory for two-stage SFT; there were no problems.

@PangziZhang523

After training with LoRA and merging, running the official inference code (processor = AutoProcessor.from_pretrained(model_path)) still fails with: Exception: data did not match any variant of untagged enum ModelWrapper at line 757371 column 3

@gxlover0625
Author

After training with LoRA and merging, running the official inference code (processor = AutoProcessor.from_pretrained(model_path)) still fails with: Exception: data did not match any variant of untagged enum ModelWrapper at line 757371 column 3

I ran into this problem too. I later set up a fresh environment following the Qwen2-VL installation guide and it seems to have gone away. My transformers version is 4.45.0; I'm not sure whether that will solve your problem.

@PangziZhang523

Thanks for the reply. My transformers version is also 4.45.0.dev0. The official code runs inference on the official model just fine, but switching to the LoRA fine-tuned merged model triggers this problem; the traceback points to processor = AutoProcessor.from_pretrained(model_path). Also, the merged folder does contain chat_template.json.

@PangziZhang523

Is llamafactory-cli webchat examples/inference/qwen2_vl.yaml working normally for you? Why does my page look like this:
[screenshot of the broken webchat page]

@binhoul

binhoul commented Nov 18, 2024

After training with LoRA and merging, running the official inference code (processor = AutoProcessor.from_pretrained(model_path)) still fails with: Exception: data did not match any variant of untagged enum ModelWrapper at line 757371 column 3

Has this been resolved?
