Fine-tune the CogVLM2 model

Chinese README

Note

  • This code only provides a fine-tuning example for the Hugging Face version of the 'cogvlm2-llama3-chat-19B' model.
  • Only examples of fine-tuning the language model are provided.
  • Only LoRA fine-tuning examples are provided.
  • Only examples of fine-tuning the dialogue (chat) model are provided.
  • Fine-tuning with 'zero3' is not currently supported; using it may leave the model unable to be read correctly.

Minimum configuration

  • We have only tested fine-tuning on A100 GPUs with 80GB of memory. Fine-tuning requires at least 73GB of memory per GPU when using 8 GPUs with zero2.
  • Tensor parallelism is not yet supported; that is, the model cannot be split across multiple GPUs for fine-tuning.

Start fine-tuning

  1. Download the dataset and install dependencies

In this demo, developers can use the CogVLM-SFT-311K open-source dataset we provide, or build their own dataset in the same format for fine-tuning.

The data format is as follows:

  • The dataset consists of two folders, images and labels (in CogVLM-SFT-311K they are labels_en and labels_zh, corresponding to English and Chinese labels respectively). In the fine-tuning code, you can change the folder names by modifying these two lines:
self.image_dir = os.path.join(root_dir, 'images')
self.label_dir = os.path.join(root_dir, 'labels_en')  # or 'labels_zh' or 'labels', change as needed
  • Image files are stored in the images folder and the corresponding label files in the labels folder. Image and label file names correspond one to one; images are jpg files and labels are json files.
  • Each label file contains one dialogue. The dialogue consists of two roles, user and assistant, and each turn consists of two fields, role and content, as in the example below (a loader sketch follows the example).
{
   "conversations": [
     {
       "role": "user",
       "content": "What can be inferred about the zebras' behavior and surroundings?"
     },
     {
       "role": "assistant",
       "content": "Based on the image, we can infer that the two zebras are likely seeking relief from the sun's heat, as they are standing side by side under the branches of a thorny tree. This shade-providing tree offers some respite from the sun, possibly during the hottest part of the day. The zebras are in a green field with grass, providing them with an ideal environment to graze and eat while staying near their source of shelter. This shows that the zebras' behavior is influenced by the conditions and available resources in their surroundings. It also highlights that these animals adopt strategies to adapt to the fluctuating conditions of their environment, such as cooperation and seeking shelter, to survive and thrive in their natural habitat."
     }
   ]
}
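For illustration, here is a minimal sketch of a loader for this layout. The class name ConversationDataset is hypothetical; the actual dataset class lives in peft_lora.py, from which the two lines quoted above are taken.

import json
import os

from PIL import Image
from torch.utils.data import Dataset


class ConversationDataset(Dataset):
    """Hypothetical loader for the images/ + labels_en/ layout described above."""

    def __init__(self, root_dir):
        self.image_dir = os.path.join(root_dir, 'images')
        self.label_dir = os.path.join(root_dir, 'labels_en')  # or 'labels_zh' / 'labels'
        # Image and label files share the same stem, e.g. 0001.jpg <-> 0001.json.
        self.stems = sorted(os.path.splitext(f)[0] for f in os.listdir(self.label_dir))

    def __len__(self):
        return len(self.stems)

    def __getitem__(self, idx):
        stem = self.stems[idx]
        image = Image.open(os.path.join(self.image_dir, stem + '.jpg')).convert('RGB')
        with open(os.path.join(self.label_dir, stem + '.json'), encoding='utf-8') as f:
            conversations = json.load(f)['conversations']
        return {'image': image, 'conversations': conversations}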

Before starting fine-tuning, you need to install the relevant dependencies. You also need to install the dependencies listed in basic_demo.

pip install -r requirements.txt

Note: mpi4py may require additional Linux system packages. Please install them according to your system environment.
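On Debian/Ubuntu, for example, mpi4py typically needs an MPI implementation and its headers (package names vary by distribution; this command is illustrative, not taken from the repository):

sudo apt-get install libopenmpi-dev openmpi-bin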

  2. Run the fine-tuning program

We provide a fine-tuning script, peft_lora.py, which supports multiple GPUs on a single machine (including a single GPU). You can start fine-tuning by running the following command:

deepspeed peft_lora.py --ds_config ds_config.yaml
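For orientation, here is a hedged sketch of the kind of settings a zero2 configuration like ds_config.yaml typically expresses, written as the equivalent Python dict that DeepSpeed also accepts. The values are illustrative assumptions; the repository's actual ds_config.yaml is authoritative.

# Illustrative ZeRO-2 settings only; see the repository's ds_config.yaml
# for the configuration the script actually uses.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumed; mirrors batch_size_per_gpus below
    "bf16": {"enabled": True},            # BF16 is recommended below to avoid NaN loss
    "zero_optimization": {
        "stage": 2,                       # ZeRO-2; ZeRO-3 is not supported here
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}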

The output below shows the memory usage during fine-tuning.

Parameter information:

  • max_input_len: 512
  • max_output_len: 512
  • batch_size_per_gpus: 1
  • lora_target: vision_expert_query_key_value

GPU memory usage:

+-------------------------------------------------------------+
| Processes:                                                  |
|  GPU   GI   CI        PID   Type   Process name  GPU Memory |
|        ID   ID                                      Usage   |
|=============================================================|
|    0   N/A  N/A    704914      C   python          72442MiB |
|    1   N/A  N/A    704915      C   python          72538MiB |
|    2   N/A  N/A    704916      C   python          72538MiB |
|    3   N/A  N/A    704917      C   python          72538MiB |
|    4   N/A  N/A    704918      C   python          72538MiB |
|    5   N/A  N/A    704919      C   python          72538MiB |
|    6   N/A  N/A    704920      C   python          72538MiB |
|    7   N/A  N/A    704921      C   python          72442MiB |
+-------------------------------------------------------------+
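The lora_target parameter above names the module that LoRA injects adapters into. As a rough sketch of how such a setup looks with the peft library (the rank, alpha, and dropout values are illustrative assumptions, not values taken from peft_lora.py):

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative hyperparameters only; peft_lora.py defines the actual configuration.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm2-llama3-chat-19B", torch_dtype=torch.bfloat16, trust_remote_code=True
)
lora_config = LoraConfig(
    r=8,                      # assumed LoRA rank
    lora_alpha=32,            # assumed scaling factor
    lora_dropout=0.05,        # assumed dropout
    target_modules=["vision_expert_query_key_value"],  # the lora_target above
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters LoRA actually trains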

While the code is running, loss values are recorded with TensorBoard so that you can visually monitor loss convergence. Launch TensorBoard with:

tensorboard --logdir=output
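For reference, this is the kind of logging involved with torch.utils.tensorboard, writing to the same output directory that the command above reads. A minimal hedged sketch, not the script's actual logging code:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="output")  # same directory as --logdir above
losses = [2.31, 1.87, 1.52]               # placeholder values for illustration
for step, loss in enumerate(losses):
    writer.add_scalar("train/loss", loss, step)  # shown under the "train/loss" tag
writer.close()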

Note: We strongly recommend fine-tuning in BF16 format to avoid the problem of the loss becoming NaN.

  3. Inference with the fine-tuned model

By running peft_infer.py, you can generate text with the fine-tuned model. Configure the path to your fine-tuned model according to the instructions in the code, then run:

python peft_infer.py

You can use the fine-tuned model for inference.
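As a hedged sketch of what loading a fine-tuned LoRA adapter for inference typically looks like (the checkpoint path is hypothetical, and the actual loading and generation code in peft_infer.py may differ):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "THUDM/cogvlm2-llama3-chat-19B"  # base model from Hugging Face
ADAPTER = "output/checkpoint-xxx"       # hypothetical path to your LoRA checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the fine-tuned LoRA weights
model.eval()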