Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducing the stage1 and stage2 Model problem on L40s #27

Closed
cydiachen opened this issue Feb 5, 2024 · 14 comments
Closed

Reproducing the stage1 and stage2 Model problem on L40s #27

cydiachen opened this issue Feb 5, 2024 · 14 comments

Comments

@cydiachen
Copy link

Thank you for your excellent job.
I followed your work and download the released dataset from your link.
Since you have kindly provided an end-to-end script and processed dataset file. I thought we can quickly reproduce your excellent work. But After two days of training, we get our LLaVA-phi2 model. It can infer by your code.

But It can not reproduce the excellent accuracy in your paper. Would you mind sharing any train logs or detailed information with us. Therefore, we can debug the training process and find out what happened.

@LinB203
Copy link
Member

LinB203 commented Feb 5, 2024

Do you mean the checkpoint of stage 2? We do not mention the results of stage 2 in paper. The result of table 3 you want to reproduce or table 7?

@cydiachen
Copy link
Author

Do you mean the checkpoint of stage 2? We do not mention the results of stage 2 in paper. The result of table 3 you want to reproduce or table 7?

Exactly. I carefully read your paper and find the relevant experimental result in Table.10 in the supplementary materials.
According to the 'phi-2' without MOE, the VQA^T and VQA^v2 scored 68.7 and 77.1. But our result scored 31 and 49. A large margin of the performance with the result.

@cydiachen
Copy link
Author

Addtionally, I think it is necessary to clarify our dataset used to reproduce.

  1. Stage1: --data_path ${JSON_FOLDER}/llava_image_.json
  2. Stage2: --data_path ${JSON_FOLDER}/la_tune_256k.json
    ${JSON_FOLDER}/lrv_tune_331k.json ${JSON_FOLDER}/lvis_tune_220k_.json
    ${JSON_FOLDER}/svit_tune_157k.json ${JSON_FOLDER}/nlp_tune.json \

@LinB203
Copy link
Member

LinB203 commented Feb 5, 2024

I think you misunderstood our paper. The LLaVA-phi in table 7 is not obtained by training with stage 2 data. Please refer to variant c of table 5 and the Effect of Training Strategy subsection to figure out the setup.

We did not validate the results of stage 2 with stage 2 data, but to make sure your results are consistent, we did just now. The result we got on textqa was 31.7 aligned with you.

By the way, if you want to get the better results, you can take the LLaVA-1.5 data (which is the stage 3 data in MoE-LLaVA) and train a non-MoE version. That would actually be an LLaVA-phi and have no connection to MoE-LLaVA.

@cydiachen
Copy link
Author

cydiachen commented Feb 5, 2024

I think you misunderstood our paper. The LLaVA-phi in table 7 is not obtained by training with stage 2 data. Please refer to variant c of table 5 and the Effect of Training Strategy subsection to figure out the setup.

We did not validate the results of stage 2 with stage 2 data, but to make sure your results are consistent, we did just now. The result we got on textqa was 31.7 aligned with you.

By the way, if you want to get the better results, you can take the LLaVA-1.5 data (which is the stage 3 data in MoE-LLaVA) and train a non-MoE version. That would actually be an LLaVA-phi and have no connection to MoE-LLaVA.

Thx a lot. This project is solid and open to the community. I will keep in touch with you to further explore the protential of the method.

@cydiachen
Copy link
Author

@LinB203
Hello, Lin. I am now working on integrating MIniCPM LLM with your work. Since the MiniCPM shares a large similarity with Phi-2. I followed the phi-2 pipeline and implement the whole pipeline. The model succeed in loading parameter correctly, but the stage1 pretraining suffer from large loss (~5). Is this phenomenon normal for the llm backbone?

@LinB203
Copy link
Member

LinB203 commented Feb 8, 2024

We have actually finished training MoE-LLaVA-minicpm. we provide all three stages train_state.json for reference. Please feel free to open a new issue if you have one.
stage3.json
stage1.json
stage2.json

@cydiachen
Copy link
Author

We have actually finished training MoE-LLaVA-minicpm. we provide all three stages train_state.json for reference. Please feel free to open a new issue if you have one. stage3.json stage1.json stage2.json

Thank you. My init loss is the same with you. I will reopen a new issue if more questions are met.

@LinB203
Copy link
Member

LinB203 commented Feb 8, 2024

We have actually finished training MoE-LLaVA-minicpm. we provide all three stages train_state.json for reference. Please feel free to open a new issue if you have one. stage3.json stage1.json stage2.json

Thank you. My init loss is the same with you. I will reopen a new issue if more questions are met.

Btw, we are training on 384×384 resolution. So the final loss maybe a little different.

As the json shows, the loss rises dramatically in the last few steps causing the last saved checkpoint to be unavailable. So I suggest you can save more checkpoints during the process. e.g. if you train 5198 steps in total, maybe 5000 steps will be much better than the last.

This seems to be a problem caused by minicpm, I haven't encountered it in other models.

@cydiachen
Copy link
Author

Btw, we are training on 384×384 resolution. So the final loss maybe a little different.

As the json shows, the loss rises dramatically in the last few steps causing the last saved checkpoint to be unavailable. So I suggest you can save more checkpoints during the process. e.g. if you train 5198 steps in total, maybe 5000 steps will be much better than the last.

This seems to be a problem caused by minicpm, I haven't encountered it in other models.

I am currently working on 336x336 resolution. I didn't came up with the phenomenon of increase loss on the end.
Unlucky , I met another problem. After Stage-2, my loss is exactly the same with you.
But when I intended to evaluate the whole result on TextVQA. The model seems to output endless and repeated results. The tokenizer and conversation template are aligned with llama-2, which might be align with MiniCPM.

Canon ODADADADADADADA                                                                                                                                                  
  0%|                                                                                                                               | 1/5000 [00:06<8:34:01,  6.17s/it]
OCRupupupupupupupupupupD D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D F ACRupupup" The small small small small small small s
mall small small small small small small small small a C RUP                                                                                                           
  0%|                                                                                                                               | 2/5000 [00:10<7:10:31,  5.17s/it]
ThisESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTESTEST                                                          
  0%|                                                                                                                               | 3/5000 [00:11<4:43:20,  3.40s/it]
No Single Single Single Single OCR                                                                                                                                     
  0%|                                                                                                                               | 4/5000 [00:16<5:18:11,  3.82s/it]
The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The
 The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The Th
e The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The T
he The The                                                                                                                                                             
  0%|▏                                                                                                                              | 5/5000 [00:20<5:37:25,  4.05s/it]
Number Number Number                                                      2,,,,,,,,,,,,,,,2, O,,,,2, O,, a player from the baseball baseball baseball player from the "
2,, O,, a                2, a player from the "2,                                                                                                                      
  0%|▏                                                                                                                              | 6/5000 [00:25<5:48:18,  4.18s/it]
The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The
 The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The Th
e The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The The T
he The The                                                                                                                                                             
  0%|▏                                                                                                                              | 7/5000 [00:29<5:55:32,  4.27s/it]
RoleEPEPEP OCR                                                                                                                                                         
  0%|▏                                                                                                                              | 8/5000 [00:34<5:59:31,  4.32s/it]
AITITOCR                                                                                                                                                               
  0%|▏                                                                                                                              | 9/5000 [00:38<6:09:30,  4.44s/it]
The Phot Phot Phot Phot Phot Phot Phot Phot Phot Phot Phot Phot L L L L L L L L ACR                                                                                    
  0%|▎                                                                                                                             | 10/5000 [00:43<6:09:02,  4.44s/it]
OffCR                                                                                                                                                                  
  0%|▎                                                                                                                             | 11/5000 [00:47<6:10:14,  4.45s/it]
OCR Honey Honey Honey Honey
  0%|▎                                                                                                                             | 12/5000 [00:52<6:11:34,  4.47s/it]
The OCRCR
  0%|▎                                                                                                                             | 13/5000 [00:56<6:12:15,  4.48s/it]
Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk Sk
  0%|▎                                                                                                                             | 14/5000 [01:01<6:12:51,  4.49s/it]
  0%|▎                                                                                                                             | 14/5000 [01:01<6:07:58,  4.43s/it]

The implementation of my MINICPM template are as follows.

conv_minicpm = Conversation(
    system="You are a helpful language and vision assistant. "
           "You are able to understand the visual content that the user provides, "
           "and assist the user with a variety of tasks using natural language.",
    roles=("USER", "ASSISTANT"),
    version="minicpm",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.LLAMA_2,
    sep="<s>",
    sep2="</s>",
)

@cydiachen cydiachen reopened this Feb 8, 2024
@LinB203
Copy link
Member

LinB203 commented Feb 9, 2024

Here is my conv template.

conv_minicpm = Conversation(
    system="A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.",
    roles=("USER", "ASSISTANT"),
    version="minicpm",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.TWO,
    sep=" ",
    sep2="</s>",
)

@bug-fixed
Copy link

bug-fixed commented Feb 19, 2024

@LinB203 Hi Lin, thanks for your great work and thoughtful interactions.
I have a question, from the finetune_moe.sh,

https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/c7a5a42efe8dbd092d1c8e51e6265996f5a138b8/scripts/v1/phi2/finetune_moe.sh#L16C26-L16C62

The final MoE-LLaVA is finetuned from a Stage 2 finetuned checkpoint. I finetuned one Stage 2 checkpoint as your shared in https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/scripts/v1/phi2/finetune.sh. The result is

yes/no: 78.6
number: 21.48
other: 41.93
overall: 54.72

I'm not sure if this result is reasonable. Would you please share some evaluation metrics of this checkpoint on the VQAv2 dataset? It would be much appreciated if you could share these checkpoints. Thanks.

@cydiachen
Copy link
Author

@LinB203 Hi Lin, thanks for your great work and thoughtful interactions. I have a question, from the finetune_moe.sh,

https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/c7a5a42efe8dbd092d1c8e51e6265996f5a138b8/scripts/v1/phi2/finetune_moe.sh#L16C26-L16C62

The final MoE-LLaVA is finetuned from a Stage 2 finetuned checkpoint. I finetuned one Stage 2 checkpoint as your shared in https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/scripts/v1/phi2/finetune.sh. The result is

yes/no: 78.6
number: 21.48
other: 41.93
overall: 54.72

I'm not sure if this result is reasonable. Would you please share some evaluation metrics of this checkpoint on the VQAv2 dataset? It would be much appreciated if you could share these checkpoints. Thanks.

You can find an accuracy score in the results. You can check them.
In addition, You can evaluate your model offline on TextVQA dataset.
In my reproduction(More gradient accumulation), the VQA-v2 and textvqa score is slightly below the report. But the difference is within 1%.

@bug-fixed
Copy link

@LinB203 Hi Lin, thanks for your great work and thoughtful interactions. I have a question, from the finetune_moe.sh,
https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/c7a5a42efe8dbd092d1c8e51e6265996f5a138b8/scripts/v1/phi2/finetune_moe.sh#L16C26-L16C62
The final MoE-LLaVA is finetuned from a Stage 2 finetuned checkpoint. I finetuned one Stage 2 checkpoint as your shared in https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/scripts/v1/phi2/finetune.sh. The result is

yes/no: 78.6
number: 21.48
other: 41.93
overall: 54.72

I'm not sure if this result is reasonable. Would you please share some evaluation metrics of this checkpoint on the VQAv2 dataset? It would be much appreciated if you could share these checkpoints. Thanks.

You can find an accuracy score in the results. You can check them. In addition, You can evaluate your model offline on TextVQA dataset. In my reproduction(More gradient accumulation), the VQA-v2 and textvqa score is slightly below the report. But the difference is within 1%.

Hi @cydiachen , many thanks for your kind reply and your shared information. Greatly appreciated!
I run the evaluation on the TextVQA dataset and get a score of 33% (before MoE) and 47% (after MoE). Is this result reasonable? I checked your previous comments in this thread, our results seemed similar on this dataset.
But in Table 10 of the paper, the score is 67.8% (without MoE) and 68.7% (with MoE), which had a large higher margin to our results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants