About Visual Prompt Encoder and Contrastive Alignment #85

Open · hao416 opened this issue Aug 23, 2024 · 52 comments

@hao416 commented Aug 23, 2024

Hello, authors. I would like to ask two questions. 1. How do you handle the box query features and point query features after deformable cross-attention? Do you concatenate them? 2. How do you get the corresponding text prompt embeddings (e.g. for "cat", "dog") from the [CLS] token output?

@Mountchicken (Collaborator)

Hi @hao416
Sorry for the late reply. 1. During training, box prompts and point prompts are trained in different iterations, i.e. they are never used at the same time. 2. CLIP adds a [CLS] token to the input sentence by default, and we extract the feature of that [CLS] token from CLIP's output.

@hao416 (Author) commented Aug 24, 2024 via email

@Mountchicken (Collaborator)

Let's say we have four labels: a yellow dog, cat, person, a giant apple. We pass these four phrases or category names to CLIP four times and get their corresponding text embeddings. Here is a brief example:

a yellow dog [CLS] -> CLIP -> [CLS]
cat [CLS] -> CLIP -> [CLS]
person [CLS] -> CLIP -> [CLS]
a giant apple [CLS] -> CLIP -> [CLS]

We concatenate these four text embeddings to get a tensor of shape 4×C and use it for loss computation.
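
For illustration, here is a minimal sketch of this step using the Hugging Face CLIP text model (the model name and the use of pooler_output are assumptions based on later replies in this thread, not the authors' exact code):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["a yellow dog", "cat", "person", "a giant apple"]
inputs = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = text_model(**inputs)

# pooler_output is the pooled [CLS]/[EOS] feature, one vector per phrase
text_embeddings = outputs.pooler_output  # shape: (4, C)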

@hao416 (Author) commented Aug 24, 2024 via email

@Mountchicken (Collaborator)

Indeed, we need to pad image 1 to 5 prompts.
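
For example, a minimal sketch of this padding, assuming per-image prompt embeddings are padded to the largest prompt count in the batch with a validity mask (the helper and variable names are illustrative, not the authors' code):

import torch

def pad_prompts(emb, K):
    # emb: (num_prompts, D); pad with zeros up to K prompts and mark the valid slots
    pad = torch.zeros(K - emb.shape[0], emb.shape[1])
    mask = torch.cat([torch.ones(emb.shape[0]), torch.zeros(K - emb.shape[0])]).bool()
    return torch.cat([emb, pad], dim=0), mask

emb_img1 = torch.randn(2, 256)   # image 1 has 2 prompts
emb_img2 = torch.randn(5, 256)   # image 2 has 5 prompts
K = max(emb_img1.shape[0], emb_img2.shape[0])
padded1, mask1 = pad_prompts(emb_img1, K)  # (5, 256), only the first 2 rows are valid
padded2, mask2 = pad_prompts(emb_img2, K)  # (5, 256), all rows valid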

@hao416 (Author) commented Aug 24, 2024 via email

@hao416 (Author) commented Aug 26, 2024 via email

@Mountchicken (Collaborator)

Q1: K is the number of categories; in your case K = 2.
Q2: If the 2 cats or 3 dogs come from one image, they are 'averaged' by taking the aggregator token as the output. If they come from different images, they are averaged by computing the mean of the per-image embeddings.

@hao416 (Author) commented Aug 26, 2024 via email

@Mountchicken (Collaborator)

During the training process, we only need to use the aggregator, and this is independent of batch size. This is because, during training, we generate prompts only within the same image, meaning that the embeddings for objects like dogs and cats are used only within the current image. However, during inference, we can obtain an embedding from multiple images. For example, if we have two images, each with three dogs, we would first use the aggregator to extract the prompts for the three dogs in each image to obtain their respective embeddings. Then, we average the embeddings obtained from these two images to get the final embeddings.
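
As a minimal sketch of that inference-time step (the function and variable names are hypothetical, not the authors' code):

import torch

def aggregate_across_images(per_image_embeddings):
    # per_image_embeddings: list of (D,) aggregator outputs, one per prompted image
    return torch.stack(per_image_embeddings, dim=0).mean(dim=0)

# e.g. two images, each contributing one "dog" embedding from its aggregator token
emb_img1 = torch.randn(256)
emb_img2 = torch.randn(256)
final_dog_embedding = aggregate_across_images([emb_img1, emb_img2])  # shape: (256,)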

@hao416 (Author) commented Aug 26, 2024 via email

@Mountchicken (Collaborator)

  1. Given an image, if there are M categories, we will finally get M visual prompt embeddings, one per category.
  2. K is not a hyperparameter. It is the number of categories in the current image. If you are using batch training, K will be the largest number of categories in the batch.
  3. The content embedding has shape 1×D and is copied M times to get an M×D tensor.

@hao416 (Author) commented Aug 27, 2024

OK, I read the paper and your replies again and I have understood answers 1 and 2. Lastly, I want to confirm the form of the content embedding. In my code, I set content_embedding = nn.Embedding(1, 256). 1. I take the final vector from the outputs after (ms-deform-attn -> self-attn -> ffn), namely query[:, -1, :]. 2. I only copy content_embedding.weight M times after (ms-deform-attn -> self-attn -> ffn). Thanks

@Mountchicken (Collaborator)

Here is an example. Say three boxes are selected to get the visual prompt embedding for "dog". You first broadcast the content embedding three times and concatenate it with the aggregator token, which gives you a 4x256 tensor. Together with the position embeddings, these queries pass through deformable cross-attn -> self-attn -> ffn. Finally, the output at the aggregator position is used as the final visual prompt embedding.
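
A minimal sketch of that query construction (module names are illustrative and the attention stack is only indicated in comments; this is not the authors' actual implementation):

import torch
import torch.nn as nn

D = 256
content_embedding = nn.Embedding(1, D)   # shared content embedding
aggregator_token = nn.Embedding(1, D)    # aggregator ([CLS]-like) token

num_boxes = 3  # e.g. three "dog" boxes selected as visual prompts
content_queries = content_embedding.weight.expand(num_boxes, D)         # (3, 256)
queries = torch.cat([content_queries, aggregator_token.weight], dim=0)  # (4, 256)

# Position embeddings come from the 3 prompt boxes plus the whole-image box
# [0.5, 0.5, 1, 1] for the aggregator (see later in this thread). The queries then
# pass through deformable cross-attn -> self-attn -> ffn, and the output at the
# aggregator position (index -1) is taken as the visual prompt embedding.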

@hao416 (Author) commented Aug 27, 2024

Ok, I got it. Thank you very much!!!

@hao416 (Author) commented Aug 28, 2024

Dear author, I see that Grounding DINO remaps the category_id. For example, an image has two categories, cat and dog, where cat's id is 4 and dog's id is 5 in the dataset; Grounding DINO re-indexes them from 0 so that cat -> 0, dog -> 1. Do you do the same?

@hao416 (Author) commented Sep 3, 2024

Dear author, I want to know how you train your model. In Table 6 of the paper, do you train your model on these datasets one by one, or concatenate them into one larger dataset?

@Mountchicken (Collaborator)

Dear author, I want to know how you train your model. In Table 6 of the paper, do you train your model on these datasets one by one, or concatenate them into one larger dataset?

We concatenate those datasets into one for training.

@Mountchicken (Collaborator)

Dear author, I see that Grounding DINO remaps the category_id. For example, an image has two categories, cat and dog, where cat's id is 4 and dog's id is 5 in the dataset; Grounding DINO re-indexes them from 0 so that cat -> 0, dog -> 1. Do you do the same?

We don't have any special processing for the category id; we simply reuse the original id from its dataset.

@hao416 (Author) commented Sep 3, 2024

Dear author, I see that Grounding DINO remaps the category_id. For example, an image has two categories, cat and dog, where cat's id is 4 and dog's id is 5 in the dataset; Grounding DINO re-indexes them from 0 so that cat -> 0, dog -> 1. Do you do the same?

We don't have any special processing for the category id; we simply reuse the original id from its dataset.

OK, thanks. I notice that you use denoising training in the paper, which is associated with class_num and the original id in the dataset. As you know, DINO's label_enc = nn.Embedding(dn_labelbook_size + 1, hidden_dim). Suppose I have 2 datasets, A (10 categories) and B (20 categories); do you set label_enc = nn.Embedding(30 + 1, hidden_dim)? And if id 1 is person in A while id 1 is table in B, how do you deal with it? Fuse the 2 datasets and re-index the categories from 0 to 29? Thanks

@Mountchicken (Collaborator)

Since in the open-set task we cannot pre-assign IDs to all the object categories in our datasets, we do not compute the classification DN loss, only the box noise loss.

@hao416 (Author) commented Sep 3, 2024

Since in the open-set task we cannot pre-assign IDs to all the object categories in our datasets, we do not compute the classification DN loss, only the box noise loss.

ok, thanks

@hao416 (Author) commented Sep 3, 2024

Sorry, I have another question. Is the [CLS] token feature in your text model the same as the [EOS] token feature in the original CLIP paper?

@Mountchicken reopened this Sep 4, 2024
@Mountchicken (Collaborator)

Yes. If you are using CLIP from Hugging Face, you can get the [CLS] token like this:

from transformers import CLIPTokenizer, CLIPTextModel
tokenizer = CLIPTokenizer.from_pretrained(pretrained_name)
model = CLIPTextModel.from_pretrained(pretrained_name)
inputs = tokenizer(["cat"], return_tensors="pt")
outputs = model(**inputs)
pooled_feature = outputs.pooler_output  # pooled [CLS]/[EOS] feature

@hao416 (Author) commented Sep 4, 2024 via email

@hao416 (Author) commented Sep 5, 2024

Dear author, in the visual prompt encoder I define parameters for both the box and the point prompts. I notice you said you train the box prompt and the point prompt in different iterations, but I run into problems with PyTorch when I use multiple GPUs: it reports that some parameters did not receive gradients. So my question is, do I need to freeze some weights in different iterations? Thanks

@Mountchicken (Collaborator)

Hi @hao416
There are two solutions. The first is to set find_unused_parameters=True. Here is an example:

model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.gpu],
    find_unused_parameters=True)

The second is to add the parameters of the unused module to the computation with a zero-weight term. Here is an example:

box_embedding_layer = nn.Linear(4, 256)
point_embedding_layer = nn.Linear(2, 256)

# For a box iteration: touch every parameter of the unused point branch with a
# zero-weight term so that DDP sees a gradient for it
embedding = box_embedding_layer(box)
for param in point_embedding_layer.parameters():
    embedding = embedding + param.sum() * 0.0

# For a point iteration: do the same with the unused box branch
embedding = point_embedding_layer(point)
for param in box_embedding_layer.parameters():
    embedding = embedding + param.sum() * 0.0

@hao416 (Author) commented Sep 5, 2024 via email

@CatfishW

def visual_prompt_cross_attention(self, support_feat, memory, query_mask_flatten):
    # Start from the shared content embedding and expand it to the size of support_feat
    Q = self.content_embedding.weight[None, :]
    Q = Q.expand(support_feat.shape[0], support_feat.shape[1], support_feat.shape[2])
    # Cross-attention: content queries (with support_feat as positional embedding)
    # attend to the flattened image memory
    Q_ = self.cross_attention_vp(
        self.with_pos_embed(Q.transpose(0, 1), support_feat.transpose(0, 1)),
        memory.transpose(0, 1),
        memory.transpose(0, 1),
        query_mask_flatten,
    )[0].transpose(0, 1)
    Q = Q + self.cross_attention_vp_dropout(Q_)
    Q = self.cross_attention_vp_norm(Q)
    # Self-attention among the prompt queries
    q = k = self.with_pos_embed(Q, support_feat)
    Q_, _ = self.self_attn(q, k, value=Q, attn_mask=None)
    Q = Q + self.dropout_post(Q_)
    support_feat = self.norm_post(Q)
    return support_feat

Hi author, I reproduced part of this following your structure. Could you please check whether this function, which extracts the prompt features with cross attention, is written correctly?

@CatfishW

[screenshot of the implementation attached]

@Mountchicken (Collaborator)

@CatfishW
Sorry for the late reply. The implementation looks fine to me. For the detailed implementation, you can refer to this code:
https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/models/GroundingDINO/transformer.py#L802

@hao416 (Author) commented Oct 18, 2024

Dear author, I have a question: do you freeze the weights of the text prompt encoder during training? Thanks

@Mountchicken (Collaborator)

Hi @hao416
We don't freeze the CLIP text encoder during training.

@hao416 (Author) commented Oct 18, 2024

Hi @hao416 We don't freeze the CLIP text encoder during training.

OK, thanks for your reply, but I have two questions:

  1. Recent works show that if models do not freeze the CLIP text encoder, it may perturb the model weights and hurt the final performance. Did you study the corresponding impact?
  2. I notice that you mentioned 8 epochs for visual prompts and 1 epoch for text prompts. I am trying to reproduce your model, but I changed this setting to 4 epochs for visual prompts and 1 epoch for text prompts, limited by the number of GPU devices. I find that the text prompt results do not improve on the visual prompt results. For example, suppose mAP is 18.0 after 4 epochs with visual prompts; mAP may be 11.0 after the first epoch with text prompts, i.e. the 5th epoch in total. It seems as if the whole model is being trained from scratch. Is this normal? How many epochs do you use?
    Thanks

@Mountchicken (Collaborator)

  1. We tried both freezing and fine-tuning CLIP and found no particular difference between the two; fine-tuning performs a little better.

  2. We train with 8 iterations of visual prompts and then one iteration of text prompts, not epochs.

@hao416 (Author) commented Oct 18, 2024

We train with 8 iterations of visual prompts and then one iteration of text prompts, not epochs.

OK, I misunderstood this before and I understand it now. So at test time I only need to choose a specific prompt type, visual or text, to get the final detection results, right?

@Mountchicken (Collaborator)

Yes. During inference you can use either the text prompt or the visual prompt.

@hao416 (Author) commented Oct 18, 2024 via email

@hao416 (Author) commented Oct 22, 2024

Yes. During inference you can use either the text prompt or the visual prompt.

Dear author, I want to ask a question about training. For example, you mentioned the O365/GoldG datasets for text prompt training and O365/OpenImages for visual prompt training in the paper. The question is: when training with text prompts on O365, the model has missed the images from the first 8 iterations for text prompt training, because they were used for visual prompt training. I want to know how to handle the different iterations in one forward process.

In the DINO framework, how do you handle this in the for loop, namely:
for samples, targets in metric_logger.log_every(data_loader, print_freq, header, logger=logger):
    xxxxxxxxxxxxxxxxx

Thank you.

@Mountchicken (Collaborator)

In the actual code, we define two data loaders: one for the text prompt, call it text_loader, and another for the visual prompt, call it visual_loader. After every 8 iterations of the visual_loader, we iterate once over the text_loader. It can be implemented in the following way:

text_iter = iter(text_loader)
iteration = 0
for visual_batch in visual_loader:
    loss = model(visual_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # After every 8 visual-prompt iterations, run one text-prompt iteration
    if (iteration + 1) % 8 == 0:
        text_batch = next(text_iter)
        loss = model(text_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    iteration += 1

@hao416 (Author) commented Oct 22, 2024

In the actual code, we define two data loaders: one for the text prompt and another for the visual prompt. After every 8 iterations of the visual_loader, we iterate once over the text_loader. …

OK, thank you. But a dataset like O365, which is used for both visual and text prompt training, needs to appear in both text_loader and visual_loader?

@hao416 (Author) commented Oct 23, 2024

OK, thank you. But a dataset like O365, which is used for both visual and text prompt training, needs to appear in both text_loader and visual_loader?

Also, the number of images for text prompts is greater than that for visual prompts. How do you make sure that all images are used for text prompt training by the time the visual_batch for loop ends in your code template?

@CatfishW

Hi author,
[screenshot attached]
I'd like to ask: for the deformable cross-attention part, what should be the reference point for the [CLS] token?

@Mountchicken (Collaborator)

Also, the number of images for text prompts is greater than that for visual prompts. How do you make sure that all images are used for text prompt training by the time the visual_batch for loop ends in your code template?

You don't need to train all the text prompt data.

@Mountchicken (Collaborator)

Hi author, I'd like to ask: for the deformable cross-attention part, what should be the reference point for the [CLS] token?

For the aggregator token in the visual prompt encoder, we use a box of the full image size (i.e. [0.5, 0.5, 1, 1] in normalized xywh format) for its position embedding.
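
As a minimal sketch of that detail (only the box construction is shown; the sine/box position-embedding helper depends on the codebase and is not reproduced here):

import torch

# The aggregator token's reference box covers the whole image,
# in normalized (cx, cy, w, h) format
batch_size = 2  # illustrative
aggregator_box = torch.tensor([[0.5, 0.5, 1.0, 1.0]]).repeat(batch_size, 1)  # (B, 4)
# aggregator_box is then fed to the same box position-embedding function
# used for the regular box prompts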

@hao416 (Author) commented Oct 29, 2024

You don't need to train all the text prompt data.
OK, thank you! I tried to train my model with your strategy: I used 16 H800s to train with text prompts (3M images) first, but the mAP is only 0.6 after 2 epochs. I also used the COCO dataset to test this strategy and the results are the same, but after 4 epochs it improved, finally to about 9.5 mAP (lr_drop is 6). In your paper, with the text prompt only and no visual prompt, the model reaches 46.4 on COCO. So I do not know whether this trend is normal. I really want to reproduce this model in my project, but I may be running into some problems.

@Mountchicken (Collaborator)

We first train the model only on text prompt data to give it basic text-prompt detection capability. Then we jointly train the visual prompt along with the text prompt. Maybe you need to train on text prompts only first.

@hao416 (Author) commented Oct 29, 2024

We first train the model only on text prompt data to give it basic text-prompt detection capability. Then we jointly train the visual prompt along with the text prompt. Maybe you need to train on text prompts only first.

OK, I see; I had gathered that from your previous answers to other issues. The model is now being trained with text prompts only. My question is how long, or to what level (e.g. what mAP), it should be trained before it meets the requirement for joint training. Another question: we can maintain a global dictionary to sample negative text prompts, but how do you sample negative examples for the visual prompt? I just concatenate visual aggregator embeddings from other images in the current mini-batch. Thanks!

@Mountchicken (Collaborator)

In our experiments, we start the joint training when the text prompt reaches 45 mAP on COCO. For visual prompts, we can only sample negative prompts from the current mini-batch.
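
For illustration, a minimal sketch of sampling visual-prompt negatives from the current mini-batch (the tensors and category ids are made up; this is one possible way to do it, not the authors' code):

import torch

batch_embeddings = torch.randn(6, 256)                 # all visual prompt embeddings in the batch
batch_category_ids = torch.tensor([4, 5, 4, 7, 5, 9])  # their category ids (illustrative)

anchor_idx = 0  # the prompt we are computing the loss for
# negatives: embeddings in the batch whose category differs from the anchor's
negatives = batch_embeddings[batch_category_ids != batch_category_ids[anchor_idx]]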

@hao416 (Author) commented Oct 29, 2024

In our experiments, we start the joint training when the text prompt reaches 45 mAP on COCO. For visual prompts, we can only sample negative prompts from the current mini-batch.

OK, thank you very much!!!

@hao416 (Author) commented Oct 31, 2024

@Mountchicken Hi, dear author, sorry, I have one more question. I have trained the model with text prompts only for nearly 4 days, almost 4 full epochs on the O365, GoldG and Bamboo datasets, but the zero-shot mAP is only 5.3 on COCO. Convergence is very slow. Is that normal? I notice you said you trained for only about 3 days on 8xA100. The model is composed of an image encoder, a CLIP-B text encoder, the query selection layer proposed in Grounding DINO, and the other DINO components, which are the same as in DINO. The loss includes cls + L1 + GIoU + DN(box), where cls is the contrastive loss. Do you use other approaches or details to train the text prompt? I'm looking forward to your reply. Thanks!

@hao416 (Author) commented Oct 31, 2024

It seems that it takes a very long time to converge without interaction or fusion between the text and image features, like the fusion modules in Grounding DINO and other models.
