About Visual Prompt Encoder and Contrastive Alignment #85
Hi @hao416, sorry for the late reply. 1. During training, we train the box prompt and the point prompt in different iterations, i.e. they are never used at the same time. 2. CLIP adds a [CLS] token to the input sentence by default, and we extract the feature of the [CLS] token from CLIP's output.
Thanks, dear author, but I have another question. Grounding DINO concatenates many labels into one input sentence, so does T-Rex2 do the same? I saw you say in the GitHub issues that T-Rex2 uses phrases. If you use a whole sentence, I don't know how to locate the corresponding label embeddings from the [CLS] token, because its size is 1x516. But if you use distinct phrases, negative labels can't play a role. I'm sorry for my poor English; I look forward to your reply. Thanks again!
Let's say we have four labels: a yellow dog, cat, person, a giant apple. We pass these four phrases or category names to CLIP four separate times and get their corresponding text embeddings. Here is a brief example:
a yellow dog [CLS] -> CLIP -> [CLS]
cat [CLS] -> CLIP -> [CLS]
person [CLS] -> CLIP -> [CLS]
a giant apple [CLS] -> CLIP -> [CLS]
We concatenate these four text embeddings to get a tensor of shape 4xC and use it for loss computation.
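For reference, a minimal runnable sketch of this step; the checkpoint name and the encode_phrase helper are illustrative assumptions, not the repository's actual code.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative checkpoint; T-Rex2 uses a CLIP text encoder, exact weights may differ.
name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_model = CLIPTextModel.from_pretrained(name)

def encode_phrase(phrase: str) -> torch.Tensor:
    """Encode one phrase and return its pooled ([CLS]) feature of shape (C,)."""
    inputs = tokenizer(phrase, return_tensors="pt")
    return text_model(**inputs).pooler_output.squeeze(0)

labels = ["a yellow dog", "cat", "person", "a giant apple"]
text_embeds = torch.stack([encode_phrase(p) for p in labels])  # (4, C), used in the loss
```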
Oh, thanks, now I understand it correctly! Sorry, I also have a question. In the paper, you say the model randomly selects from 1 to n GT boxes as visual prompts. Now suppose I set the batch size to 2, with img1 and img2: the model gets 3 visual prompts from img1 and 5 from img2. Should I pad img1's 3 prompts to 5 for a batched operation, or use a Python for loop and run it twice? Thanks.
Indeed, we need to pad image1 to 5 prompts.
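As a rough illustration of that padding (the function name, shapes, and mask convention are assumptions, not the actual implementation), each image's box prompts can be padded to the batch maximum along with a validity mask:

```python
import torch

def pad_prompts(prompts_per_image, pad_value=0.0):
    """Pad a list of (n_i, 4) box-prompt tensors to (B, max_n, 4) plus a validity mask."""
    max_n = max(p.shape[0] for p in prompts_per_image)
    batch, masks = [], []
    for p in prompts_per_image:
        pad = max_n - p.shape[0]
        batch.append(torch.cat([p, p.new_full((pad, 4), pad_value)], dim=0))
        masks.append(torch.cat([torch.ones(p.shape[0], dtype=torch.bool),
                                torch.zeros(pad, dtype=torch.bool)]))
    return torch.stack(batch), torch.stack(masks)  # (B, max_n, 4), (B, max_n)

# img1 has 3 box prompts, img2 has 5 -> both padded to 5
boxes, valid = pad_prompts([torch.rand(3, 4), torch.rand(5, 4)])
```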
OK, thanks for your replies. You all did a great job. Best wishes!
Thanks, dear author. I'm sorry, but I may have two last questions. Here is an example: I get 2 "cat" prompts and 3 "dog" prompts. Question 1: in the visual prompt encoder, does K mean the total number of visual prompts (5) or the number of categories (2)? I saw you say in a GitHub issue that K is the number of categories in the contrastive loss. Question 2: visual prompts will be used as weights in the class predictions, so do I need to average the 2 cat prompts into one and the 3 dog prompts into one so that the model produces 2 class predictions? Thanks.
Q1: K is the number of categories, and in your case K = 2.
Q2: If the 2 cats or 3 dogs are from one image, they are 'averaged' by taking the aggregator token as output. If they are from different images, they are averaged by computing the mean of their embeddings.
OK, so you mean that if the batch size is 1, I only need to use the aggregator token, namely the universal class token C' in your paper, as the class prediction weights. If the batch size is greater than 1, I need to take every C' token and compute their mean as the final prediction weights. Right?
During the training process, we only need to use the aggregator, and this is independent of batch size. This is because, during training, we generate prompts only within the same image, meaning that the embeddings for objects like dogs and cats are used only within the current image. However, during inference, we can obtain an embedding from multiple images. For example, if we have two images, each with three dogs, we would first use the aggregator to extract the prompts for the three dogs in each image to obtain their respective embeddings. Then, we average the embeddings obtained from these two images to get the final embeddings.
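A minimal sketch of that inference-time averaging (the function name and the 256-dim size are assumptions): each image contributes one aggregator embedding, and the final prompt embedding is their mean.

```python
import torch

def merge_visual_prompt(per_image_embeds):
    """Average per-image aggregator embeddings (each of shape (D,)) into one prompt embedding."""
    return torch.stack(per_image_embeds, dim=0).mean(dim=0)

# e.g. two images, each contributing one 'dog' embedding from its aggregator token
dog_embed = merge_visual_prompt([torch.rand(256), torch.rand(256)])  # (256,)
```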
OK, thanks, author, I understand your reply. I'm now reproducing this work, so I'm sorry that I have many questions about the details. Finally, combined with your replies, a few points are still unclear to me. Again with an example: 2 cat objects and 3 dog objects in an image.
1. In the paper, you say "we randomly choose between one to all available GT boxes to use as visual prompts". Now suppose I take 2 cat boxes and 2 dog boxes to generate visual prompts. In the visual prompt encoder, K is the number of categories, so does that mean I need to sample one prompt per category (cat and dog) again, or feed all 4 prompts as inputs?
2. Is K a fixed hyperparameter?
3. The learnable content embedding is broadcast K times to KxD. I don't clearly understand this "broadcast"; does it mean the original dimension of the content embedding is 1xD?
OK, I read the paper and your replies again and I now understand answers 1 and 2. Lastly, I want to make sure about the form of the content embedding. In my code, I set content_embedding = nn.Embedding(1, 256). Do I: 1. take the final vector from the outputs after (msdeformattn -> self attn -> ffn), namely query[:, -1, :], or 2. only copy content_embedding.weight M times and then apply (msdeformattn -> self attn -> ffn)? Thanks.
Here is an example. Say there are three boxes selected to get the visual prompt embedding for dog. You first broadcast the content embedding three times and concatenate it with the aggregator, which gives you a 4x256 tensor. Together with the position embeddings, these pass through deform attn -> self attn -> ffn. Lastly, the output at the aggregator position is used as the final visual prompt embedding.
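To make the shapes concrete, here is a sketch of that flow; standard nn.MultiheadAttention stands in for multi-scale deformable attention, and all module names, the box position embedding, and the feed-forward sizes are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class VisualPromptEncoderSketch(nn.Module):
    """Illustrative sketch only: the real model uses multi-scale deformable attention."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.content = nn.Embedding(1, d_model)      # shared content embedding (1 x D)
        self.aggregator = nn.Embedding(1, d_model)   # universal class token C'
        self.box_pos = nn.Linear(4, d_model)         # stand-in for the box position embedding
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.ReLU(),
                                 nn.Linear(d_model * 4, d_model))

    def forward(self, boxes, image_feats):
        # boxes: (n, 4) normalized cxcywh; image_feats: (hw, d_model)
        n = boxes.shape[0]
        content = self.content.weight.expand(n, -1)                 # broadcast to (n, D)
        queries = torch.cat([content, self.aggregator.weight], 0)   # (n + 1, D)
        pos = torch.cat([self.box_pos(boxes), torch.zeros(1, queries.shape[1])], 0)
        q = (queries + pos).unsqueeze(0)
        q, _ = self.cross_attn(q, image_feats.unsqueeze(0), image_feats.unsqueeze(0))
        q, _ = self.self_attn(q, q, q)
        q = q + self.ffn(q)
        return q[0, -1]  # output at the aggregator position = visual prompt embedding

prompt = VisualPromptEncoderSketch()(torch.rand(3, 4), torch.rand(100, 256))  # (256,)
```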
OK, I got it. Thank you very much!
Dear author, I see that Grounding DINO re-maps category_id. For example, an image has two categories, cat and dog; cat's id is 4 and dog's id is 5 in the dataset, and Grounding DINO re-sorts them from 0 so that cat -> 0 and dog -> 1. Do you handle it the same way?
Dear author, I want to know how you train your model. In Table 6 of the paper, do you train on these datasets one by one, or concatenate them into one larger dataset?
We concatenate those datasets into one for training.
We don't have a special process for the category id; we simply reuse the original id from its dataset.
OK, thanks. I notice that you use denoising training in the paper, which is associated with class_num and the original id in the dataset. As you know, DINO uses label_enc = nn.Embedding(dn_labelbook_size + 1, hidden_dim). Suppose I have 2 datasets, A (10 categories) and B (20 categories): do you set label_enc = nn.Embedding(30 + 1, hidden_dim)? And if id 1 is person in A but table in B, how do you deal with that? Fuse the 2 datasets and re-sort the categories from 0 to 29? Thanks.
Since in the open-set task we cannot pre-assign an ID to every object category in our datasets, we do not compute the classification DN loss, only the box noise loss.
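For reference, a rough sketch of box-only noising in the DN-DETR/DINO style (the noise scale and clamping here are illustrative assumptions, not the paper's exact settings):

```python
import torch

def noise_boxes(gt_boxes, box_noise_scale=0.4):
    """Jitter GT boxes (cx, cy, w, h, normalized) to build denoising queries; box loss only, no class DN loss."""
    cxcy, wh = gt_boxes[:, :2], gt_boxes[:, 2:]
    cxcy = cxcy + (torch.rand_like(cxcy) * 2 - 1) * wh * 0.5 * box_noise_scale  # shift center within the box
    wh = wh * (1 + (torch.rand_like(wh) * 2 - 1) * box_noise_scale)             # scale width/height
    return torch.cat([cxcy, wh], dim=1).clamp(0, 1)

noised = noise_boxes(torch.tensor([[0.5, 0.5, 0.2, 0.3]]))
```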
OK, thanks.
Sorry, I have another question. Is the [CLS] token feature in your text model the same as the [EOS] token feature in the original CLIP paper?
Yes. If you are using CLIP from Hugging Face, then you can get the [CLS] token like this:

```python
model = CLIPTextModel.from_pretrained(pretrained_name)
outputs = model(**inputs)
pooled_feature = outputs.pooler_output
```
OK, thank you very much. You've helped me a lot.
Dear author, in the visual prompt encoder I define parameters for both box and point prompts. You said you train the box prompt and the point prompt in different iterations, but I run into problems with PyTorch when using multiple GPUs: it reports that some parameters did not receive gradients. So my question is, do I need to freeze some weights in different iterations? Thanks.
Hi @hao416
There are two solutions. The first one is to set find_unused_parameters=True. Here is an example:

```python
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.gpu],
    find_unused_parameters=True)
```

The second one is to add the parameters of the unused module to the computation. Here is an example:

```python
box_embedding_layer = nn.Linear(4, 256)
point_embedding_layer = nn.Linear(2, 256)

# for the box iteration
embedding = box_embedding_layer(box)
for param in point_embedding_layer.parameters():
    embedding = embedding + param.sum() * 0.0

# for the point iteration
embedding = point_embedding_layer(point)
for param in box_embedding_layer.parameters():
    embedding = embedding + param.sum() * 0.0
```
Thanks, I got it. I also searched online and found that one can freeze weights in different iterations. Additionally, I had tried the first solution before, but it didn't work well. Thank you again.
def visual_prompt_cross_attention(self, support_feat, memory, query_mask_flatten):

@CatfishW
Dear author, I have a question: do you freeze the weights of the text prompt encoder during training? Thanks.
Hi @hao416
OK, thanks for your reply, but I have two questions:
OK, I misunderstood it before, and I get it now. So at test time I only need to choose a specific prompt type, either visual or text prompts, to get the final detection results, right?
Yes. During inference you can use either the text prompt or the visual prompt.
OK, thank you very much.
Dear author, I want to ask a question about training. For example, you mention O365/GoldG datasets for text prompt training and O365/OpenImages for visual prompt training in the paper. The question is, when training text prompts on O365, the model has missed the images from the first 8 iterations for text prompt training because they were used for visual prompt training. I want to know how to handle the different iterations within one forward process, i.e., how to deal with it in the for loop of the DINO framework. Thank you.
In the actual code, we define two data loaders: one for the text prompt, assumed to be text_loader, and another for the visual prompt, assumed to be visual_loader. After every 8 iterations of the visual_loader, we iterate once over the text_loader. The implementation can be done in the following way:

```python
text_iter = iter(text_loader)  # wrap the DataLoader so we can pull one batch at a time
step = 0
for visual_batch in visual_loader:
    loss = model(visual_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # after every 8 visual-prompt iterations, run one text-prompt iteration
    if step % 8 == 0:
        text_batch = next(text_iter)
        loss = model(text_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    step += 1
```
OK, thank you. But a dataset like O365, which is used for both visual and text prompt training, needs to appear in both text_loader and visual_loader?
And the number of images for text prompts is greater than that for visual prompts, so how do you make sure all images are used for text prompt training by the time the visual_batch for loop ends in your code template?
You don't need to train on all the text prompt data.
We first train the model only on text prompt data to empower it with basic text prompt detection capability. Then we jointly train the visual prompt along with the text prompt. You may need to first train on text prompts only.
OK, I see; I had gathered that from your previous answers to other issues. Now the model is being trained with text prompts only. My question is how long, or to what level (such as mAP), it should train before it meets the requirement for joint training. Another question: we can maintain a global dictionary to sample negative text prompts, but how do we sample negative examples for visual prompts? I just concatenate visual aggregator embeddings from the other images in the current mini-batch. Thanks!
In our experiments, we start the joint training when the text prompt reaches 45 mAP on COCO. For visual prompts, we can only sample negative prompts from the current mini-batch.
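A minimal sketch of using in-batch aggregator embeddings as negatives when computing classification logits (all names and shapes here are assumptions, not the released code):

```python
import torch
import torch.nn.functional as F

def contrastive_logits(query_embeds, prompt_embeds):
    """query_embeds: (N, D) decoder queries; prompt_embeds: (K, D) = positives for this image
    plus aggregator embeddings gathered from the other images in the mini-batch (negatives)."""
    q = F.normalize(query_embeds, dim=-1)
    p = F.normalize(prompt_embeds, dim=-1)
    return q @ p.t()  # (N, K) similarity logits for the contrastive alignment loss

logits = contrastive_logits(torch.rand(900, 256), torch.rand(4, 256))
```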
@Mountchicken Hi, dear author, I'm sorry but I have another question. I have trained the model with text prompts only for nearly 4 days, almost 4 full epochs on the O365, GoldG, and Bamboo datasets, but the zero-shot mAP is only 5.3 on COCO. Convergence is very slow; is this normal? I notice you say it took you only about 3 days on 8xA100. My model is composed of an image encoder, a CLIP-B text encoder, the query selection layer proposed in Grounding DINO, and the other DINO components, which are the same as in DINO. The loss is cls + L1 + GIoU + DN(box), where cls is the contrastive loss. Do you use other approaches or details to train the text prompt? I'm looking forward to your reply. Thanks!
It seems that it takes a very long time to converge without interactions or fusion between text and image features, like the fusion modules in Grounding DINO or other models.
Hello, authors. I would like to ask two questions. 1. How do you deal with the box query features and point query features after deformable cross-attention: concatenate them? 2. How do you get the corresponding text prompt embeddings, such as for "cat" and "dog", from the [CLS] token output?