Issue running single gpu training script #229

Shidhanta95 · 2024-03-28T08:18:31Z

Hi, I am new to deep learning, so apologies if this question is trivial. I am using a modified version of the train_one_gpu script to train the MedSAM model on a dataset. The first time I ran the script I had no issues. But the second time I ran it, without making any changes, I got the following error:
"RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0"

I passed the code and tensor dimensions to ChatGPT and asked it to output the tensor sizes. Its answer suggests that there should be no mismatch in the tensor dimensions.

I am clueless as to why it runs the first time and then doesn't run again. I have attached screenshots of the first run and the error. If required, I can share my script as well.

Shidhanta95 · 2024-03-28T10:09:24Z

Solved: the issue was caused by Python 3.7. Running the script with Python 3.10 fixed it.

owenip · 2024-04-11T06:11:06Z

@Shidhanta95
I am facing the same problem, but with Python 3.10. May I ask what your training batch size is?
"RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0"
This problem only occurs when the batch size is > 1.

With batch size = 2, I reckon the problem could be coming from mask_decoder:
src = torch.repeat_interleave(image_embeddings, tokens.shape[0], dim=0)
tokens.shape: torch.Size([2, 7, 256])
image_embeddings.shape: torch.Size([2, 256, 64, 64])

With torch.repeat_interleave, each of the 2 image embeddings is repeated tokens.shape[0] = 2 times, so the shape of src becomes src.shape: torch.Size([4, 256, 64, 64])

But dense_prompt_embeddings.shape is torch.Size([2, 256, 64, 64]), which causes the size mismatch at dimension 0 (4 vs. 2).
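
The shape arithmetic above can be checked without PyTorch or a GPU. Below is a minimal sketch that tracks shapes as plain tuples. The helper names `repeat_interleave_shape`, `broadcast_shape`, and `decoder_src_shape` are hypothetical and not part of SAM or MedSAM, and it assumes `src` is next added to dense_prompt_embeddings, which is what the 4 vs. 2 error suggests.

```python
def repeat_interleave_shape(shape, repeats, dim=0):
    """Shape of torch.repeat_interleave(x, repeats, dim) for an integer `repeats`."""
    out = list(shape)
    out[dim] *= repeats
    return tuple(out)


def broadcast_shape(a, b):
    """Shape of a + b under broadcasting rules; raises ValueError on a mismatch."""
    # Left-pad the shorter shape with 1s so both shapes have the same rank.
    rank = max(len(a), len(b))
    a = (1,) * (rank - len(a)) + tuple(a)
    b = (1,) * (rank - len(b)) + tuple(b)
    out = []
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y and x != 1 and y != 1:
            raise ValueError(
                f"size of tensor a ({x}) must match size of tensor b ({y}) "
                f"at dimension {i}"
            )
        out.append(y if x == 1 else x)
    return tuple(out)


def decoder_src_shape(batch_size):
    """Hypothetical helper: shapes of src and of src + dense_prompt_embeddings."""
    tokens = (batch_size, 7, 256)
    image_embeddings = (batch_size, 256, 64, 64)
    dense_prompt_embeddings = (batch_size, 256, 64, 64)
    # Unconditional repeat, as in the mask_decoder line quoted above.
    src = repeat_interleave_shape(image_embeddings, tokens[0], dim=0)
    return src, broadcast_shape(src, dense_prompt_embeddings)


print(decoder_src_shape(1))  # batch size 1: shapes agree
try:
    decoder_src_shape(2)     # batch size 2: 4 vs. 2 at dimension 0
except ValueError as err:
    print("mismatch:", err)
```

With batch size 1 the repeat is a no-op, which would explain why the error only shows up for larger batches.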

@JunMa11 Sorry for tagging you out of the blue. I have spent hours on this with no success. Since you are the contributor of branch 0.1, perhaps you have encountered this problem before?
Thank you for your time.

4 replies

Shidhanta95 Apr 11, 2024
Author

@owenip
Hi, I was using a very small batch size of 1 or 2. I suspected the repeated torch.repeat_interleave to be the culprit as well, but it has been working fine for me since the Python upgrade.
I ran it through ChatGPT as well because I am terrible at tensors ( https://chat.openai.com/share/66904ee4-6497-4475-a911-efbf436248be ), but I am not sure how accurate the generated calculations are.

owenip Apr 11, 2024

Thanks for the quick response and the interesting ChatGPT log.
I will continue to investigate the issue. I'm probably going to cross-reference the SAM implementation in the Hugging Face library.

Shidhanta95 Apr 15, 2024
Author

Please do update the thread if you have any breakthroughs.

owenip Aug 16, 2024

Sorry for the late update. I found that the dimension 0 size mismatch has already been resolved by the modified mask decoder code in the MedSAM repo. I was getting the error because I was training my own SAM model with the mask decoder from the original SAM repo.
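
For anyone hitting the same error with the original SAM decoder, one way such a fix can work is to skip the repeat when there is already one image embedding per prompt. The sketch below is an assumption about the kind of guard involved, not a quote of the MedSAM code. It tracks shapes as plain tuples, and `src_shape` and its `guard` flag are hypothetical names.

```python
def src_shape(image_embeddings_shape, tokens_shape, guard=True):
    """Shape of `src` in the mask decoder, tracked as a plain tuple.

    guard=False mimics an unconditional repeat_interleave on dim 0.
    guard=True is an ASSUMED fix: repeat only when the batch sizes differ.
    """
    n_images = image_embeddings_shape[0]
    n_prompts = tokens_shape[0]
    if guard and n_images == n_prompts:
        # Already one image embedding per prompt: nothing to expand.
        return tuple(image_embeddings_shape)
    # Each image embedding is copied once per prompt along dim 0.
    return (n_images * n_prompts,) + tuple(image_embeddings_shape[1:])


# Batched training: 2 images and 2 prompts (shapes from this thread).
print(src_shape((2, 256, 64, 64), (2, 7, 256), guard=False))  # (4, 256, 64, 64)
print(src_shape((2, 256, 64, 64), (2, 7, 256), guard=True))   # (2, 256, 64, 64)
# One image with several prompts: the repeat is still applied.
print(src_shape((1, 256, 64, 64), (3, 7, 256), guard=True))   # (3, 256, 64, 64)
```

With the guard, `src` keeps batch size 2 and lines up with dense_prompt_embeddings of shape [2, 256, 64, 64], while the one-image, many-prompts case still gets expanded.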
