
Multiple source images #20

Open · Akhp888 opened this issue Mar 13, 2025 · 27 comments

@Akhp888 commented Mar 13, 2025

Is it possible to use multiple source images, each with its corresponding boxes, for inference on multiple images?

@jameslahm (Collaborator)

Thanks for your interest! We experimentally support passing multiple source images with corresponding boxes for inference in this branch: https://github.com/THU-MIG/yoloe/tree/multi-source-predict-vp. Could you please try it? Thanks!

@Akhp888 (Author) commented Mar 15, 2025

> Thanks for your interest! We experimentally support passing multiple source images with corresponding boxes for inference in this branch: https://github.com/THU-MIG/yoloe/tree/multi-source-predict-vp. Could you please try it? Thanks!

Thank you for the reply, James.

I see that the example in the branch you mentioned also uses a single source image. Does that mean I should apply the predictor method/class serially over multiple images?

source_image = 'ultralytics/assets/bus.jpg'

@jameslahm (Collaborator)

Hi, it actually predicts for multiple images (source_image and target_image) at once:

model.predict([source_image, target_image], save=True, prompts=visuals, predictor=YOLOEVPSegPredictor)

@Akhp888 (Author) commented Mar 15, 2025

Not sure if I get it right, but do you mean I should pass them in like this?
model.predict([source_image_1, source_image_2, target_image], save=True, prompts=visuals, predictor=YOLOEVPSegPredictor)

Since I have multiple images with different bboxes, I would expect the logic to take a list of dictionaries, where each dictionary holds the list of bboxes for a particular image?

@jameslahm (Collaborator) commented Mar 15, 2025

The format is like this:

  • visuals is a Dict with keys bboxes and cls
  • bboxes has the format List[List[Box]]
  • cls (class) has the format List[List[Cls]]
  • images has the format List[Image]

Each image can have a list of Box and Cls.
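As a concrete illustration, the prompts structure for two source images could be sketched like this (the box coordinates and class ids below are made up for illustration only):

```python
import numpy as np

# Hypothetical visual prompts for two source images.
# Outer lists have one entry per image; each entry holds that image's boxes / class ids.
visuals = dict(
    bboxes=[
        np.array([[50.0, 60.0, 200.0, 240.0]]),            # image 1: one box (x1, y1, x2, y2)
        np.array([[10.0, 20.0, 90.0, 80.0],
                  [120.0, 40.0, 300.0, 220.0]]),           # image 2: two boxes
    ],
    cls=[
        np.array([0]),      # class id of the box in image 1
        np.array([0, 1]),   # class ids of the boxes in image 2
    ],
)

# Boxes and class ids must line up one-to-one within each image.
for boxes, classes in zip(visuals["bboxes"], visuals["cls"]):
    assert len(boxes) == len(classes)
```

The list of images passed to model.predict would then have the same length as visuals["bboxes"].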

@Akhp888 (Author) commented Mar 15, 2025

> The format is like this:
>
> • visuals is a Dict with keys bboxes and cls
> • bboxes has the format List[List[Box]]
> • cls (class) has the format List[List[Cls]]
> • images has the format List[Image]
>
> Each image can have a list of Box and Cls.

Thanks,
this detail was very helpful for understanding the prompt structure. I will try it out and let you know soon.

@Akhp888 (Author) commented Mar 15, 2025

> The format is like this:
>
> • visuals is a Dict with keys bboxes and cls
> • bboxes has the format List[List[Box]]
> • cls (class) has the format List[List[Cls]]
> • images has the format List[Image]
>
> Each image can have a list of Box and Cls.

This was very insightful, thank you!
I was able to test on multiple source images and it worked, although detection degraded slightly with more source images (reference objects). I will experiment a little more before drawing conclusions.

Also, the current validation expects the target image to have at least one bbox/class in the dict. Can this requirement be removed, so that the predictor can still run when the target images have no prior bboxes?

@jameslahm (Collaborator) commented Mar 16, 2025

The target image in model.predict([source_image, target_image], ...) is actually treated as another source image ;). For cross-image prompts, you could refer to:

# Prompts in different images can be passed.
# Please set a smaller conf for cross-image prompts.
# model.predictor = None  # remove VPPredictor
model.predict(source_image, prompts=visuals, predictor=YOLOEVPSegPredictor, return_vpe=True)
model.set_classes(["object0", "object1"], model.predictor.vpe)
model.predictor = None  # remove VPPredictor
model.predict(target_image, save=True)

@jameslahm (Collaborator)

> I was able to test on multiple source images and it worked, although detection degraded slightly with more source images (reference objects). I will experiment a little more before drawing conclusions.

Would you mind sharing more details about the degraded detection? ;)

@Akhp888 (Author) commented Mar 16, 2025

> > I was able to test on multiple source images and it worked, although detection degraded slightly with more source images (reference objects). I will experiment a little more before drawing conclusions.
>
> Would you mind sharing more details about the degraded detection? ;)

The observed behavior was that when I added multiple source images, it missed some objects that it had detected with only one source image.

Note: I am working under the assumption that predictions on the target image will get better with more images/annotations, which I hope is the case?

@Akhp888 (Author) commented Mar 16, 2025

This is what I am trying to do:

source_image = 'testing_data/source_image.png'
source_image_2 = 'testing_data/source_image_2 .png'
target_image = 'testing_data/target_image.png'

model.predict([source_image, source_image_2], save=True, prompts=visuals, predictor=YOLOEVPSegPredictor)

model.set_classes(["object0", "object1", "object3"], model.predictor.vpe)
model.predictor = None
model.predict(target_image, save=True)

I am basically looking for cross-image prompts with multiple source images.

jameslahm assigned leonnil and unassigned leonnil Mar 19, 2025

@jameslahm (Collaborator)

> The observed behavior was that when I added multiple source images, it missed some objects that it had detected with only one source image.
>
> Note: I am working under the assumption that predictions on the target image will get better with more images/annotations, which I hope is the case?

Yes, it is expected that predictions will get better with more images/annotations.

> This is what I am trying to do:
>
> source_image = 'testing_data/source_image.png'
> source_image_2 = 'testing_data/source_image_2 .png'
> target_image = 'testing_data/target_image.png'
>
> model.predict([source_image, source_image_2], save=True, prompts=visuals, predictor=YOLOEVPSegPredictor)
>
> model.set_classes(["object0", "object1", "object3"], model.predictor.vpe)
> model.predictor = None
> model.predict(target_image, save=True)
>
> I am basically looking for cross-image prompts with multiple source images.

Could you please try setting return_vpe=True in model.predict([source_image, source_image_2], save=True, prompts=visuals, predictor=YOLOEVPSegPredictor) so that model.predictor.vpe is saved correctly? Thanks! Besides, would you mind sharing the test images with us so that we can reproduce the issue? Thanks!

@Ian-Work-AI
Hi @jameslahm, I have the same question.

Here is my code:

visuals = dict(
    bboxes=[
        np.array([[78.0, 202.0, 130.0, 333.0]]),
        np.array([[240.0, 240.0, 268.0, 283.0]]),
    ],
    cls=[
        np.array([0]),
        np.array([0]),
    ],
)

source_image0 = "source00.jpg"
source_image1 = "source01.jpg"
target_image = "target.jpg"

# model.predictor = None  # remove VPPredictor
model.predict([source_image0, source_image1], save=True, prompts=visuals, 
              predictor=YOLOEVPSegPredictor, return_vpe=True)
model.set_classes(["object0"], model.predictor.vpe)
model.predictor = None  # remove VPPredictor
model.predict(target_image, save=True)

And here is the error message:

0: 640x640 1 object0, 105.6ms
1: 640x640 1 object0, 105.6ms
Speed: 5.1ms preprocess, 105.6ms inference, 424.7ms postprocess per image at shape (1, 3, 640, 640)
Results saved to runs/segment/predict8
Traceback (most recent call last):
  File "/media/ian/disk/Ian/playground/yoloe/predict_visual_prompt_test.py", line 36, in <module>
    model.predict(target_image, save=True)
  File "/media/ian/disk/Ian/playground/yoloe/ultralytics/engine/model.py", line 551, in predict
    self.predictor.setup_model(model=self.model, verbose=is_cli)
  File "/media/ian/disk/Ian/playground/yoloe/ultralytics/engine/predictor.py", line 307, in setup_model
    self.model = AutoBackend(
  File "/media/ian/disk/Ian/playground/yoloe_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/media/ian/disk/Ian/playground/yoloe/ultralytics/nn/autobackend.py", line 148, in __init__
    model = model.fuse(verbose=verbose)
  File "/media/ian/disk/Ian/playground/yoloe/ultralytics/nn/tasks.py", line 233, in fuse
    m.fuse(self.pe.to(device))
  File "/media/ian/disk/Ian/playground/yoloe_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/media/ian/disk/Ian/playground/yoloe/ultralytics/nn/modules/head.py", line 443, in fuse
    conv.weight.data.copy_(w.unsqueeze(-1).unsqueeze(-1))
RuntimeError: output with shape [2, 256, 1, 1] doesn't match the broadcast shape [2, 2, 256, 1, 1]

@jameslahm (Collaborator)

@Ian-Work-AI Thanks for your interest! Because there are two source images, model.predictor.vpe will have the shape [2, 1, 512] (the first dimension is the number of source images, the second is the number of classes). It therefore needs to be aggregated into a single visual prompt embedding of shape [1, 1, 512]. If the box prompts refer to objects of the same category in the two source images, we can average their prompt embeddings to obtain the final one, like this:

model.set_classes(["object0"], model.predictor.vpe.mean(dim=0, keepdim=True).normalize(dim=-1, p=2))

Could you please try it? Thanks!

@Ian-Work-AI
Copy link

@jameslahm

model.set_classes(["object0"], model.predictor.vpe.mean(dim=0, keepdim=True).normalize(dim=-1, p=2))
AttributeError: 'Tensor' object has no attribute 'normalize'. Did you mean: 'normal_'?

Should I convert it to an np.array() before using normalize?

@jameslahm (Collaborator)

Sorry, could you please try to use torch.nn.functional.normalize? Thanks!

model.set_classes(["object0"], torch.nn.functional.normalize(model.predictor.vpe.mean(dim=0, keepdim=True), dim=-1, p=2))
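For intuition, the shape arithmetic behind this aggregation can be sketched with NumPy (F.normalize(dim=-1, p=2) is an L2 normalization along the last dimension; the 512-dim embedding is replaced by a tiny made-up 4-dim vector here):

```python
import numpy as np

# Stand-in for model.predictor.vpe: 2 source images x 1 class x 4-dim embedding
# (the real embedding is 512-dim; the values here are made up).
vpe = np.array([
    [[1.0, 0.0, 0.0, 0.0]],   # embedding from source image 1
    [[0.0, 1.0, 0.0, 0.0]],   # embedding from source image 2
])
assert vpe.shape == (2, 1, 4)

# Average over the source-image dimension, keeping it so the result stays 3-D.
mean_vpe = vpe.mean(axis=0, keepdims=True)                     # shape (1, 1, 4)

# L2-normalize along the last dimension, mirroring F.normalize(..., dim=-1, p=2).
final = mean_vpe / np.linalg.norm(mean_vpe, axis=-1, keepdims=True)

print(final.shape)  # (1, 1, 4): one aggregated, unit-norm embedding per class
```

With the real tensor, the aggregated result has the [1, 1, 512] shape that set_classes expects for a single class.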

@Ian-Work-AI
@jameslahm Fantastic!!

Here is the result with one prompt image:
[prompt image]

Target image:
[target image]

And here is the result with two prompt images:
[prompt image 1] [prompt image 2]

Target image:
[target image]

@Akhp888 (Author) commented Mar 23, 2025

> > The observed behavior was that when I added multiple source images, it missed some objects that it had detected with only one source image.
> >
> > Note: I am working under the assumption that predictions on the target image will get better with more images/annotations, which I hope is the case?
>
> Yes, it is expected that predictions will get better with more images/annotations.
>
> > This is what I am trying to do:
> > source_image = 'testing_data/source_image.png'
> > source_image_2 = 'testing_data/source_image_2 .png'
> > target_image = 'testing_data/target_image.png'
> > model.predict([source_image, source_image_2], save=True, prompts=visuals, predictor=YOLOEVPSegPredictor)
> > model.set_classes(["object0", "object1", "object3"], model.predictor.vpe)
> > model.predictor = None
> > model.predict(target_image, save=True)
> > I am basically looking for cross-image prompts with multiple source images.
>
> Could you please try setting return_vpe=True in model.predict([source_image, source_image_2], save=True, prompts=visuals, predictor=YOLOEVPSegPredictor) so that model.predictor.vpe is saved correctly? Thanks! Besides, would you mind sharing the test images with us so that we can reproduce the issue? Thanks!

Thanks for the reply.

Unfortunately, the data I am testing on is sensitive and can't be shared, but after a round of testing I think this is a limitation of the kind of data I am working with: the objects I am trying to detect are not the most common ones, and many of them sit in table-like structures.
I am fairly confident of this because even when I use one image as the source and predict on that same image as the target, the results are poor.

Source with ground truth:
[image]

Target (same image as source) with prediction:
[image]

I am now wondering whether fine-tuning the pretrained network on similar images would improve the results?

@Akhp888 (Author) commented Mar 23, 2025

> @jameslahm Fantastic!!
>
> Here is the result with one prompt image: [prompt image] [target image]
>
> And here is the result with two prompt images: [prompt image 1] [prompt image 2] [target image]

Good to see that it works pretty well for common objects 👍

I am planning to write an article on the module, and these results would be helpful for demonstrating YOLOE's strengths.

@Akhp888 (Author) commented Mar 23, 2025

Also, is it possible to substitute a YOLOv8 object detection model for the pretrained segmentation model?
I tried replacing the class with "from ultralytics.models.yolo.yoloe.predict_vp import YOLOEVPDetectPredictor", but it throws the error

AttributeError: 'DetectionModel' object has no attribute 'get_visual_pe'

@jameslahm (Collaborator) commented Mar 26, 2025

Hi @Akhp888, thanks!

> I am now wondering whether fine-tuning the pretrained network on similar images would improve the results?

Yes, we think that training the pretrained network on similar images can improve the results.

> Also, is it possible to substitute a YOLOv8 object detection model for the pretrained segmentation model?
> I tried replacing the class with "from ultralytics.models.yolo.yoloe.predict_vp import YOLOEVPDetectPredictor", but it throws the error
> AttributeError: 'DetectionModel' object has no attribute 'get_visual_pe'

Hi, for a detection-only model, you can drop the segmentation part from models like yoloe-v8l-seg.pt as below.

model = YOLOE("yoloe-v8l.yaml")
model.load("yoloe-v8l-seg.pt")

Then, you can use YOLOEVPDetectPredictor with this detection-only model.
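The reason this works is presumably that load() only transfers checkpoint parameters whose names and shapes also exist in the detection-only architecture, so the segmentation head is simply dropped. A toy, pure-Python sketch of that key-matching idea (all parameter names and shapes below are invented, not Ultralytics' actual ones):

```python
# Hypothetical parameter names/shapes; real YOLOE checkpoints differ.
seg_checkpoint = {
    "backbone.conv1.weight": (64, 3, 3, 3),
    "head.detect.cls.weight": (80, 256),
    "head.segment.proto.weight": (32, 256, 1, 1),  # segmentation-only parameter
}

# Parameters present in the detection-only architecture.
detect_model_params = {
    "backbone.conv1.weight": (64, 3, 3, 3),
    "head.detect.cls.weight": (80, 256),
}

# Transfer only the intersection of names with matching shapes;
# the segmentation-only entry is silently skipped.
transferred = {
    name: shape
    for name, shape in seg_checkpoint.items()
    if detect_model_params.get(name) == shape
}

print(sorted(transferred))  # the segmentation key is excluded
```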

@Akhp888 (Author) commented Mar 30, 2025

> Hi @Akhp888, thanks!
>
> > I am now wondering whether fine-tuning the pretrained network on similar images would improve the results?
>
> Yes, we think that training the pretrained network on similar images can improve the results.
>
> > Also, is it possible to substitute a YOLOv8 object detection model for the pretrained segmentation model?
> > AttributeError: 'DetectionModel' object has no attribute 'get_visual_pe'
>
> Hi, for a detection-only model, you can drop the segmentation part from models like yoloe-v8l-seg.pt as below.
>
> model = YOLOE("yoloe-v8l.yaml")
> model.load("yoloe-v8l-seg.pt")
>
> Then, you can use YOLOEVPDetectPredictor with this detection-only model.

Hi,
thanks for that suggestion.
I tried loading a YOLOv8 model that I had pretrained on similar images, loading it as you described. This time I didn't get an error, but the predictions are empty; nothing was detected.

Below is the code for your reference:

# load the model
model = YOLOE("yoloe-v8l.yaml")
model.load("pretrain/best.pt")

# source images
model.predict(images, save=True, prompts=visuals, predictor=YOLOEVPSegPredictor, return_vpe=True)
model.set_classes(
    ["object0", "object1", "object2", "object3", "object4"],
    torch.nn.functional.normalize(model.predictor.vpe.mean(dim=0, keepdim=True), dim=-1, p=2),
)

# target image
target_image = "image53.png"
model.predictor = None  # remove VPPredictor
model.predict([target_image], save=True, conf=0.3, iou=0.3)

@shataxiDubey
@jameslahm, it would be very helpful if the multi-visual-prompt feature (available in the multi_source_predict_vp branch) were integrated into the main branch.

Zero-shot detection with multiple visual prompts is feasible in the multi_source_predict_vp branch, but transferring the pretrained model to a custom dataset causes issues there.

The multi_source_predict_vp branch has various checks, such as requiring save_json to be True, that are not in the main branch. I also tried training with save_json = True, but I could not get the confusion matrix.

It would be useful to have both multiple visual prompts and issue-free training in one branch.
Thanks.

@jameslahm (Collaborator)

@Akhp888 Hi, did you train a YOLOv8 model rather than a YOLOE-v8 model?

@jameslahm (Collaborator)

@shataxiDubey Merged in c21bc24 ;)

@Akhp888 (Author) commented Mar 31, 2025

> @Akhp888 Hi, did you train a YOLOv8 model rather than a YOLOE-v8 model?

Yes, I trained a YOLOv8 model, since it was said that YOLOE can load YOLOv8 models out of the box.

@jameslahm (Collaborator)

Hi @Akhp888, we think you need to train a model based on yoloe-v8l-seg with visual prompts on your data, rather than training a YOLOv8 model.
