
[Outdate] Add vision capability #1926

Closed
wants to merge 14 commits into from

Conversation

BeibinLi
Collaborator

BeibinLi commented Mar 9, 2024

This PR has been moved to #2025.

We want a "vision capability" that can be added to conversable agents even when those agents are not connected to multimodal models.

See a feature overview in Issue #1975
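
A rough sketch of how the capability might be attached to a text-only agent (the lmm_config argument name and the model strings below are assumptions; the notebook in this PR shows the exact API):

```python
# Minimal sketch: attach a vision capability, backed by a GPT-4V config,
# to a text-only ConversableAgent.
from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

# Text-only agent driven by a regular (non-multimodal) model.
assistant = ConversableAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "..."}]},
)

# The capability transcribes incoming images into captions using an LMM config,
# so the text-only model can still reason about image content.
vision = VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "..."}]},
)
vision.add_to_agent(assistant)
```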

Why are these changes needed?

Related issue number

Checks

@codecov-commenter

codecov-commenter commented Mar 9, 2024

Codecov Report

Attention: Patch coverage is 71.21212%, with 19 lines in your changes missing coverage. Please review.

Project coverage is 48.18%. Comparing base (f78985d) to head (a7c27ca).

Files Patch % Lines
autogen/agentchat/conversable_agent.py 50.00% 8 Missing ⚠️
...gentchat/contrib/capabilities/vision_capability.py 84.78% 5 Missing and 2 partials ⚠️
autogen/agentchat/contrib/img_utils.py 0.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1926       +/-   ##
===========================================
+ Coverage   37.53%   48.18%   +10.64%     
===========================================
  Files          65       66        +1     
  Lines        6913     6963       +50     
  Branches     1521     1656      +135     
===========================================
+ Hits         2595     3355      +760     
+ Misses       4092     3326      -766     
- Partials      226      282       +56     
Flag Coverage Δ
unittests 48.05% <71.21%> (+10.51%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

@WaelKarkoub left a comment

Very cool!!

Contributor

@WaelKarkoub left a comment

Looks good 👍

@rickyloynd-microsoft
Contributor

rickyloynd-microsoft commented Mar 9, 2024

Awesome PR!!

From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?

Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?
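
For contrast, the single-model path being compared here looks roughly like this (a sketch only; the import path is assumed from the existing contrib module):

```python
# Sketch: a single multimodal model handles both vision and reasoning, so no
# caption step sits between the image and the replying model.
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

gpt4v_agent = MultimodalConversableAgent(
    name="gpt4v_agent",
    llm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "..."}]},
)
```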

@ekzhu
Collaborator

ekzhu commented Mar 9, 2024

Questions:

(1) If the user already has access to a vision model, why not just use it directly in the llm_config of the conversable agent, instead of adding a capability and doing twice as much inference?

(2) In the notebook, the figure creator agent is clearly using the nested chat pattern; can you use register_nested_chats instead of registering a reply function? Let's dog-food our own API more to set examples.

(3) I think we should not subclass AssistantAgent; we should consider it a sealed class. Instead, we should use ConversableAgent directly when the system message is customized. This is for code stability -- we might update AssistantAgent's default system message, and that can have unforeseeable consequences in the subclasses. So in our documentation we should not be subclassing AssistantAgent.
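
Regarding (2), a rough sketch of what switching to register_nested_chats could look like (agent names and chat_queue fields are illustrative, not taken from the notebook):

```python
# Sketch only: replace a hand-registered reply function with the built-in
# nested-chat mechanism.
from autogen import ConversableAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "..."}]}

coder = ConversableAgent(name="coder", llm_config=llm_config)
critic = ConversableAgent(name="critic", llm_config=llm_config)
figure_creator = ConversableAgent(name="figure_creator", llm_config=llm_config)

# When figure_creator receives a message from anyone outside the inner pair,
# it runs an inner chat (coder first, then critic) and replies with the summary.
figure_creator.register_nested_chats(
    chat_queue=[
        {"recipient": coder, "summary_method": "last_msg"},
        {"recipient": critic, "summary_method": "reflection_with_llm"},
    ],
    trigger=lambda sender: sender not in (coder, critic),
)
```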

@BeibinLi
Collaborator Author

Awesome PR!!

From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?

Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?

Good point. Your understanding is correct. I will add more explanations to describe the difference between them.

Currently (before this PR), the MultimodalAgent has special handling for the input message content. It will read the image and format the messages before calling the client. However, this implementation causes some issues during orchestration with other agents. For instance, the group chat manager is typically a language model without any vision capabilities, so it would not see the image when deciding which agent to route the message to.

There are a few different options to resolve this issue:

  1. Change the ConversableAgent completely by adding multimodal image processing functions. So, if the model inside llm_config is a multimodal model, it will perform multimodal image processing; otherwise, it will function as it does today.
  2. As in the current PR, include a vision capability when needed. In this case, the vision capability (connected to GPT-4V) will transcribe the image into a caption, and then the conversable agent has the context about what's in the image.
  3. Explore other methods (such as multi-inheritance, which is not preferred).
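
A rough sketch of option 2 in the group-chat scenario described above (agent names and constructor arguments are illustrative, not taken from this PR's tests):

```python
# Sketch: keep the group chat manager text-only, but let a vision capability
# transcribe images into captions before the manager reasons about routing.
from autogen import ConversableAgent, GroupChat, GroupChatManager
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

text_llm_config = {"config_list": [{"model": "gpt-4", "api_key": "..."}]}
gpt4v_llm_config = {"config_list": [{"model": "gpt-4-vision-preview", "api_key": "..."}]}

coder = ConversableAgent(name="coder", llm_config=text_llm_config)
commenter = ConversableAgent(name="commenter", llm_config=text_llm_config)

groupchat = GroupChat(agents=[coder, commenter], messages=[], max_round=6)
manager = GroupChatManager(groupchat=groupchat, llm_config=text_llm_config)

# Without this, the text-only manager never sees what is inside an image
# when deciding which agent should speak next.
VisionCapability(lmm_config=gpt4v_llm_config).add_to_agent(manager)
```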

@BeibinLi
Collaborator Author

BeibinLi commented Mar 10, 2024

Questions:

(1) If the user already has access to a vision model, why not just use it directly in the llm_config of the conversable agent, instead of adding a capability and doing twice as much inference?

(2) In the notebook, the figure creator agent is clearly using the nested chat pattern; can you use register_nested_chats instead of registering a reply function? Let's dog-food our own API more to set examples.

(3) I think we should not subclass AssistantAgent; we should consider it a sealed class. Instead, we should use ConversableAgent directly when the system message is customized. This is for code stability -- we might update AssistantAgent's default system message, and that can have unforeseeable consequences in the subclasses. So in our documentation we should not be subclassing AssistantAgent.

@ekzhu @LittleLittleCloud

  1. See my comments for Ricky above. The main problem is that regular conversable agents (such as a group chat manager or user_proxy_agent) cannot process images, even if we provide a GPT-4V llm_config for them. We need to add image processing utilities to ConversableAgent.

  2. Yes, good point. If you can help me set it up as a nested chat, that would be great.

  3. Do you mean class FigureCreator(AssistantAgent):? If yes, I am happy to change that to ConversableAgent.
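
For item 3, the change being discussed would look roughly like this (the system message below is a hypothetical placeholder, not the notebook's actual prompt):

```python
# Sketch: configure ConversableAgent directly with a custom system message
# instead of subclassing AssistantAgent.
from autogen import ConversableAgent

# Hypothetical system message; the notebook's real prompt is not reproduced here.
FIGURE_CREATOR_SYSTEM_MESSAGE = "You write and refine matplotlib code to create the requested figure."

figure_creator = ConversableAgent(
    name="figure_creator",
    system_message=FIGURE_CREATOR_SYSTEM_MESSAGE,
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "..."}]},
)
```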

@rickyloynd-microsoft
Contributor

Awesome PR!!
From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?
Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?

Good point. Your understanding is correct. I will add more explanations to describe the difference between them.

Currently (before this PR), the MultimodalAgent has special handling for the input message content. It will read the image and format the messages before calling the client. However, this implementation causes some issues during orchestration with other agents. For instance, the group chat manager is typically a language model without any vision capabilities, so it would not see the image when deciding which agent to route the message to.

There are a few different options to resolve this issue:

  1. Change the ConversableAgent completely by adding multimodal image processing functions. So, if the model inside llm_config is a multimodal model, it will perform multimodal image processing; otherwise, it will function as it does today.
  2. As in the current PR, include a vision capability when needed. In this case, the vision capability (connected to GPT-4V) will transcribe the image into a caption, and then the conversable agent has the context about what's in the image.
  3. Explore other methods (such as multi-inheritance, which is not preferred).

Thank you for the explanation.

Do you think that the caption bottleneck could be removed? For instance, instead of providing just a caption for the image, could the LMM inside VisionCapability be prompted to provide a full answer to the user's query? Then that response (instead of a mere caption) would be added to the last received message, with a preface such as "Based on analyzing the image, a possible response would be:". The base agent's LLM would then be able to copy this response, or modify it while taking into account extra information that may be added to the message by RAG, or memories from teachability. If so, this might combine the full power of MultimodalConversableAgent with other capabilities.
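
One way to prototype this idea, assuming the capability exposes its captioning prompt as a constructor argument (the description_prompt name below is an assumption, not a confirmed part of this PR):

```python
# Sketch only: steer the vision model toward answering the user's query rather
# than emitting a bare caption. `description_prompt` is an assumed parameter name.
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

vision = VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "..."}]},
    description_prompt=(
        "Based on analyzing the image and the user's last message, draft a possible "
        "full response (not just a caption) for the receiving agent to refine."
    ),
)
```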

@BeibinLi
Collaborator Author

Awesome PR!!
From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?
Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?

Good point. Your understanding is correct. I will add more explanations to describe the difference between them.
Currently (before this PR), the MultimodalAgent has special handling for the input message content. It will read the image and format the messages before calling the client. However, this implementation causes some issues during orchestration with other agents. For instance, the group chat manager is typically a language model without any vision capabilities, so it would not see the image when deciding which agent to route the message to.
There are a few different options to resolve this issue:

  1. Change the ConversableAgent completely by adding multimodal image processing functions. So, if the model inside llm_config is a multimodal model, it will perform multimodal image processing; otherwise, it will function as it does today.
  2. As in the current PR, include a vision capability when needed. In this case, the vision capability (connected to GPT-4V) will transcribe the image into a caption, and then the conversable agent has the context about what's in the image.
  3. Explore other methods (such as multi-inheritance, which is not preferred).

Thank you for the explanation.

Do you think that the caption bottleneck could be removed? For instance, instead of providing just a caption for the image, could the LMM inside VisionCapability be prompted to provide a full answer to the user's query? Then that response (instead of a mere caption) would be added to the last received message, with a preface such as "Based on analyzing the image, a possible response would be:". The base agent's LLM would then be able to copy this response, or modify it while taking into account extra information that may be added to the message by RAG, or memories from teachability. If so, this might combine the full power of MultimodalConversableAgent with other capabilities.

Yes, good idea.

  1. I think we can add this feature with process_last_received_message rather than process_all_messages_before_reply, because the question might be related to earlier conversations. What do you think? I can proceed and change the implementation.
  2. If we answer the question with the vision capability, why not use a multimodal agent directly? In that case, adding multimodal support directly to ConversableAgent would be better for this task; I think that design would be easier for users to understand.

I was reluctant to change ConversableAgent directly last year because the vision capabilities were still "experimental" at that time, so we created a brand-new MultimodalConversableAgent. Since (1) more and more users are using vision, (2) patching in a vision capability is neither elegant nor comprehensive, and (3) the conversable agent has already been changed several times in the past few months, I think adding vision directly into ConversableAgent might make more sense. What do you think, @sonichi?
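
For readers unfamiliar with the two hook points mentioned in item 1, here is a rough sketch of how a capability can register them on an agent (the helper functions are stand-ins, not this PR's implementation):

```python
from autogen import ConversableAgent

def caption_images_in_last_message(message):
    # Stand-in: would replace image tags in the last received message with captions.
    return message

def caption_images_in_history(messages):
    # Stand-in: would caption images in every message of the conversation history.
    return messages

agent = ConversableAgent(name="assistant", llm_config=False)

# Hook only the most recent incoming message...
agent.register_hook("process_last_received_message", caption_images_in_last_message)
# ...or hook the full message history before the agent replies, which also
# covers images that appeared earlier in the conversation.
agent.register_hook("process_all_messages_before_reply", caption_images_in_history)
```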

@sonichi
Contributor

sonichi commented Mar 11, 2024

Awesome PR!!
From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?
Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?

Good point. Your understanding is correct. I will add more explanations to describe the difference between them.
Currently (before this PR), the MultimodalAgent has special handling for the input message content. It will read the image and format the messages before calling the client. However, this implementation causes some issues during orchestration with other agents. For instance, the group chat manager is typically a language model without any vision capabilities, so it would not see the image when deciding which agent to route the message to.
There are a few different options to resolve this issue:

  1. Change the ConversableAgent completely by adding multimodal image processing functions. So, if the model inside llm_config is a multimodal model, it will perform multimodal image processing; otherwise, it will function as it does today.
  2. As in the current PR, include a vision capability when needed. In this case, the vision capability (connected to GPT-4V) will transcribe the image into a caption, and then the conversable agent has the context about what's in the image.
  3. Explore other methods (such as multi-inheritance, which is not preferred).

Thank you for the explanation.
Do you think that the caption bottleneck could be removed? For instance, instead of providing just a caption for the image, could the LMM inside VisionCapability be prompted to provide a full answer to the user's query? Then that response (instead of a mere caption) would be added to the last received message, with a preface such as "Based on analyzing the image, a possible response would be:". The base agent's LLM would then be able to copy this response, or modify it while taking into account extra information that may be added to the message by RAG, or memories from teachability. If so, this might combine the full power of MultimodalConversableAgent with other capabilities.

Yes, good idea.

  1. I think we can add this feature with process_last_received_message rather than process_all_messages_before_reply, because the question might be related to earlier conversations. What do you think? I can proceed and change the implementation.
  2. If we answer the question with the vision capability, why not use a multimodal agent directly? In that case, adding multimodal support directly to ConversableAgent would be better for this task; I think that design would be easier for users to understand.

I was reluctant to change ConversableAgent directly last year because the vision capabilities were still "experimental" at that time, so we created a brand-new MultimodalConversableAgent. Since (1) more and more users are using vision, (2) patching in a vision capability is neither elegant nor comprehensive, and (3) the conversable agent has already been changed several times in the past few months, I think adding vision directly into ConversableAgent might make more sense. What do you think, @sonichi?

Not sure what you mean by "adding vision directly". Do you refer to this PR or a new proposal?

@BeibinLi
Collaborator Author

BeibinLi commented Mar 13, 2024

@rickyloynd-microsoft @ekzhu @sonichi I have addressed all issues and concerns for this PR. Please take a look.

Here are two unaddressed comments, which will be handled in future PRs.

  1. Nested chat in the notebook: I will include nested chat in a separate PR, because it is unrelated to the vision capability.
  2. Answering the question directly in VisionCapability: I decided not to include this feature in VisionCapability for design conciseness. This can instead be handled by a multimodal conversable agent; see [Major Update 1] in the summary [Roadmap] Multimodal Orchestration #1975.

@BeibinLi
Collaborator Author

BeibinLi commented Mar 14, 2024

@afourney
Thanks for your suggestion!

Regarding your first comment, I have a prototype in PR #2013. It is still a work in progress, and I will make it ready for review soon.
Regarding the system message, good idea, and I have updated it!

@BeibinLi
Collaborator Author

@sonichi Updated~

@sonichi
Contributor

sonichi commented Mar 14, 2024

@sonichi Updated~

I meant making the PR from a branch in the upstream repo, as opposed to a forked repo, because we now use pull_request as the trigger for the openai workflows.

@BeibinLi
Collaborator Author

Closing this PR, and moving to #2025 for testing purposes.
