
[Outdate] Add vision capability #1926

Closed
wants to merge 14 commits into from

Conversation

BeibinLi
Collaborator

BeibinLi commented Mar 9, 2024

This PR has been moved to #2025.

We want a "vision capability" that can be added to conversable agents even when those agents are not connected to multimodal models.

See a feature overview in Issue #1975
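
A rough sketch of how the capability might be attached to a text-only agent (the lmm_config argument name and the model strings below are assumptions; the notebook in this PR shows the exact API):

```python
# Minimal sketch: attach a vision capability, backed by a GPT-4V config,
# to a text-only ConversableAgent.
from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

# Text-only agent driven by a regular (non-multimodal) model.
assistant = ConversableAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "..."}]},
)

# The capability transcribes incoming images into captions using an LMM config,
# so the text-only model can still reason about image content.
vision = VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "..."}]},
)
vision.add_to_agent(assistant)
```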

Why are these changes needed?

Related issue number

Checks

@codecov-commenter

codecov-commenter commented Mar 9, 2024

Codecov Report

Attention: Patch coverage is 71.21212%, with 19 lines in your changes missing coverage. Please review.

Project coverage is 48.18%. Comparing base (f78985d) to head (a7c27ca).

Files Patch % Lines
autogen/agentchat/conversable_agent.py 50.00% 8 Missing ⚠️
...gentchat/contrib/capabilities/vision_capability.py 84.78% 5 Missing and 2 partials ⚠️
autogen/agentchat/contrib/img_utils.py 0.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1926       +/-   ##
===========================================
+ Coverage   37.53%   48.18%   +10.64%     
===========================================
  Files          65       66        +1     
  Lines        6913     6963       +50     
  Branches     1521     1656      +135     
===========================================
+ Hits         2595     3355      +760     
+ Misses       4092     3326      -766     
- Partials      226      282       +56     
Flag Coverage Δ
unittests 48.05% <71.21%> (+10.51%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

@WaelKarkoub left a comment

Very cool!!

Contributor

@WaelKarkoub left a comment

Looks good 👍

@rickyloynd-microsoft
Contributor

rickyloynd-microsoft commented Mar 9, 2024

Awesome PR!!

From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?

Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?
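
For contrast, the single-model path being compared here looks roughly like this (a sketch only; the import path is assumed from the existing contrib module):

```python
# Sketch: a single multimodal model handles both vision and reasoning, so no
# caption step sits between the image and the replying model.
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

gpt4v_agent = MultimodalConversableAgent(
    name="gpt4v_agent",
    llm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "..."}]},
)
```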

@ekzhu
Collaborator

ekzhu commented Mar 9, 2024

Questions:

(1) If the user already has access to a vision model, why not just use it directly in the llm_config of the conversable agent, instead of adding a capability and doing twice as much inference?

(2) In the notebook, the figure creator agent is clearly using the nested chat pattern; can you use register_nested_chats instead of registering a reply function? Let's dog-food our own API more to set examples.

(3) I think we should not subclass AssistantAgent; we should consider it a sealed class. Instead, we should use ConversableAgent directly when the system message is customized. This is for code stability -- we might update AssistantAgent's default system message, and that can have unforeseeable consequences in the subclasses. So in our documentation we should not be subclassing AssistantAgent.
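
Regarding (2), a rough sketch of what switching to register_nested_chats could look like (agent names and chat_queue fields are illustrative, not taken from the notebook):

```python
# Sketch only: replace a hand-registered reply function with the built-in
# nested-chat mechanism.
from autogen import ConversableAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "..."}]}

coder = ConversableAgent(name="coder", llm_config=llm_config)
critic = ConversableAgent(name="critic", llm_config=llm_config)
figure_creator = ConversableAgent(name="figure_creator", llm_config=llm_config)

# When figure_creator receives a message from anyone outside the inner pair,
# it runs an inner chat (coder first, then critic) and replies with the summary.
figure_creator.register_nested_chats(
    chat_queue=[
        {"recipient": coder, "summary_method": "last_msg"},
        {"recipient": critic, "summary_method": "reflection_with_llm"},
    ],
    trigger=lambda sender: sender not in (coder, critic),
)
```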

@BeibinLi
Collaborator Author

Awesome PR!!

From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?

Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?

Good point. Your understanding is correct. I will add more explanations to describe the difference between them.

Currently (before this PR), the MultimodalAgent has special handling for the input message content. It will read the image and format the messages before calling the client. However, this implementation causes some issues during orchestration with other agents. For instance, the group chat manager is typically a language model without any vision capabilities, so it would not see the image when deciding which agent to route the message to.

There are a few different options to resolve this issue:

  1. Change the ConversableAgent completely by adding multimodal image processing functions. So, if the model inside llm_config is a multimodal model, it will perform multimodal image processing; otherwise, it will function as it does today.
  2. As in the current PR, include a vision capability when needed. In this case, the vision capability (connected to GPT-4V) will transcribe the image into a caption, and then the conversable agent has the context about what's in the image.
  3. Explore other methods (such as multi-inheritance, which is not preferred).
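
A rough sketch of option 2 in the group-chat scenario described above (agent names and constructor arguments are illustrative, not taken from this PR's tests):

```python
# Sketch: keep the group chat manager text-only, but let a vision capability
# transcribe images into captions before the manager reasons about routing.
from autogen import ConversableAgent, GroupChat, GroupChatManager
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

text_llm_config = {"config_list": [{"model": "gpt-4", "api_key": "..."}]}
gpt4v_llm_config = {"config_list": [{"model": "gpt-4-vision-preview", "api_key": "..."}]}

coder = ConversableAgent(name="coder", llm_config=text_llm_config)
commenter = ConversableAgent(name="commenter", llm_config=text_llm_config)

groupchat = GroupChat(agents=[coder, commenter], messages=[], max_round=6)
manager = GroupChatManager(groupchat=groupchat, llm_config=text_llm_config)

# Without this, the text-only manager never sees what is inside an image
# when deciding which agent should speak next.
VisionCapability(lmm_config=gpt4v_llm_config).add_to_agent(manager)
```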

@BeibinLi
Collaborator Author

BeibinLi commented Mar 10, 2024

Questions:

(1) If the user already has access to a vision model, why not just use it directly in the llm_config of the conversable agent, instead of adding a capability and doing twice as much inference?

(2) In the notebook, the figure creator agent is clearly using the nested chat pattern; can you use register_nested_chats instead of registering a reply function? Let's dog-food our own API more to set examples.

(3) I think we should not subclass AssistantAgent; we should consider it a sealed class. Instead, we should use ConversableAgent directly when the system message is customized. This is for code stability -- we might update AssistantAgent's default system message, and that can have unforeseeable consequences in the subclasses. So in our documentation we should not be subclassing AssistantAgent.

@ekzhu @LittleLittleCloud

  1. See my comments for Ricky above. The main problem is that regular conversable agents (such as a group chat manager or user_proxy_agent) cannot process images, even if we provide a GPT-4V llm_config for them. We need to add image processing utilities to ConversableAgent.

  2. Yes, good point. If you can help me set it up as a nested chat, that would be great.

  3. Do you mean class FigureCreator(AssistantAgent):? If yes, I am happy to change that to ConversableAgent.
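
For item 3, the change being discussed would look roughly like this (the system message below is a hypothetical placeholder, not the notebook's actual prompt):

```python
# Sketch: configure ConversableAgent directly with a custom system message
# instead of subclassing AssistantAgent.
from autogen import ConversableAgent

# Hypothetical system message; the notebook's real prompt is not reproduced here.
FIGURE_CREATOR_SYSTEM_MESSAGE = "You write and refine matplotlib code to create the requested figure."

figure_creator = ConversableAgent(
    name="figure_creator",
    system_message=FIGURE_CREATOR_SYSTEM_MESSAGE,
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "..."}]},
)
```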

@rickyloynd-microsoft
Contributor

Awesome PR!!
From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?
Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?

Good point. Your understanding is correct. I will add more explanations to describe the difference between them.

Currently (before this PR), the MultimodalAgent has special handling for the input message content. It will read the image and format the messages before calling the client. However, this implementation causes some issues during orchestration with other agents. For instance, the group chat manager is typically a language model without any vision capabilities, so it would not see the image when deciding which agent to route the message to.

There are a few different options to resolve this issue:

  1. Change the ConversableAgent completely by adding multimodal image processing functions. So, if the model inside llm_config is a multimodal model, it will perform multimodal image processing; otherwise, it will function as it does today.
  2. As in the current PR, include a vision capability when needed. In this case, the vision capability (connected to GPT-4V) will transcribe the image into a caption, and then the conversable agent has the context about what's in the image.
  3. Explore other methods (such as multi-inheritance, which is not preferred).

Thank you for the explanation.

Do you think that the caption bottleneck could be removed? For instance, instead of providing just a caption for the image, could the LMM inside VisionCapability be prompted to provide a full answer to the user's query? Then that response (instead of a mere caption) would be added to the last received message, with a preface such as "Based on analyzing the image, a possible response would be:". The base agent's LLM would then be able to copy this response, or modify it while taking into account extra information that may be added to the message by RAG, or memories from teachability. If so, this might combine the full power of MultimodalConversableAgent with other capabilities.
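
One way to prototype this idea, assuming the capability exposes its captioning prompt as a constructor argument (the description_prompt name below is an assumption, not a confirmed part of this PR):

```python
# Sketch only: steer the vision model toward answering the user's query rather
# than emitting a bare caption. `description_prompt` is an assumed parameter name.
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

vision = VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "..."}]},
    description_prompt=(
        "Based on analyzing the image and the user's last message, draft a possible "
        "full response (not just a caption) for the receiving agent to refine."
    ),
)
```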

@BeibinLi
Collaborator Author

Awesome PR!!
From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?
Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?

Good point. Your understanding is correct. I will add more explanations to describe the difference between them.
Currently (before this PR), the MultimodalAgent has special handling for the input message content. It will read the image and format the messages before calling the client. However, this implementation causes some issues during orchestration with other agents. For instance, the group chat manager is typically a language model without any vision capabilities, so it would not see the image when deciding which agent to route the message to.
There are a few different options to resolve this issue:

  1. Change the ConversableAgent completely by adding multimodal image processing functions. So, if the model inside llm_config is a multimodal model, it will perform multimodal image processing; otherwise, it will function as it does today.
  2. As in the current PR, include a vision capability when needed. In this case, the vision capability (connected to GPT-4V) will transcribe the image into a caption, and then the conversable agent has the context about what's in the image.
  3. Explore other methods (such as multi-inheritance, which is not preferred).

Thank you for the explanation.

Do you think that the caption bottleneck could be removed? For instance, instead of providing just a caption for the image, could the LMM inside VisionCapability be prompted to provide a full answer to the user's query? Then that response (instead of a mere caption) would be added to the last received message, with a preface such as "Based on analyzing the image, a possible response would be:". The base agent's LLM would then be able to copy this response, or modify it while taking into account extra information that may be added to the message by RAG, or memories from teachability. If so, this might combine the full power of MultimodalConversableAgent with other capabilities.

Yes, good idea.

  1. I think we can add this feature with process_last_received_message rather than process_all_messages_before_reply, because the question might be related to earlier conversations. What do you think? I can proceed and change the implementation.
  2. If we answer the question with the vision capability, why not use a multimodal agent directly? In that case, adding multimodal support directly to ConversableAgent would be better for this task; I think that design would be easier for users to understand.

I was reluctant to change ConversableAgent directly last year because the vision capabilities were still "experimental" at that time, so we created a brand-new MultimodalConversableAgent. Since (1) more and more users are using vision, (2) patching in a vision capability is neither elegant nor comprehensive, and (3) the conversable agent has already been changed several times in the past few months, I think adding vision directly into ConversableAgent might make more sense. What do you think, @sonichi?
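
For readers unfamiliar with the two hook points mentioned in item 1, here is a rough sketch of how a capability can register them on an agent (the helper functions are stand-ins, not this PR's implementation):

```python
from autogen import ConversableAgent

def caption_images_in_last_message(message):
    # Stand-in: would replace image tags in the last received message with captions.
    return message

def caption_images_in_history(messages):
    # Stand-in: would caption images in every message of the conversation history.
    return messages

agent = ConversableAgent(name="assistant", llm_config=False)

# Hook only the most recent incoming message...
agent.register_hook("process_last_received_message", caption_images_in_last_message)
# ...or hook the full message history before the agent replies, which also
# covers images that appeared earlier in the conversation.
agent.register_hook("process_all_messages_before_reply", caption_images_in_history)
```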

@sonichi
Contributor

sonichi commented Mar 11, 2024

Awesome PR!!
From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent to which the VisionCapability has been added? In the simplest case where the user wants to instantiate just one of these agents (in addition to the user_proxy), when should the user choose one option over the other? Does MultimodalConversableAgent provide certain functionality which a ConversableAgent with VisionCapability would not?
Related to my question, there's this explanation from the top of the new notebook:

There are two distinct ways to use multimodal models in AutoGen:
1. MultimodalAgent, e.g., backed by GPT-4V, which has reasoning and thinking skills. It can interact with other agents the same way as other ConversableAgents.
2. VisionCapability. When an LLM-based agent does not have vision capabilities, we can add a vision capability to it by transcribing an image into a caption.

Then as shown in the notebook, the VisionCapability has access to GPT-4V even though its base agent does not. So maybe the answer to my question is that two different models are (potentially) involved when using the capability path, where the vision model provides the caption for the text model to consume. While MultimodalConversableAgent does it all through a single model. If so, then an agent with VisionCapability might be limited by the bottleneck of the caption?

Good point. Your understanding is correct. I will add more explanations to describe the difference between them.
Currently (before this PR), the MultimodalAgent has special handling for the input message content. It will read the image and format the messages before calling the client. However, this implementation causes some issues during orchestration with other agents. For instance, the group chat manager is typically a language model without any vision capabilities, so it would not see the image when deciding which agent to route the message to.
There are a few different options to resolve this issue:

  1. Change the ConversableAgent completely by adding multimodal image processing functions. So, if the model inside llm_config is a multimodal model, it will perform multimodal image processing; otherwise, it will function as it does today.
  2. As in the current PR, include a vision capability when needed. In this case, the vision capability (connected to GPT-4V) will transcribe the image into a caption, and then the conversable agent has the context about what's in the image.
  3. Explore other methods (such as multi-inheritance, which is not preferred).

Thank you for the explanation.
Do you think that the caption bottleneck could be removed? For instance, instead of providing just a caption for the image, could the LMM inside VisionCapability be prompted to provide a full answer to the user's query? Then that response (instead of a mere caption) would be added to the last received message, with a preface such as "Based on analyzing the image, a possible response would be:". The base agent's LLM would then be able to copy this response, or modify it while taking into account extra information that may be added to the message by RAG, or memories from teachability. If so, this might combine the full power of MultimodalConversableAgent with other capabilities.

Yes, good idea.

  1. I think we can add this feature with process_last_received_message rather than process_all_messages_before_reply, because the question might be related to earlier conversations. What do you think? I can proceed and change the implementation.
  2. If we answer the question with the vision capability, why not use a multimodal agent directly? In that case, adding multimodal support directly to ConversableAgent would be better for this task; I think that design would be easier for users to understand.

I was reluctant to change ConversableAgent directly last year because the vision capabilities were still "experimental" at that time, so we created a brand-new MultimodalConversableAgent. Since (1) more and more users are using vision, (2) patching in a vision capability is neither elegant nor comprehensive, and (3) the conversable agent has already been changed several times in the past few months, I think adding vision directly into ConversableAgent might make more sense. What do you think, @sonichi?

Not sure what you mean by "adding vision directly". Do you refer to this PR or a new proposal?

@BeibinLi
Collaborator Author

BeibinLi commented Mar 13, 2024

@rickyloynd-microsoft @ekzhu @sonichi I have addressed all issues and concerns for this PR. Please take a look.

Here are two unaddressed comments, which will be handled in future PRs.

  1. Nested chat in the notebook: I will include nested chat in a separate PR, because it is unrelated to the vision capability.
  2. Answering the question directly in VisionCapability: I decided not to include this feature in VisionCapability for design conciseness. This can instead be handled by a multimodal conversable agent; see [Major Update 1] in the summary [Roadmap] Multimodal Orchestration #1975.

@BeibinLi
Collaborator Author

BeibinLi commented Mar 14, 2024

@afourney
Thanks for your suggestion!

Regarding your first comment, I have a prototype in PR #2013. It is still a work in progress, and I will make it ready for review soon.
Regarding the system message, good idea, and I have updated it!

@BeibinLi
Collaborator Author

@sonichi Updated~

@sonichi
Contributor

sonichi commented Mar 14, 2024

@sonichi Updated~

I meant making the PR from a branch in the upstream repo, as opposed to a forked repo, because we now use pull_request as the trigger for the openai workflows.

@BeibinLi
Collaborator Author

Closing this PR, and moving to #2025 for testing purposes.
