[Outdate] Add vision capability #1926
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #1926 +/- ##
===========================================
+ Coverage 37.53% 48.18% +10.64%
===========================================
Files 65 66 +1
Lines 6913 6963 +50
Branches 1521 1656 +135
===========================================
+ Hits 2595 3355 +760
+ Misses 4092 3326 -766
- Partials 226 282 +56
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Very cool!!
Looks good 👍
Awesome PR!! From the user's point of view, what are the differences in behavior between a MultimodalConversableAgent and a ConversableAgent with this vision capability added? Related to my question, there's this explanation from the top of the new notebook:
Then, as shown in the notebook, the …
Questions:
1. If the user already has access to a vision model, why not just use the vision model directly in the llm_config of the conversable agent, instead of adding a capability and doing twice as much inference? (See the sketch below.)
2. In the notebook, the figure creator agent is clearly using the nested chat pattern; can you use the … ?
3. I think we should not subclass …
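To make question (1) concrete, here is a rough sketch of the two setups being contrasted. The capability-side import path, class name, and constructor arguments are assumptions about this PR's API, not something confirmed by the diff; the multimodal agent import is the existing one.

```python
import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent
# Assumed location/name of the capability added in this PR; may differ.
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

gpt4v_config = {"config_list": [{"model": "gpt-4-vision-preview", "api_key": "sk-..."}]}
gpt4_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}

# Setup A: put the vision model directly in llm_config, so the agent itself is
# multimodal and every reply is generated by the vision model.
direct_agent = MultimodalConversableAgent(
    name="image-explainer",
    llm_config=gpt4v_config,
)

# Setup B: keep a text-only agent and attach the vision capability, which calls
# the vision model once per image (e.g. to caption it) before the text-only
# model produces the reply.
capable_agent = autogen.ConversableAgent(
    name="assistant",
    llm_config=gpt4_config,
)
VisionCapability(lmm_config=gpt4v_config).add_to_agent(capable_agent)
```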
Good point. Your understanding is correct. I will add more explanations to describe the difference between them. Currently (before this PR), the MultimodalAgent has special handling for the input message content: it reads the image and formats the message content for the multimodal model. There are a few different options to resolve this issue.
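For concreteness, here is a minimal illustrative sketch of the kind of formatting this refers to, assuming the `<img URL>` tag convention used in the multimodal examples. This is not the library's actual implementation, and the function name is made up.

```python
import re

def format_image_message(prompt: str) -> list:
    """Turn a prompt containing <img URL> tags into the OpenAI-style multimodal
    content list that a vision model expects (text and image_url entries)."""
    content, last = [], 0
    for match in re.finditer(r"<img ([^>]+)>", prompt):
        text = prompt[last:match.start()]
        if text.strip():
            content.append({"type": "text", "text": text})
        content.append({"type": "image_url", "image_url": {"url": match.group(1)}})
        last = match.end()
    tail = prompt[last:]
    if tail.strip():
        content.append({"type": "text", "text": tail})
    return content

# Example: one text chunk, one image entry, one trailing text chunk.
print(format_image_message("Describe this chart <img https://example.com/chart.png> in one sentence."))
```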
Thank you for the explanation. Do you think that the caption bottleneck could be removed? For instance, instead of providing just a caption for the image, could the LMM inside … ?
Yes, good idea.
I felt reluctant to change …
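One possible direction for removing the caption bottleneck, sketched here purely as an idea: let the capability accept a user-supplied hook that queries the LMM about the image in the context of the conversation, instead of producing a generic caption. The function name, its signature, and the `custom_caption_func` parameter are hypothetical, not something this PR defines.

```python
def describe_image_in_context(image_url: str, surrounding_text: str) -> str:
    """Hypothetical hook: ask the LMM a conversation-aware question about the
    image rather than requesting a generic caption."""
    prompt = (
        f"The user asked: {surrounding_text!r}. "
        "Describe only the parts of this image that are relevant to that request."
    )
    # ... send (prompt, image_url) to the LMM client and return its answer ...
    return "<LMM answer about the image>"

# A vision capability could then accept such a hook at construction time, e.g.:
# VisionCapability(lmm_config=gpt4v_config, custom_caption_func=describe_image_in_context)
```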
Not sure what you mean by "adding vision directly". Do you refer to this PR or a new proposal?
@rickyloynd-microsoft @ekzhu @sonichi I have addressed all issues and concerns for this PR. Please take a look. Here are two unaddressed comments, which will be handled in future PRs.
@sonichi Updated~
I meant to make the PR from a branch in the upstream repo, as opposed to a forked repo, because we use pull_request as the trigger now for the openai workflows.
Closing this PR, and moving to #2025 for testing purposes.
This PR has been moved to #2025.
We want to have a "vision capability" that can be added to conversable agents even if these agents are not connected to multimodal models.
See a feature overview in Issue #1975
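A minimal sketch of the intended usage, under the assumption that the capability ends up as a `VisionCapability` class that can be attached to a text-only agent; the import path, class name, and constructor arguments are assumptions and may differ from the merged version.

```python
import autogen
# Assumed module path/class name for the capability proposed here; may differ.
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

# A conversable agent backed by a text-only model (no native image support).
assistant = autogen.ConversableAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]},
)

# The capability uses its own LMM config to turn images into text (e.g. captions)
# that the text-only model can reason over.
VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "sk-..."}]},
).add_to_agent(assistant)

user = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)
user.initiate_chat(
    assistant,
    message="What is shown in this figure? <img https://example.com/figure.png>",
)
```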
Why are these changes needed?
Related issue number
Checks