[Feature] Adds Image Generation Capability 2.0 #1907
Merged
Conversation
Codecov Report

@@            Coverage Diff             @@
##              main    #1907       +/- ##
===========================================
+ Coverage    37.53%   60.87%   +23.33%
===========================================
  Files           65       66        +1
  Lines         6913     7013      +100
  Branches      1521     1660      +139
===========================================
+ Hits          2595     4269     +1674
+ Misses        4092     2357     -1735
- Partials       226      387      +161
ekzhu approved these changes on Mar 15, 2024
whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request on Apr 17, 2024
* adds image generation capability
* add todo
* readded cache
* wip
* fix content str bugs
* removed todo: delete imshow
* wip
* fix circular imports
* add notebook
* improve prompt
* improved text analyzer + notebook
* notebook update
* improve notebook
* smaller notebook size
* made changes to the wrong branch :(
* resolve comments + 1
* adds doc strings
* adds cache doc string
* adds doc string to add_to_agent
* adds doc string to ImageGeneration
* instructions are not configurable
* removed unnecessary imports
* changed doc string location
* more doc strings
* improves testability
* adds tests
* adds cache test
* added test to github workflow
* compatible llm config format
* configurable reply function position
* skip_openai + better comments
* fix test
* fix test?
* please fix test?
* last fix test?
* remove type hint
* skip cache test
* adds mock api key
* dalle-2 test
* fix dalle config
* use apu key function

---------

Co-authored-by: Chi Wang <[email protected]>
Labels
- models: Pertains to using alternate, non-GPT models (e.g., local models, llama, etc.)
- multimodal: language + vision, speech, etc.
@BeibinLi @rickyloynd-microsoft @ekzhu I created this PR because the other PR's (#1874) branch was based on my fork, which won't allow me to run the OpenAI tests. Closing #1874.
Why are these changes needed?
This is a proof of concept for using agent capabilities as a vehicle for multimodal communication. I found it difficult to extend agents to support multimodal interactions without undertaking extensive refactoring.
Instead, I went with a modular approach, treating different modalities as distinct agent capabilities. This strategy streamlines the integration of multimodal functions and enhances the versatility of "simple" agents with minimal adjustments to the existing architecture.
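The capability pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual AutoGen API: the `Agent` stand-in, `register_reply`, and `EchoCapability` are hypothetical names invented for this sketch; only the `add_to_agent` idea comes from the PR.

```python
from abc import ABC, abstractmethod

class Agent:
    """Stand-in for a conversational agent that holds a list of reply functions."""
    def __init__(self, name: str):
        self.name = name
        self.reply_funcs = []

    def register_reply(self, func):
        # A capability injects new behavior by registering a reply function.
        self.reply_funcs.append(func)

class AgentCapability(ABC):
    """A modality (image generation, speech, ...) packaged as a capability
    that can be attached to any existing agent."""
    @abstractmethod
    def add_to_agent(self, agent: Agent) -> None:
        ...

class EchoCapability(AgentCapability):
    """Toy capability: replies by echoing the incoming message."""
    def add_to_agent(self, agent: Agent) -> None:
        agent.register_reply(lambda msg: f"echo: {msg}")

agent = Agent("assistant")
EchoCapability().add_to_agent(agent)
print(agent.reply_funcs[0]("hello"))  # -> echo: hello
```

The point of the design is that the agent itself stays untouched; each modality lives in its own capability object and wires itself in via `add_to_agent`.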
For this PR, I experimented with image generation, since I've seen quite a bit of great work already done by @BeibinLi. The idea is that the user can add the ability to generate images to any of their existing agents. I architected this code around an abstract class called ImageGenerator, which the user can implement against their favorite API provider (there's an example for DALL-E, DalleImageGenerator). All the user has to do now is pass the generator they like to ImageGeneration (the agent's capability to generate images) and add the capability to the agent.

The way ImageGeneration works is by adding a custom reply function that checks "Did I receive a message asking me to generate an image? If so, what is the prompt?" and generates the image accordingly.

The design laid out by this image generation capability will allow for future extensions.
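The ImageGenerator / ImageGeneration split described above can be sketched like this. The class names mirror the PR, but the method names, the keyword-based request detection, and the fake DALL-E generator are simplified assumptions for illustration; the real code uses a TextAnalyzerAgent (an extra LLM call) to detect image requests and calls an actual image API.

```python
from abc import ABC, abstractmethod

class ImageGenerator(ABC):
    """Abstract base: implement this against your favorite image API provider."""
    @abstractmethod
    def generate_image(self, prompt: str) -> bytes:
        ...

class FakeDalleImageGenerator(ImageGenerator):
    """Stand-in for DalleImageGenerator; returns placeholder bytes
    instead of calling a real image API."""
    def generate_image(self, prompt: str) -> bytes:
        return f"<image for: {prompt}>".encode()

class ImageGeneration:
    """Capability: contributes a custom reply that checks whether the
    incoming message asks for an image and, if so, extracts the prompt."""
    PREFIX = "generate an image of"

    def __init__(self, image_generator: ImageGenerator):
        self.image_generator = image_generator

    def _asked_for_image(self, message: str) -> bool:
        # The PR delegates this check to a TextAnalyzerAgent;
        # a simple keyword match stands in here.
        return message.lower().startswith(self.PREFIX)

    def reply(self, message: str):
        if self._asked_for_image(message):
            prompt = message[len(self.PREFIX):].strip()
            return self.image_generator.generate_image(prompt)
        return None  # fall through to the agent's other reply functions

cap = ImageGeneration(FakeDalleImageGenerator())
print(cap.reply("Generate an image of a red panda"))  # -> b'<image for: a red panda>'
```

Swapping FakeDalleImageGenerator for a provider-backed implementation is the only change needed to use a different image API; the capability and the agent stay the same.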
Design-wise it does have some downfalls:
- Detecting image requests relies on TextAnalyzerAgent, so more LLM calls and more token usage.

I wrote a quick script, test_image_generation.py, to test out the functionality (planning to remove it and add a notebook instead).

image_gen_poc.mp4
Related issue number
Checks