feat: Initial config and skeleton for llama 3.2 Vision #2472
base: master
Conversation
Summary of Changes (Gemini Code Assist): This pull request lays the essential groundwork for integrating Llama 3.2 Vision models into Keras Hub. It introduces the foundational configuration class.
Code Review
This pull request introduces the initial skeleton for the Llama 3.2 Vision model, including the configuration and backbone classes. The changes establish the basic structure, but there are critical issues with model serialization that need to be addressed. Specifically, the get_config methods in both new classes do not correctly handle serialization of configuration objects, which would prevent saving and loading models. I've provided suggestions to fix the serialization logic in Llama3VisionBackbone and pointed out a similar issue in Llama3VisionConfig. I also noted some minor formatting issues.
The snippet under review (the file ends without a trailing newline):

```python
def get_config(self):
    return {"config": self.config}
```
The get_config method is returning the self.config object directly. This object is not serializable, which will cause errors when trying to save the model (e.g., with model.save()). The get_config method must return a JSON-serializable dictionary.
To fix this and align with Keras serialization patterns, you should:
- In `get_config`, call `super().get_config()` to include base properties like `name` and `trainable`, as recommended by the style guide.
- Serialize the `config` object by calling its `get_config()` method.
- Implement the `from_config` classmethod to correctly reconstruct the backbone from the serialized configuration. This is necessary because the default `from_config` from `Backbone` won't know how to handle the nested config object.
Additionally, the file is missing a final newline. Please run the code formatter to fix this.
Suggested change:

```python
def get_config(self):
    config = super().get_config()
    config["config"] = self.config.get_config()
    return config

@classmethod
def from_config(cls, config):
    from .llama3_vision_config import Llama3VisionConfig

    config_data = config.pop("config")
    vision_config = Llama3VisionConfig(**config_data)
    return cls(config=vision_config, **config)
```
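The round-trip the review asks for can be exercised outside Keras with a minimal stand-in. The class and field names below are illustrative assumptions, not the PR's actual implementation, and plain `json` stands in for `model.save()`; the point is only that `get_config()` must emit primitives that `from_config()` can re-inflate.

```python
import json

# Hypothetical stand-in for the PR's config class (names and fields
# are illustrative). get_config() returns only JSON-serializable
# values, never raw objects, so the dict survives a JSON round-trip.
class Llama3VisionConfigSketch:
    def __init__(self, vision_projection_dim=4096, cross_attention_layers=None):
        self.vision_projection_dim = vision_projection_dim
        self.cross_attention_layers = cross_attention_layers or [3, 8, 13]

    def get_config(self):
        return {
            "vision_projection_dim": self.vision_projection_dim,
            "cross_attention_layers": self.cross_attention_layers,
        }

    @classmethod
    def from_config(cls, config):
        # Reconstruct the object from the serialized dictionary.
        return cls(**config)


cfg = Llama3VisionConfigSketch()
blob = json.dumps(cfg.get_config())  # succeeds: dict of primitives
restored = Llama3VisionConfigSketch.from_config(json.loads(blob))
assert restored.cross_attention_layers == [3, 8, 13]
```

Returning `self.config` directly, by contrast, would make the `json.dumps` step raise a `TypeError`, which is exactly the failure mode the review describes for `model.save()`.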
References
- The `get_config()` method should chain to its superclass to ensure base properties are preserved, as outlined in the layer implementation guidelines. (link)
- All components must be serializable to support saving and loading, which is a core part of the validation process. (link)
- Code should be formatted with `ruff`. A missing final newline is a formatting issue that the tool would fix. (link)
The snippet under review (the file ends without a trailing newline):

```python
config.update(
    {
        "vision_encoder_config": self.vision_encoder_config,
        "vision_projection_dim": self.vision_projection_dim,
        "cross_attention_layers": self.cross_attention_layers,
    }
)
return config
```
The get_config method directly includes self.vision_encoder_config. According to the docstring, this can be a config instance, which is not JSON-serializable. This will break serialization when Llama3VisionBackbone.get_config() is called. To ensure the entire configuration is serializable, you should convert self.vision_encoder_config to a dictionary if it's an object, for instance by calling a .get_config() method on it. This is crucial for model saving and loading.
Also, the file is missing a final newline. Please run the code formatter to add it.
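The nested-config fix the comment describes can be sketched in isolation. Everything below is an assumption for illustration (class names, fields, and the dict-or-object constructor convention are not the PR's code): the outer `get_config()` flattens the inner config object to a dict, and the constructor re-inflates a dict back into an object.

```python
import json

# Illustrative inner config object (hypothetical fields).
class VisionEncoderConfig:
    def __init__(self, hidden_dim=1280, num_layers=32):
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

    def get_config(self):
        return {"hidden_dim": self.hidden_dim, "num_layers": self.num_layers}


# Hypothetical outer config holding a nested config object.
class Llama3VisionConfigSketch:
    def __init__(self, vision_encoder_config=None):
        # Accept either a config object or an already-serialized dict,
        # so reconstruction from get_config() output just works.
        if isinstance(vision_encoder_config, dict):
            vision_encoder_config = VisionEncoderConfig(**vision_encoder_config)
        self.vision_encoder_config = vision_encoder_config

    def get_config(self):
        inner = self.vision_encoder_config
        return {
            # Serialize the nested object instead of embedding it raw.
            "vision_encoder_config": inner.get_config() if inner else None,
        }


cfg = Llama3VisionConfigSketch(VisionEncoderConfig(num_layers=8))
blob = json.dumps(cfg.get_config())  # JSON round-trip now succeeds
restored = Llama3VisionConfigSketch(**json.loads(blob))
assert restored.vision_encoder_config.num_layers == 8
```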
Description
This PR lays the groundwork for supporting Llama 3.2 Vision (11B/90B) in Keras Hub (relates to issue #2470).
It introduces:
- `Llama3VisionConfig`: extends the text backbone config to support `vision_encoder_config` and `cross_attention_layers`.
- `Llama3VisionBackbone`: a skeleton class to establish the API surface for the multimodal architecture.

Design Decisions
- Extends `Llama3BackboneConfig` to reuse the existing RoPE and Transformer settings.
- Adds a `cross_attention_layers` list to the config. This allows us to handle the specific "interleaved" architecture of Llama 3.2 (where vision injection happens only at specific depth indices) flexibly.

Next Steps
- Implement the `Llama3VisionEncoder` class.
- Implement the `GatedCrossAttention` block.
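For context on the planned gated cross-attention step, here is a rough single-head NumPy sketch of the gating idea used in interleaved vision-language architectures. This is a hedged illustration, not the PR's `GatedCrossAttention` implementation: a tanh gate initialized at zero makes each cross-attention block start as an identity over the text stream, so inserting it at the chosen depth indices does not perturb the pretrained text model at initialization.

```python
import numpy as np

def gated_cross_attention(hidden, vision, wq, wk, wv, gate):
    """Single-head tanh-gated cross-attention (illustrative only).

    hidden: (seq_len, dim) text hidden states (queries)
    vision: (vis_len, dim) vision encoder outputs (keys/values)
    gate:   scalar; tanh(0) == 0, so the block is an identity at init
    """
    q, k, v = hidden @ wq, vision @ wk, vision @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the vision tokens.
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    attended = weights @ v
    return hidden + np.tanh(gate) * attended


rng = np.random.default_rng(0)
dim = 16
hidden = rng.normal(size=(4, dim))
vision = rng.normal(size=(6, dim))
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))

out = gated_cross_attention(hidden, vision, wq, wk, wv, gate=0.0)
assert np.allclose(out, hidden)  # zero gate: text stream untouched
```

A learned nonzero gate gradually lets vision features flow into the text stream, which is why the interleaved placement recorded in `cross_attention_layers` can be trained without destabilizing the base model.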