[Model] Enable LoRA support for tower and connector in H2OVL #31696
jeejeelee merged 2 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--31696.org.readthedocs.build/en/31696/
Code Review
This pull request enables LoRA support for the vision tower and connector in H2OVL by implementing the necessary methods in H2OVLProcessingInfo. The changes look mostly correct, but there is a critical issue in get_mm_mapping where an incorrect name is used for the connector module. This will prevent LoRA from being applied to the connector. I've left a specific comment with a suggested fix.
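For context on what the reviewer is pointing at, here is a schematic sketch of a multimodal module mapping. This is not the real vLLM `MultiModelKeys` API; a plain dict stands in, and the attribute names below (`vision_model`, `mlp1`, `language_model`) are assumed from the InternVL-style layout H2OVL inherits:

```python
# Hypothetical sketch of the module mapping get_mm_mapping() expresses.
# The crucial point the review raises: each value must match the actual
# attribute name of the submodule on the model, or LoRA weights are never
# routed to that module.
def get_mm_mapping() -> dict[str, str]:
    return {
        "language_model": "language_model",  # LLM backbone
        "connector": "mlp1",                 # MLP projector (InternVL-style name, assumed)
        "tower_model": "vision_model",       # InternViT vision encoder
    }
```

The suggested fix amounts to correcting the connector entry so it names the module as it actually appears on the model object.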
force-pushed e1cc743 to 283d1d6
force-pushed 283d1d6 to 205da08
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
force-pushed 205da08 to 31e388c
force-pushed 31e388c to 626c2ab
force-pushed 626c2ab to 8ded82a
force-pushed 8ded82a to 75f21ad
Sorry, I mentioned the wrong issue number. Please ignore it.
force-pushed 75f21ad to 9d8bc39
Cursor Bugbot has reviewed your changes and found 1 potential issue.
@jeejeelee can you take a look?
@jeejeelee / @DarkLight1337 can someone review this?
force-pushed 86578ba to 1a699f3
@shwetha-s-poojary With the current implementation, H2OVL fails to initialize properly when LoRA is enabled. I suspect this is because the vision encoder tokens and the visual tokens in the language model do not have a 1:1 mapping for this model.
Thanks for catching this. I am working on fixing the token count mismatch for H2OVL. |
Sure, you can use the following code to reproduce the issue:
@linitra24 Hi all, apologies for the slow reply. Feel free to have someone else pick this up if needed; I may not respond immediately.
force-pushed 08bf532 to 03a9bab
Hi @shwetha-s-poojary, apologies for the delayed response. Given that InternVL (#32397) has already added LoRA support and implemented the
force-pushed 03a9bab to 4ba14b6
@linitra24 I've followed the same approach and updated the PR. PTAL.
force-pushed 33751f7 to 167f2ea
Implement get_num_mm_encoder_tokens() and get_num_mm_connector_tokens() in H2OVLProcessingInfo. Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>
force-pushed 167f2ea to 55d8710
Summary
Implemented get_num_mm_encoder_tokens() and get_num_mm_connector_tokens() in H2OVLChatModel to enable LoRA support for the vision tower and connector. The H2OVL connector uses a 1:1 token mapping with the encoder, so both methods return the same count as the input image tokens.
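Concretely, the 1:1 mapping means both hooks just return the image-token count unchanged. A minimal sketch (method names from the PR; the signatures are simplified and assumed, since the real methods live on the processing-info class and take processor/config arguments):

```python
# Simplified sketch of the two hooks this PR adds. The per-image token
# count is passed in directly here for illustration.
def get_num_mm_encoder_tokens(num_image_tokens: int) -> int:
    # The InternViT encoder emits exactly the tokens the LLM consumes.
    return num_image_tokens

def get_num_mm_connector_tokens(num_image_tokens: int) -> int:
    # The MLP connector does not merge or drop tokens (1:1 mapping).
    return num_image_tokens
```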
Related to: #31479
Technical details
H2OVL-Mississippi uses a ViT-MLP-LLM architecture with InternViT-300M as the vision encoder. Images are split into 448×448 tiles (1–6 per image), producing 256–1,590 visual tokens. Pixel shuffling reduces each tile to 256 tokens, and MSAC adjusts the number of tiles at multiple scales. The MLP connector does not reduce token count, and a resized 448×448 thumbnail is included for full-image context.
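The tile-to-token arithmetic above can be sketched as follows. This ignores MSAC's multi-scale adjustment (which is why the 256–1,590 range quoted above differs from the raw arithmetic), and the helper name is made up for illustration:

```python
TOKENS_PER_TILE = 256  # each 448x448 tile after pixel shuffling

def visual_tokens(num_tiles: int, use_thumbnail: bool = True) -> int:
    """Approximate visual-token count for one image: 1-6 tiles plus an
    optional 448x448 thumbnail for full-image context, 256 tokens each."""
    assert 1 <= num_tiles <= 6
    total_tiles = num_tiles + (1 if use_thumbnail else 0)
    return total_tiles * TOKENS_PER_TILE
```

So a single tile without a thumbnail yields 256 tokens, and three tiles plus a thumbnail yield 1,024; MSAC then adjusts the effective tile count across scales.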
Ref: paper
Test
Added a test verifying that encoder tokens are multiples of 256, connector tokens match the encoder, and budget clipping works.
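A standalone sketch of the three properties that test asserts, using a plain helper in place of the real processor (the function name and signature are invented for illustration):

```python
def check_token_counts(encoder_tokens: int, connector_tokens: int,
                       budget: int) -> int:
    """Mirror the test's checks: encoder tokens are tile-aligned (multiples
    of 256), the connector is 1:1 with the encoder, and the final count is
    clipped to the multimodal token budget."""
    assert encoder_tokens % 256 == 0, "encoder tokens must be a multiple of 256"
    assert connector_tokens == encoder_tokens, "connector must match encoder (1:1)"
    return min(encoder_tokens, budget)  # budget clipping
```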
This is my understanding of the model; I'm open to feedback or suggestions for additional tests to further verify its correctness.