🚨 Unify 3D position ids#43972
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: ernie4_5_vl_moe, qwen2_5_vl, qwen2_vl

This comment contains models: ["models/ernie4_5_vl_moe", "models/qwen2_5_vl", "models/qwen2_vl"]
CI Results / Commit Info
Model CI Report: ❌ 11 new failed tests from this PR 😭
run-slow: glm46v, glm4v, glm4v_moe, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

This comment contains models: ["models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe"]
run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe"]
The slow CI takes really long now; it took the whole work-day for 13 models 🙃
run-slow: qwen3_vl, glm_image, qwen2_5_vl, glm4v, ernie_4_5vl_moe

This comment contains models: ["models/glm4v", "models/glm_image", "models/qwen2_5_vl", "models/qwen3_vl"]
…ing start-end tokens
run-slow: glm4v

This comment contains models: ["models/glm4v"]
CI Results / Commit Info
Model CI Report: ❌ 33 new failed tests from this PR 😭
Retriggering later, CI ran on a bad commit!

No worries, should I still take a look?

Yes please, the rest is to adjust processor tests and re-run slow CI
vasqu
left a comment
Overall LGTM, just added a few smaller questions. Something we could further improve is not iterating over each sample within the batch, but don't worry, that's probably better suited for a future PR that builds on this.
(trusting that the tests will be fixed :D)
```python
}

for batch_idx, current_input_ids in enumerate(input_ids):
    input_token_type = mm_token_type_ids[batch_idx]
```
Now that we require `mm_token_type_ids`, I suspect this might be breaking for a lot of outside users? Unsure how many really use our internals here tbh, but I would still mark it with 🚨. How does vLLM handle this? Do they rely on our processors, or do they provide the position ids on their own?
Good point, vLLM indeed does call `get_rope_index`. I will coordinate with Harry on that.
```python
    2: iter(video_grid_thw) if video_grid_thw is not None else None,
}

for batch_idx, current_input_ids in enumerate(input_ids):
```
I think this can also be improved, although it's not super easy. We would need to track offsets per batch sequence somehow and remove them from the found groups respectively. (tl;dr: one long flattened sequence to iterate over instead of per batch sample)
If you have the motivation, it would be nice, but it's not essential to this PR. I'd rather focus on the unification first.
Hm, I am not sure that would be justified since we usually get relatively small batch sizes, and iterating over each shouldn't take much time.
I will see what we can do after merging the PR, but I'll put it at low priority for now.
Yeah, no worries, just a thought on improvement since it could add up when we want to improve big-batch inference.
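The flattening idea discussed above could look roughly like this. Everything here (function names, the index-mapping helper) is a hypothetical sketch of the suggested improvement, not code from the PR:

```python
import torch

def flatten_batch(input_ids):
    # Hypothetical sketch: concatenate the padded batch into one long sequence
    # and keep each sample's start offset, so vision-token groups found in the
    # flat sequence can be mapped back to their sample without a per-sample
    # Python loop.
    batch_size, seq_len = input_ids.shape
    flat = input_ids.reshape(-1)
    offsets = torch.arange(batch_size) * seq_len
    return flat, offsets

def unflatten_index(flat_idx, seq_len):
    # Map a position in the flat sequence back to (batch_idx, local_position).
    return flat_idx // seq_len, flat_idx % seq_len
```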
```python
model_kwargs.get("image_grid_thw") is not None or model_kwargs.get("video_grid_thw") is not None
if (
    is_input_ids
    and model_kwargs.get("mm_token_type_ids") is not None
```
Do we really want to be that strict? We could also init with zeros otherwise? Though I'm unsure whether that would bring more silent issues instead.
I think without `mm_token_type_ids` we will end up with garbage positions in any case. I added this line so we don't have to check inside `get_rope_index` and early return.
This way, we fall back to basic incremental position ids from the generate utils, and we don't have to duplicate that code inside `get_rope_index`.
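As a rough illustration of that fallback (a hedged sketch only; the real logic lives in the generation utilities, and the function name here is made up):

```python
import torch

def fallback_position_ids(input_ids):
    # Hypothetical sketch: without `mm_token_type_ids` we can't tell which
    # tokens are vision tokens, so 3D rope positions would be garbage anyway.
    # Plain incremental ids, replicated across the 3 rope axes, are used instead.
    batch_size, seq_len = input_ids.shape
    positions = torch.arange(seq_len).unsqueeze(0).expand(batch_size, seq_len)
    return positions.unsqueeze(0).expand(3, batch_size, seq_len)
```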
```python
# GLM4V splits video into segments per frame but there's only one `grid_thw`
# per whole video. We can't exhaust the iterator and have to re-use the grid
# while processing the same video!
if modality_type == 2:
    if video_group_index == 0:
        grid_thw = next(grid_iters[modality_type])
    video_group_index += 1
    video_group_index = 0 if video_group_index >= grid_thw[0] else video_group_index
else:
    grid_thw = next(grid_iters[modality_type])
```
This feels like an opportunity for a helper function? So much is the same; it's a shame if we could not use modular a bit more.
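One possible shape for such a helper (a sketch only; the name and signature are invented, not what the PR ends up using):

```python
def next_grid_thw(grid_iters, modality_type, grid_thw, video_group_index):
    # Hypothetical helper folding the shared iterator logic into one place.
    # For videos (modality_type == 2) the same `grid_thw` is re-used until all
    # of its frames (grid_thw[0]) have been consumed; other modalities simply
    # advance their iterator.
    if modality_type != 2:
        return next(grid_iters[modality_type]), video_group_index
    if video_group_index == 0:
        grid_thw = next(grid_iters[modality_type])
    video_group_index += 1
    if video_group_index >= grid_thw[0]:
        video_group_index = 0
    return grid_thw, video_group_index
```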
run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe, video_llama_3

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe", "models/video_llama_3"]
CI Results / Commit Info
The test failure analysis could not be completed. Please check the workflow run for details.

Same tests failing as in

[For maintainers] Suggested jobs to run (before merge) run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe, video_llama_3
This PR is introducing a regression for video inputs for Qwen models, see #44479
#1120

## Summary
This PR fixes the `lce_forward` function for VL models, adding support for the optional `mm_token_type_ids` parameter related to multimodal processing:
- `glm4v`
- `glm4v_moe`
- `qwen2_vl`
- `qwen2_5_vl`
- `qwen3_vl`
- `qwen3_vl_moe`

Fix #1117. This fixes a ValueError in `model.generate()` with transformers > 5.2.0, after they merged:
- huggingface/transformers#43972

See related issues downstream in TRL:
- huggingface/trl#5216
- huggingface/trl#5201

## Details
Multimodal token type support:
* Added the optional `mm_token_type_ids` argument (of type `torch.IntTensor`) to the signature of `lce_forward`, allowing the specification of multimodal token type IDs.
* Passed the `mm_token_type_ids` argument to the underlying model call, ensuring it is incorporated into the forward computation.

## Testing Done
We checked that this fix solves the issue downstream in TRL.
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
What does this PR do?
Following Ernie, we build 3D positions based on `mm_token_type_ids`, and the models will return them by default from the processor. We have a unified `get_vision_position` in the qwen2-vl model file; all other models just copy it from there. The utility builds vision ids, as the name suggests, and the models are free to manipulate on top as they wish. In most cases, the only thing that changes is the presence of new modalities or kwargs.
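In spirit, the utility enumerates one position id per vision token along the temporal, height, and width axes of the patch grid. A minimal sketch, assuming a single `grid_thw` and ignoring the spatial-merge and text-interleaving details that the real `get_vision_position` handles (the function name below is illustrative only):

```python
import torch

def vision_position_sketch(grid_thw):
    # Hypothetical sketch: enumerate (t, h, w) coordinates for every patch in
    # a single image/video grid, stacked into a (3, num_patches) tensor.
    t, h, w = grid_thw
    t_ids = torch.arange(t).view(t, 1, 1).expand(t, h, w).reshape(-1)
    h_ids = torch.arange(h).view(1, h, 1).expand(t, h, w).reshape(-1)
    w_ids = torch.arange(w).view(1, 1, w).expand(t, h, w).reshape(-1)
    return torch.stack([t_ids, h_ids, w_ids])  # shape (3, t * h * w)
```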