🚨 Unify 3D position ids #43972

Merged
zucchini-nlp merged 16 commits into huggingface:main from zucchini-nlp:position-ids-qwen
Feb 24, 2026

Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Feb 13, 2026

What does this PR do?

Following Ernie, we build 3D position ids based on mm_token_type_ids, and the processors will return them by default.

We have a unified get_vision_position in the qwen2-vl model file; all other models just copy it from there. As the name suggests, the utility builds vision position ids, and the models are free to manipulate them further as they wish. In most cases, the only thing that changes is the presence of new modalities or kwargs.
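For readers unfamiliar with the scheme, here is a toy sketch of what "building 3D position ids from mm_token_type_ids" can look like. The function name, the single-image assumption, and the position bookkeeping are illustrative only, not the actual get_vision_position / get_rope_index code:

```python
import torch

def build_3d_positions(mm_token_type_ids: torch.Tensor, grid_thw: tuple) -> torch.Tensor:
    """Toy M-RoPE-style positions: rows of the (3, seq_len) result are
    (temporal, height, width). Assumes a single image run marked with
    type id 1; text tokens are type id 0."""
    t, h, w = grid_thw
    n = mm_token_type_ids.shape[0]
    positions = torch.zeros(3, n, dtype=torch.long)
    next_pos = 0  # next scalar position for text tokens
    i = 0
    while i < n:
        if mm_token_type_ids[i] == 0:
            # Text token: all three axes share one incrementing id.
            positions[:, i] = next_pos
            next_pos += 1
            i += 1
        else:
            # Image run: spread t*h*w tokens over the (t, h, w) grid,
            # offset by the current text position.
            span = t * h * w
            positions[0, i : i + span] = next_pos + torch.arange(t).repeat_interleave(h * w)
            positions[1, i : i + span] = next_pos + torch.arange(h).repeat_interleave(w).repeat(t)
            positions[2, i : i + span] = next_pos + torch.arange(w).repeat(t * h)
            next_pos += max(t, h, w)  # text resumes after the largest axis
            i += span
    return positions
```

Models can then post-process these ids for extra modalities or kwargs, which is the "manipulate on top" idea above.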

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Member Author

run-slow: ernie4_5_vl_moe, qwen2_5_vl, qwen2_vl

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe", "models/qwen2_5_vl", "models/qwen2_vl"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN de19dc95 workflow commit (merge commit)
PR c066f48b branch commit (from PR)
main cfef7f14 base commit (on main)

Model CI Report

11 new failed tests from this PR 😭

  • ernie4_5_vl_moe:
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test (✅ ⟹ ❌)
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test_batch (✅ ⟹ ❌)
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test_batch_different_resolutions (✅ ⟹ ❌)
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test_batch_wo_image (✅ ⟹ ❌)
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test_expand (✅ ⟹ ❌)
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test_with_video (✅ ⟹ ❌)

  • qwen2_5_vl:
    tests/models/qwen2_5_vl/test_modeling_qwen2_5_vl.py::Qwen2_5_VLIntegrationTest::test_small_model_integration_test_with_video (❌ ⟹ ❌)
    tests/models/qwen2_5_vl/test_processing_qwen2_5_vl.py::Qwen2_5_VLProcessorTest::test_apply_chat_template_video_frame_sampling (✅ ⟹ ❌)
    tests/models/qwen2_5_vl/test_processing_qwen2_5_vl.py::Qwen2_5_VLProcessorTest::test_model_input_names (✅ ⟹ ❌)

  • qwen2_vl:
    tests/models/qwen2_vl/test_processing_qwen2_vl.py::Qwen2VLProcessorTest::test_apply_chat_template_video_frame_sampling (✅ ⟹ ❌)
    tests/models/qwen2_vl/test_processing_qwen2_vl.py::Qwen2VLProcessorTest::test_model_input_names (✅ ⟹ ❌)

@zucchini-nlp
Member Author

run-slow: glm46v, glm4v, glm4v_moe, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe"]
quantizations: []

@zucchini-nlp zucchini-nlp changed the title from [WIP] Unify 3D position ids to Unify 3D position ids on Feb 17, 2026
@zucchini-nlp
Member Author

run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

1 similar comment
@zucchini-nlp
Member Author

run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

@zucchini-nlp
Member Author

run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 0bd60dc6 workflow commit (merge commit)
PR e34bc8ad branch commit (from PR)
main 66b40e69 base commit (on main)

⚠️ No test being reported (jobs are skipped or cancelled)!

@github-actions
Contributor

Workflow Run ⚙️💔 This comment contains run-slow, but unknown error occurred and the workflow run aborted!

@zucchini-nlp
Member Author

run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 477a82b0 workflow commit (merge commit)
PR 7a4df899 branch commit (from PR)
main ffbea2db base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@zucchini-nlp
Member Author

The slow CI takes really long now, it took the whole work-day for 13 models 🙃

@zucchini-nlp
Member Author

run-slow: qwen3_vl, glm_image, qwen2_5_vl, glm4v, ernie_4_5vl_moe

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/glm4v", "models/glm_image", "models/qwen2_5_vl", "models/qwen3_vl"]
quantizations: []

@zucchini-nlp zucchini-nlp requested a review from vasqu February 19, 2026 11:39
@zucchini-nlp
Member Author

run-slow: glm4v

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/glm4v"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 18899f4e workflow commit (merge commit)
PR 7d34fbcd branch commit (from PR)
main 00cc937c base commit (on main)

Model CI Report

33 new failed tests from this PR 😭

  • glm4v:
    tests/models/glm4v/test_modeling_glm4v.py::Glm4vIntegrationTest::test_small_model_integration_test (✅ ⟹ ❌)
    tests/models/glm4v/test_modeling_glm4v.py::Glm4vIntegrationTest::test_small_model_integration_test_batch (✅ ⟹ ❌)
    tests/models/glm4v/test_modeling_glm4v.py::Glm4vIntegrationTest::test_small_model_integration_test_batch_different_resolutions (✅ ⟹ ❌)
    tests/models/glm4v/test_modeling_glm4v.py::Glm4vIntegrationTest::test_small_model_integration_test_batch_wo_image (✅ ⟹ ❌)
    tests/models/glm4v/test_modeling_glm4v.py::Glm4vIntegrationTest::test_small_model_integration_test_expand (✅ ⟹ ❌)
    tests/models/glm4v/test_modeling_glm4v.py::Glm4vIntegrationTest::test_small_model_integration_test_with_video (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_apply_chat_template_assistant_mask (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_apply_chat_template_decoded_video_0 (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_apply_chat_template_image_0 (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_apply_chat_template_image_1 (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_apply_chat_template_video_0 (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_apply_chat_template_video_1 (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_apply_chat_template_video_frame_sampling (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_get_num_multimodal_tokens_matches_processor_call (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_image_processor_defaults_preserved_by_image_kwargs (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_kwargs_overrides_default_image_processor_kwargs (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_kwargs_overrides_default_tokenizer_kwargs (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_kwargs_overrides_default_tokenizer_kwargs_video (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_kwargs_overrides_default_video_processor_kwargs (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_model_input_names (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_processor_text_has_no_visual (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_processor_with_multiple_inputs (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_structured_kwargs_nested (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_structured_kwargs_nested_from_dict (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_structured_kwargs_nested_from_dict_video (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_structured_kwargs_nested_video (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_tokenizer_defaults_preserved_by_kwargs (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_tokenizer_defaults_preserved_by_kwargs_video (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_unstructured_kwargs (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_unstructured_kwargs_batched (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_unstructured_kwargs_batched_video (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_unstructured_kwargs_video (✅ ⟹ ❌)
    tests/models/glm4v/test_processor_glm4v.py::Glm4vProcessorTest::test_video_processor_defaults_preserved_by_video_kwargs (✅ ⟹ ❌)

@zucchini-nlp
Member Author

zucchini-nlp commented Feb 20, 2026

Retriggering later; CI was run on a bad commit!

@vasqu
Contributor

vasqu commented Feb 20, 2026

No worries, should I still take a look?

@zucchini-nlp
Member Author

Yes please, the rest is to adjust processor tests and re-run slow CI

Contributor

@vasqu vasqu left a comment

Overall LGTM, just added a few smaller questions. Something that we could further improve is to not iterate over each sample within the batch but dw, probably better suited for a future PR that builds on this

(trusting that the tests will be fixed :D)

}

for batch_idx, current_input_ids in enumerate(input_ids):
input_token_type = mm_token_type_ids[batch_idx]
Contributor

Now that we require mm_token_type_ids, I suspect this might be breaking for a lot of outside users?

Unsure how many really use our internals here tbh, would still mark it with 🚨 though. How does vLLM handle this? Do they rely on our processors, or do they provide the position ids on their own?

Member Author

Good point, vLLM indeed does call get_rope_index. I will coordinate with Harry on that.

2: iter(video_grid_thw) if video_grid_thw is not None else None,
}

for batch_idx, current_input_ids in enumerate(input_ids):
Contributor

I think this can also be improved, although it's not super easy. We would need to track per-sample offsets somehow and subtract them from the found groups. (tl;dr: one long flattened sequence to iterate over instead of per-batch samples)

If you have the motivation, would be nice but not essential to this PR - I'd rather focus on the unification first.
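If one wanted to try the flattening idea, a rough sketch could look like this. This is speculative and not code from this PR; it assumes padding keeps multimodal runs from crossing sample boundaries:

```python
import torch

def flat_mm_spans(mm_token_type_ids: torch.Tensor):
    """Find multimodal runs once over the flattened batch.

    Returns one (batch_idx, start_in_sample, length) tuple per run.
    Speculative sketch; assumes runs never cross sample boundaries
    (padding normally guarantees this).
    """
    bsz, seq_len = mm_token_type_ids.shape
    flat = mm_token_type_ids.reshape(-1)
    # Mark positions where the token type changes (run boundaries).
    change = torch.ones(flat.shape[0], dtype=torch.bool)
    change[1:] = flat[1:] != flat[:-1]
    starts = torch.nonzero(change).flatten()
    spans = []
    for i, s in enumerate(starts):
        e = starts[i + 1] if i + 1 < len(starts) else flat.shape[0]
        if flat[s] != 0:  # keep only multimodal runs, skip text
            # Map the flat offset back to (batch_idx, local_offset).
            spans.append((int(s // seq_len), int(s % seq_len), int(e - s)))
    return spans
```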

Member Author

Hm, I am not sure that will be justified; we usually get relatively small batch sizes, and iterating over each sample shouldn't take much time.

I will see what we can do after merging the PR, but I'll put it low prio for now

Contributor

Yea no worries, just a thought on improvement since it could add up when we want to improve on big batch inference.

model_kwargs.get("image_grid_thw") is not None or model_kwargs.get("video_grid_thw") is not None
if (
is_input_ids
and model_kwargs.get("mm_token_type_ids") is not None
Contributor

Do we really want to be that strict? We could also init with zeros otherwise? Unaware whether this would bring more silent issues instead

Member Author

I think without mm_token_type_ids we will end up with garbage positions in any case. I added this line so we don't have to check inside rope_index and return early.

This way, we fall back to the basic incremental position ids from the generate utils, and we don't have to duplicate that code inside get_rope_index.
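In sketch form, the gating described here looks roughly like the following. The names and signature are illustrative, not the actual generate() internals:

```python
import torch

def choose_position_ids(input_ids, mm_token_type_ids=None, rope_index_fn=None):
    # Illustrative gate: without mm_token_type_ids we cannot build meaningful
    # 3D positions, so fall back to basic incremental ids instead of making
    # rope_index_fn guard against missing inputs internally.
    if mm_token_type_ids is None or rope_index_fn is None:
        seq_len = input_ids.shape[-1]
        return torch.arange(seq_len).unsqueeze(0).expand(3, seq_len)
    return rope_index_fn(input_ids, mm_token_type_ids)
```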

Comment on lines +1122 to +1131
# GLM4V splits video into segments per frame but there's only one `grid_thw`
# per whole video. We can't exhaust the iterator and have to re-use the grid
# while processing the same video!
if modality_type == 2:
if video_group_index == 0:
grid_thw = next(grid_iters[modality_type])
video_group_index += 1
video_group_index = 0 if video_group_index >= grid_thw[0] else video_group_index
else:
grid_thw = next(grid_iters[modality_type])
Contributor

This feels like an opportunity for a helper function? So much is the same, it's a shame if we could not use modular a bit more
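One possible shape for such a helper (illustrative only, not code from this PR): a small stateful iterator that hides the "reuse grid_thw across per-frame video segments" bookkeeping, so each model's loop just calls next_grid(...).

```python
class GridTracker:
    """Hand out grid_thw tuples, optionally reusing one grid per video frame.

    Hypothetical helper sketch: grid[0] is assumed to be the number of
    frames, as in the GLM4V snippet quoted in this thread.
    """

    def __init__(self, grids):
        self._iter = iter(grids)
        self._current = None
        self._group_index = 0

    def next_grid(self, reuse_per_frame: bool):
        if not reuse_per_frame:
            return next(self._iter)  # e.g. images: one grid per segment
        if self._group_index == 0:
            self._current = next(self._iter)  # first segment of a new video
        self._group_index += 1
        if self._group_index >= self._current[0]:  # consumed all frames
            self._group_index = 0
        return self._current
```

With modular, a helper like this could live once in the qwen2-vl file and be copied into the other models.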

@zucchini-nlp zucchini-nlp changed the title from Unify 3D position ids to :rotating_lights: Unify 3D position ids on Feb 23, 2026
@zucchini-nlp zucchini-nlp changed the title from :rotating_lights: Unify 3D position ids to 🚨 Unify 3D position ids on Feb 23, 2026
@zucchini-nlp
Member Author

run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe, video_llama_3

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe", "models/video_llama_3"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN cc65d857 workflow commit (merge commit)
PR 42a886f3 branch commit (from PR)
main a3dcad9e base commit (on main)

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@zucchini-nlp
Member Author

Same tests failing as on main for slow CI, verified manually with the latest run. I'll merge when vLLM is okay with this change.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe, video_llama_3

@tomaarsen
Member

This PR introduces a regression for video inputs for Qwen models; see #44479

  • Tom Aarsen

github-merge-queue bot pushed a commit to linkedin/Liger-Kernel that referenced this pull request Mar 6, 2026
(#1120)

## Summary

This PR fixes the `lce_forward` function for VL models, adding support for the optional `mm_token_type_ids` parameter used in multimodal processing:
- `glm4v`
- `glm4v_moe`
- `qwen2_vl`
- `qwen2_5_vl`
- `qwen3_vl`
- `qwen3_vl_moe`

Fix #1117.

This fixes a ValueError in `model.generate()` with transformers > 5.2.0,
after they merged:
- huggingface/transformers#43972

See related issue downstream in TRL:
- huggingface/trl#5216
- huggingface/trl#5201

## Details

Multimodal token type support:

* Added the `mm_token_type_ids` optional argument (of type
`torch.IntTensor`) to the signature of `lce_forward`, allowing for the
specification of multimodal token type IDs.
* Passed the `mm_token_type_ids` argument to the underlying model call,
ensuring it is incorporated into the forward computation.
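The pass-through pattern described above amounts to something like this (a simplified sketch with an assumed signature, not the actual Liger-Kernel code):

```python
def lce_forward(model, input_ids, mm_token_type_ids=None, **kwargs):
    # Accept the new optional argument and forward it verbatim so the model
    # can build its 3D position ids; everything else stays unchanged.
    return model(
        input_ids=input_ids,
        mm_token_type_ids=mm_token_type_ids,
        **kwargs,
    )
```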



## Testing Done

We checked this fix solves the issue downstream in TRL.

- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence