🚨 Unify 3D position ids#43972
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: ernie4_5_vl_moe, qwen2_5_vl, qwen2_vl

This comment contains models: ["models/ernie4_5_vl_moe", "models/qwen2_5_vl", "models/qwen2_vl"]
CI Results / Commit Info
Model CI Report: ❌ 11 new failed tests from this PR 😭
run-slow: glm46v, glm4v, glm4v_moe, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

This comment contains models: ["models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe"]
run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe"]
The slow CI takes really long now; it took the whole work-day for 13 models 🙃
run-slow: qwen3_vl, glm_image, qwen2_5_vl, glm4v, ernie_4_5vl_moe

This comment contains models: ["models/glm4v", "models/glm_image", "models/qwen2_5_vl", "models/qwen3_vl"]
…ing start-end tokens
run-slow: glm4v

This comment contains models: ["models/glm4v"]
CI Results / Commit Info
Model CI Report: ❌ 33 new failed tests from this PR 😭
Retriggering later, CI ran on a bad commit!

No worries, should I still take a look?

Yes please, the rest is to adjust processor tests and re-run slow CI
vasqu
left a comment
Overall LGTM, just added a few smaller questions. Something we could further improve is not iterating over each sample within the batch, but don't worry, that's probably better suited for a future PR that builds on this.
(trusting that the tests will be fixed :D)
```python
}

for batch_idx, current_input_ids in enumerate(input_ids):
    input_token_type = mm_token_type_ids[batch_idx]
```
Now that we require `mm_token_type_ids`, I suspect this might be breaking for a lot of outside users? Unsure how many really use our internals here tbh, but I would still mark it with 🚨. How does vLLM handle this? Do they rely on our processors, or do they provide the position ids on their own?
Good point, vLLM indeed does call `get_rope_index`. I will coordinate with Harry on that.
```python
    2: iter(video_grid_thw) if video_grid_thw is not None else None,
}

for batch_idx, current_input_ids in enumerate(input_ids):
```
I think this can also be improved, although it's not super easy. We would need to track offsets per batch sequence somehow and remove them from the found groups respectively. (tl;dr: one long flattened sequence to iterate over instead of per batch sample)
If you have the motivation, it would be nice, but it's not essential to this PR. I'd rather focus on the unification first.
Hm, I am not sure that would be justified since we usually get relatively small batch sizes, and iterating over each shouldn't take much time.
I will see what we can do after merging the PR, but I'll put it at low priority for now.
Yeah, no worries, just a thought on improvement since it could add up when we want to improve big-batch inference.
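The flattening idea discussed above could look roughly like this. Everything here (function names, the index-mapping helper) is a hypothetical sketch of the suggested improvement, not code from the PR:

```python
import torch

def flatten_batch(input_ids):
    # Hypothetical sketch: concatenate the padded batch into one long sequence
    # and keep each sample's start offset, so vision-token groups found in the
    # flat sequence can be mapped back to their sample without a per-sample
    # Python loop.
    batch_size, seq_len = input_ids.shape
    flat = input_ids.reshape(-1)
    offsets = torch.arange(batch_size) * seq_len
    return flat, offsets

def unflatten_index(flat_idx, seq_len):
    # Map a position in the flat sequence back to (batch_idx, local_position).
    return flat_idx // seq_len, flat_idx % seq_len
```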
```python
model_kwargs.get("image_grid_thw") is not None or model_kwargs.get("video_grid_thw") is not None
if (
    is_input_ids
    and model_kwargs.get("mm_token_type_ids") is not None
```
Do we really want to be that strict? We could also init with zeros otherwise? Though I'm unsure whether that would bring more silent issues instead.
I think without `mm_token_type_ids` we will end up with garbage positions in any case. I added this line so we don't have to check inside `get_rope_index` and early return.
This way, we fall back to basic incremental position ids from the generate utils, and we don't have to duplicate that code inside `get_rope_index`.
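As a rough illustration of that fallback (a hedged sketch only; the real logic lives in the generation utilities, and the function name here is made up):

```python
import torch

def fallback_position_ids(input_ids):
    # Hypothetical sketch: without `mm_token_type_ids` we can't tell which
    # tokens are vision tokens, so 3D rope positions would be garbage anyway.
    # Plain incremental ids, replicated across the 3 rope axes, are used instead.
    batch_size, seq_len = input_ids.shape
    positions = torch.arange(seq_len).unsqueeze(0).expand(batch_size, seq_len)
    return positions.unsqueeze(0).expand(3, batch_size, seq_len)
```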
```python
# GLM4V splits video into segments per frame but there's only one `grid_thw`
# per whole video. We can't exhaust the iterator and have to re-use the grid
# while processing the same video!
if modality_type == 2:
    if video_group_index == 0:
        grid_thw = next(grid_iters[modality_type])
    video_group_index += 1
    video_group_index = 0 if video_group_index >= grid_thw[0] else video_group_index
else:
    grid_thw = next(grid_iters[modality_type])
```
This feels like an opportunity for a helper function? So much is the same; it's a shame if we could not use modular a bit more.
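One possible shape for such a helper (a sketch only; the name and signature are invented, not what the PR ends up using):

```python
def next_grid_thw(grid_iters, modality_type, grid_thw, video_group_index):
    # Hypothetical helper folding the shared iterator logic into one place.
    # For videos (modality_type == 2) the same `grid_thw` is re-used until all
    # of its frames (grid_thw[0]) have been consumed; other modalities simply
    # advance their iterator.
    if modality_type != 2:
        return next(grid_iters[modality_type]), video_group_index
    if video_group_index == 0:
        grid_thw = next(grid_iters[modality_type])
    video_group_index += 1
    if video_group_index >= grid_thw[0]:
        video_group_index = 0
    return grid_thw, video_group_index
```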
run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe, video_llama_3

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/paddleocr_vl", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_vl", "models/qwen3_vl_moe", "models/video_llama_3"]
CI Results / Commit Info
The test failure analysis could not be completed. Please check the workflow run for details.

Same tests failing as in

[For maintainers] Suggested jobs to run (before merge) run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe, video_llama_3
This PR is introducing a regression for video inputs for Qwen models, see #44479
#1120

## Summary
This PR fixes the `lce_forward` function for VL models, adding support for the optional `mm_token_type_ids` parameter related to multimodal processing:
- `glm4v`
- `glm4v_moe`
- `qwen2_vl`
- `qwen2_5_vl`
- `qwen3_vl`
- `qwen3_vl_moe`

Fix #1117. This fixes a ValueError in `model.generate()` with transformers > 5.2.0, after they merged:
- huggingface/transformers#43972

See related issues downstream in TRL:
- huggingface/trl#5216
- huggingface/trl#5201

## Details
Multimodal token type support:
* Added the optional `mm_token_type_ids` argument (of type `torch.IntTensor`) to the signature of `lce_forward`, allowing the specification of multimodal token type IDs.
* Passed the `mm_token_type_ids` argument to the underlying model call, ensuring it is incorporated into the forward computation.

## Testing Done
We checked that this fix solves the issue downstream in TRL.
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
What does this PR do?
Following Ernie, we build 3D positions based on `mm_token_type_ids`, and the models will return them by default from the processor. We have a unified `get_vision_position` in the qwen2-vl model file; all other models just copy it from there. The utility builds vision ids, as the name suggests, and the models are free to manipulate on top as they wish. In most cases, the only thing that changes is the presence of new modalities or kwargs.
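In spirit, the utility enumerates one position id per vision token along the temporal, height, and width axes of the patch grid. A minimal sketch, assuming a single `grid_thw` and ignoring the spatial-merge and text-interleaving details that the real `get_vision_position` handles (the function name below is illustrative only):

```python
import torch

def vision_position_sketch(grid_thw):
    # Hypothetical sketch: enumerate (t, h, w) coordinates for every patch in
    # a single image/video grid, stacked into a (3, num_patches) tensor.
    t, h, w = grid_thw
    t_ids = torch.arange(t).view(t, 1, 1).expand(t, h, w).reshape(-1)
    h_ids = torch.arange(h).view(1, h, 1).expand(t, h, w).reshape(-1)
    w_ids = torch.arange(w).view(1, 1, w).expand(t, h, w).reshape(-1)
    return torch.stack([t_ids, h_ids, w_ids])  # shape (3, t * h * w)
```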