BUGfix: Fix image_grid_thw IndexError in GRPOTrainer with Multimodal Models (Qwen3-VL) due to None Values in Chat Content #5364

Closed
SolarWindRider wants to merge 4 commits into huggingface:main from SolarWindRider:qwen3vl

@SolarWindRider SolarWindRider commented Mar 24, 2026

Contributor

Fix IndexError in GRPOTrainer with Multimodal Models due to None Values in Chat Content

Summary

This PR fixes a critical bug in GRPOTrainer that causes training to fail completely when using multimodal models (Qwen3-VL) where chat messages contain content blocks with None values—a common pattern when datasets are processed by automated pipelines.

The Problem

Severity: 🔴 Critical (Training Blocker)

When training with GRPOTrainer on Qwen3-VL, I encounter this cryptic error:

  File "/home/ma-user/work/avr/train_grpo.py", line 98, in <module>
    trainer.train()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.11/site-packages/transformers/trainer.py", line 1424, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.11/site-packages/transformers/trainer.py", line 1506, in _inner_training_loop
    self._run_epoch(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.11/site-packages/transformers/trainer.py", line 1734, in _run_epoch
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/work/trl-my/trl/trainer/grpo_trainer.py", line 1083, in training_step
    output = super().training_step(model, inputs, num_items_in_batch)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.11/site-packages/transformers/trainer.py", line 1900, in training_step
    inputs = self._prepare_inputs(inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/work/trl-my/trl/extras/profiling.py", line 202, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/work/trl-my/trl/trainer/grpo_trainer.py", line 1112, in _prepare_inputs
    generation_batch = self._generate_and_score_completions(generation_batch)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/work/trl-my/trl/trainer/grpo_trainer.py", line 1766, in _generate_and_score_completions
    ) = self._generate(prompts)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/work/trl-my/trl/trainer/grpo_trainer.py", line 1614, in _generate
    prompt_ids, images, multimodal_fields = self._tokenize_prompts(prompts)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/work/trl-my/trl/trainer/grpo_trainer.py", line 1261, in _tokenize_prompts
    tokenized = self.processing_class.apply_chat_template(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.11/site-packages/transformers/processing_utils.py", line 1829, in apply_chat_template
    out = self(
          ^^^^^
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.11/site-packages/transformers/models/qwen3_vl/processing_qwen3_vl.py", line 132, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 2 is out of bounds for axis 0 with size 2
[ERROR] 2026-03-25-00:17:32 (PID:2357248, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception

This error occurs deep inside transformers' processing_qwen3_vl.py:

num_image_tokens = image_grid_thw[index].prod() // merge_length
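
A simplified sketch of how this line can fail, not the actual transformers code. It assumes the processor consumes one grid row per `<|image_pad|>` placeholder found in the rendered text; `expand_image_placeholders`, the list-based grid, the grid values, and `merge_length=4` are illustrative assumptions.

```python
from math import prod

IMAGE_TOKEN = "<|image_pad|>"


def expand_image_placeholders(texts, image_grid_thw, merge_length=4):
    """Count the image tokens each placeholder would expand to.

    Simplified stand-in for the processor loop: one grid row is consumed per
    placeholder found in the rendered text, across the whole batch.
    """
    index = 0
    token_counts = []
    for text in texts:
        for _ in range(text.count(IMAGE_TOKEN)):
            # Raises IndexError once placeholders outnumber the real images.
            token_counts.append(prod(image_grid_thw[index]) // merge_length)
            index += 1
    return token_counts


# Two real images in the batch, so two (t, h, w) grid rows (made-up values).
grids = [(1, 12, 40), (1, 12, 40)]

# Correct rendering: one placeholder per prompt.
good = ["<|vision_start|><|image_pad|><|vision_end|>question"] * 2
good_counts = expand_image_placeholders(good, grids)

# Faulty rendering: three placeholders per prompt, more than there are images.
bad = ["<|image_pad|><|image_pad|><|image_pad|>"] * 2
try:
    expand_image_placeholders(bad, grids)
    failed = False
except IndexError:
    failed = True
```

With two images and three placeholders in the first prompt, the third lookup uses index 2 on a grid of size 2, which matches the shape of the error in the traceback.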

Root Cause Analysis

The error message is misleading:

  1. Surface Level: The crash happens in processing_qwen3_vl.py when accessing image_grid_thw[index]
  2. Misleading: The stack trace suggests an image-processing bug
  3. Truth: The actual issue is in chat template rendering within trl

The debugging journey:

Through breakpoint debugging, I traced the issue to the chat template rendering step.

transformers/utils/chat_template_utils.py, line 555:

rendered_chat = compiled_template.render(
    messages=chat,
    tools=tool_schemas,
    documents=documents,
    add_generation_prompt=add_generation_prompt,
    **kwargs,
)

# The dataset automatically adds keys with None values
print(chat)
[{'content': [{'image': None, 'text': 'You are good at step by step reasoning.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=560x168 at 0xFFFC18462D90>, 'text': None, 'type': 'image'}, {'image': None, 'text': '[Logical Reasoning]  \nThe left image shows the unfolded surface of a cube-shaped box. Which option can be folded into the cube depicted?option: A,B,C,D\nWrite the answer into a JSON form\n```json\n{"answer": "X"}```', 'type': 'text'}], 'role': 'user'}]


# rendered_chat is missing information (the text is lost), causing the error above.

print(rendered_chat)
<|im_start|>system
<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant
<think>



# Pop all keys whose value is None
print(chat2)
[{'content': [{'text': 'You are good at step by step reasoning.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=560x168 at 0xFFFC18462D90>, 'type': 'image'}, {'text': '[Logical Reasoning]  \nThe left image shows the unfolded surface of a cube-shaped box. Which option can be folded into the cube depicted?option: A,B,C,D\nWrite the answer into a JSON form\n```json\n{"answer": "X"}```', 'type': 'text'}], 'role': 'user'}]

# The rendered result is now correct
print(rendered_chat2)
<|im_start|>system
You are good at step by step reasoning.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>[Logical Reasoning]  
The left image shows the unfolded surface of a cube-shaped box. Which option can be folded into the cube depicted?option: A,B,C,D
Write the answer into a JSON form
```json
{"answer": "X"}```<|im_end|>
<|im_start|>assistant
<think>
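
The two renderings above are consistent with a template that treats a block as an image as soon as the `image` key is present, regardless of its value. The sketch below imitates that behavior in plain Python. It is an assumption about the template's logic, not the Qwen3-VL template itself, and `render_content` is a hypothetical helper.

```python
def render_content(blocks):
    """Render content blocks the way a key-presence check would.

    Hypothetical simplification: a block counts as an image as soon as the
    "image" key exists, whatever its value.
    """
    out = []
    for block in blocks:
        if block.get("type") == "image" or "image" in block:
            out.append("<|vision_start|><|image_pad|><|vision_end|>")
        elif "text" in block:
            out.append(block["text"])
    return "".join(out)


polluted = [{"image": None, "text": "You are good at step by step reasoning.", "type": "text"}]
clean = [{"text": "You are good at step by step reasoning.", "type": "text"}]

polluted_rendered = render_content(polluted)  # text is replaced by an image placeholder
clean_rendered = render_content(clean)        # text is rendered as-is
```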

The Fix

Filter out None values from content blocks before passing to apply_chat_template():

# Before: {'image': None, 'text': 'reasoning'} → <|image_pad|> (text lost!)
# After:  {'text': 'reasoning'}                 → "reasoning" (correct!)

Location: trl/trainer/grpo_trainer.py, line ~1709, in the input processing loop

The fix is minimal and surgical: it is applied exactly where prompts are processed, which limits its impact on the rest of the trainer.
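
A minimal sketch of the filtering idea, assuming a recursive walk over dicts and lists. The helper name `strip_none` is hypothetical; the PR's own implementation is not shown in this thread and may differ.

```python
def strip_none(value):
    """Recursively drop dict keys whose value is None.

    Lists are traversed but keep their length; non-container values
    (strings, numbers, image objects) are returned unchanged.
    """
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


message = {
    "role": "user",
    "content": [
        {"type": "image", "image": "img.png", "text": None},
        {"type": "text", "text": "Which option folds into the cube?", "image": None},
    ],
}
cleaned = strip_none(message)
```

After cleaning, the image block keeps only `type` and `image`, and the text block keeps only `type` and `text`.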

Impact

  • Who it affects: Anyone using GRPOTrainer with vision-language models (observed with Qwen3-VL)
  • Why it's common: Automated data pipelines often produce None values in optional fields (e.g., {'image': None, 'text': '...', 'type': 'text'})
  • What it breaks: Training fails completely as long as the None values remain in the chat content

Testing

  • Verified that the fix resolves the bug with Qwen3-VL-2B-Thinking

Note

Medium Risk
Touches the core GRPO training/eval generation path and changes the exact kwargs passed through inputs (including env reset kwargs), which could subtly affect datasets that rely on None placeholders.

Overview
Fixes multimodal GRPO training crashes by recursively removing None values from each sample in _generate_and_score_completions before building prompts and running environment resets.

This ensures chat-template rendering/tokenization doesn’t mis-handle None content blocks (e.g., VLM image/text parts), avoiding downstream processor errors like image_grid_thw index mismatches.

Written by Cursor Bugbot for commit 02a4d60.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cleaned_item = remove_empty_fields(item)
cleaned_inputs.append(cleaned_item)
prompts.append(cleaned_item["prompt"])
inputs = cleaned_inputs

Broad None stripping removes top-level image keys breaking detection

High Severity

remove_empty_fields is applied to the entire input dict, not just the prompt content blocks. This strips top-level keys with None values, including "image". When inputs[0] has "image": None (a text-only sample in a mixed batch) but other inputs have actual images, the key is removed from inputs[0]. The subsequent check "image" in inputs[0] then fails, causing images = None and silently losing all images in the batch. The fix should only clean the nested prompt content, not the entire input dict.
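
The scoping concern can be illustrated with a small sketch. `strip_none` and `clean_prompt_only` are hypothetical helpers, not functions from TRL or from this PR, and the sample below is a made-up text-only example.

```python
def strip_none(value):
    """Recursively drop dict keys whose value is None."""
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


def clean_prompt_only(sample):
    """Clean the nested prompt but leave top-level keys untouched."""
    return {**sample, "prompt": strip_none(sample["prompt"])}


text_only = {
    "prompt": [{"role": "user", "content": [{"type": "text", "text": "hi", "image": None}]}],
    "image": None,  # text-only sample in a mixed batch
}

whole = strip_none(text_only)          # also drops the top-level "image" key
scoped = clean_prompt_only(text_only)  # keeps the top-level "image" key
```

With whole-sample cleaning, a later `"image" in sample` check returns False for this sample; with prompt-only cleaning the key survives and only the content block is changed.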

Contributor Author

inputs[0]["image"] should be a PIL image, so it will not be removed by my changes.

print(inputs)
[{'prompt': [{'content': [{'image': None, 'text': 'You are good at step by step reasoning.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': '../datas/VisuRiddles/images/sichuan/2021_59.png', 'text': None, 'type': 'image'}, {'image': None, 'text': '[Logical Reasoning] \nThe left image shows the unfolded surface of a cube-shaped box. Which option can be folded into the cube depicted?option: A,B,C,D\nWrite the answer into a JSON form\njson\n{"answer": "X"}', 'type': 'text'}], 'role': 'user'}], 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=560x168 at 0xFFFBE5FEB5D0>, 'metadatas': {'gold_answer': 'A'}}, {'prompt': [{'content': [{'image': None, 'text': 'You are good at step by step reasoning.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': '../datas/VisuRiddles/images/sichuan/2021_59.png', 'text': None, 'type': 'image'}, {'image': None, 'text': '[Logical Reasoning] \nThe left image shows the unfolded surface of a cube-shaped box. Which option can be folded into the cube depicted?option: A,B,C,D\nWrite the answer into a JSON form\njson\n{"answer": "X"}', 'type': 'text'}], 'role': 'user'}], 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=560x168 at 0xFFFBE5FEBC90>, 'metadatas': {'gold_answer': 'A'}}]

cleaned_item = remove_empty_fields(item)
cleaned_inputs.append(cleaned_item)
prompts.append(cleaned_item["prompt"])
inputs = cleaned_inputs

Fix not propagated to RLOO trainer's duplicated code

Medium Severity

The remove_empty_fields logic was added only to grpo_trainer.py but not to rloo_trainer.py, which has the same duplicated _generate_and_score_completions method with the identical prompts = [x["prompt"] for x in inputs] pattern. Per project rules, changes to duplicated logic across trainers must be applied consistently to all copies.

Triggered by project rule: BUGBOT.md

Contributor Author

I'm not sure if the subsequent execution flow and call stack of rloo_trainer.py are exactly the same as grpo_trainer.py. So, to be safe, I will only modify the grpo_trainer that has already been tested.

@qgallouedec

Member

Thanks, I understand the issue. However, I’m not convinced that TRL should support this kind of "polluted" dataset. It seems more appropriate for users to handle data cleaning upstream.

As a general rule of thumb, if this isn’t supported in Transformers (as indicated by the error), then it probably shouldn’t be supported in TRL either. Otherwise, we risk going down a slippery slope where supporting one such case leads to an endless stream of similar edge cases.

In this case the easiest is probably to map:

def clean_empty_images(example):
    for message in example["prompt"]:
        for element in message["content"]:
            if element["type"] == "text" and "image" in element:
                element.pop("image")
    return example

dataset = dataset.map(clean_empty_images)

@albertvillanova what do you think?

@SolarWindRider

Contributor Author

Thank you for your reply!

Actually, the None value "pollution" is introduced by dataset.map() itself. Check this out.

from transformers import AutoProcessor
from datasets import Dataset

model_name_or_path = "/home/ma-user/work/Downloads/Models/Qwen/Qwen3-VL-2B-Thinking"
processor = AutoProcessor.from_pretrained(model_name_or_path)
full_question = """What's on the image ? """

samples = [[
    {"role": "system", "content": [{"type": "text", "text": "You are good at step by step reasoning."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": full_question},
        ],
    },
],[
    {"role": "system", "content": [{"type": "text", "text": "You are good at step by step reasoning."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": full_question,},
        ],
    },
]]

dataset = Dataset.from_list([
    {"prompt": s}
    for s in samples
])

def clean_empty_images(example):
    for message in example["prompt"]:
        for element in message["content"]:
            if element["type"] == "text" and "image" in element:
                element.pop("image")
    return example

dataset1 = dataset.map(clean_empty_images)  # dataset.map() is actually the pollution source

print(dataset1[0])
"""
{'prompt': [{'content': [{'image': None, 'text': 'You are good at step by step reasoning.', 'type': 'text'}], 'role': 'system'}, {'content': [{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg', 'text': None, 'type': 'image'}, {'image': None, 'text': "What's on the image ? ", 'type': 'text'}], 'role': 'user'}]}
"""

Indeed, as a general rule, it would be better to either 1) fix dataset.map() so that it does not generate None-valued keys, or 2) fix the Jinja2 rendering so that tokenization tolerates None-valued keys. However, I'm not very familiar with these two libraries and am more comfortable with TRL. Given my limited expertise, the current code change is the best solution I can come up with to help the community with this bug. If your engineers can fix dataset.map(), that would be even better!

@albertvillanova albertvillanova left a comment

Member

Thanks for flagging this, for the investigation, and for the proposed fix, @SolarWindRider! And thanks for the ping, @qgallouedec, really appreciate it. 🤗

This is actually a known issue coming from datasets when mixed types introduce None values.
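
Why mixed content blocks end up with None-valued keys can be illustrated without Arrow at all. The sketch below is a pure-Python imitation, assuming that every struct in a list must share one schema; `unify_structs` is a hypothetical helper, not a datasets or Arrow API.

```python
def unify_structs(rows):
    """Give every dict in `rows` the same keys, filling gaps with None.

    Pure-Python imitation of what a shared struct schema requires; this is
    not how Arrow is implemented.
    """
    keys = sorted({key for row in rows for key in row})
    return [{key: row.get(key) for key in keys} for row in rows]


content = [
    {"type": "image", "image": "demo.jpeg"},
    {"type": "text", "text": "What's on the image ?"},
]
unified = unify_structs(content)
```

The image block gains `text: None` and the text block gains `image: None`, which is the same pattern seen in the printed dataset rows earlier in this thread.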

We've run into similar problems before and added a small utility (remove_none_values) to sanitize the inputs on our side, and used it for SFT and DPO:

# Tabular backends like Arrow/Parquet insert `None` for mismatched keys in nested structures. Clean them from
# sampled data.
if isinstance(dataset, Dataset):  # IterableDataset does not support `with_transform`
    dataset = dataset.with_transform(remove_none_values)

That said, I have good news: this is now properly addressed upstream by datasets! Recent versions of datasets provide the Json feature type along with on_mixed_types="use_json" during mapping, which avoids introducing these None values in the first place (available since datasets>=4.7.0).

Given that, it might be cleaner to rely on the upstream fix rather than maintaining workarounds on our end. I’m thinking we could pin datasets to a compatible version: I’ll open a small PR for that so we can discuss.
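
Until such a pin exists, a user could check locally whether the installed datasets meets the version quoted above. This is a rough sketch using only the standard library; `parse_release` and `datasets_meets_minimum` are hypothetical helpers, and the parsing ignores pre-release suffixes, so it is only an approximation of real version comparison.

```python
import re
from importlib import metadata


def parse_release(version, width=3):
    """Return the leading numeric components of a version string as a tuple.

    Suffixes such as ".dev0" or "rc1" are ignored, and short versions are
    padded with zeros so that "4.7" compares equal to "4.7.0".
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    parts = [int(p) for p in match.group(0).split(".")] if match else []
    parts += [0] * (width - len(parts))
    return tuple(parts)


def datasets_meets_minimum(minimum=(4, 7, 0)):
    """Check the installed `datasets` against the version quoted in this thread."""
    try:
        installed = metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return False  # datasets is not installed at all
    return parse_release(installed) >= minimum
```

If the check returns False, cleaning the None values before rendering remains necessary.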

@SolarWindRider

Contributor Author

Good to know! I'm closing this PR.
