Add the support for more VLMs (Gemma3 and InternVL) #2327
Qsingle wants to merge 2 commits into verl-project:main
add support for gemma3 and internvl for grpo training
2025-07-04 17:06:43,418 INFO cli.py:88 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 355, in generate_sequences
    response_attention_mask = get_response_mask(
  File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 242, in get_response_mask
    eos_mask = torch.isin(response_id, torch.tensor(eos_token, device=response_id.device)).int()
RuntimeError: Could not infer dtype of NoneType
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I encountered this error while using this PR to train InternVL3. Do you have any suggestions?
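The `RuntimeError` in the traceback above is what `torch.tensor(None)` raises, so the EOS token id reaching `get_response_mask` was `None`. A minimal sketch of a fallback lookup follows. It uses stand-in objects rather than real generation-config or tokenizer classes, and the helper name `resolve_token_id` is hypothetical and not part of verl.

```python
from types import SimpleNamespace


def resolve_token_id(generation_config, tokenizer, name):
    """Prefer the generation config's value, fall back to the tokenizer's."""
    value = getattr(generation_config, name, None)
    if value is not None:  # 0 is a valid token id, so compare against None
        return value
    fallback = getattr(tokenizer, name, None)
    if fallback is None:
        raise ValueError(f"{name} is missing from both generation config and tokenizer")
    return fallback


# Stand-ins for illustration only; real objects would come from the model checkpoint.
generation_config = SimpleNamespace(eos_token_id=None, pad_token_id=None)
tokenizer = SimpleNamespace(eos_token_id=2, pad_token_id=0)

meta_info = {
    "eos_token_id": resolve_token_id(generation_config, tokenizer, "eos_token_id"),
    "pad_token_id": resolve_token_id(generation_config, tokenizer, "pad_token_id"),
}
```

Raising early when both sources are empty gives a clearer message than the dtype error that otherwise surfaces deep inside the rollout worker.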
LoRA training will report an error; it needs to be fixed as shown below.
# verl/utils/fsdp_utils.py + 90
default_transformer_cls_names_to_wrap = getattr(module, "_no_split_modules", None)
if re.match("internvl", module.__class__.__name__, re.IGNORECASE) or (module.__class__.__name__ == 'PeftModelForCausalLM' and re.match("internvl", module.base_model.model.__class__.__name__, re.IGNORECASE)):
    update_cls_names_to_wrap = []
    for mod in default_transformer_cls_names_to_wrap:
        if mod != "LlamaDecoderLayer":
            update_cls_names_to_wrap.append(mod)
    default_transformer_cls_names_to_wrap = update_cls_names_to_wrap
elif re.match("gemma3", module.__class__.__name__, re.IGNORECASE) or (module.__class__.__name__ == 'PeftModelForCausalLM' and re.match("gemma3", module.base_model.model.__class__.__name__, re.IGNORECASE)):
    update_cls_names_to_wrap = []
    for mod in default_transformer_cls_names_to_wrap:
        if mod != "SiglipMultiheadAttentionPoolingHead":
            update_cls_names_to_wrap.append(mod)
    default_transformer_cls_names_to_wrap = update_cls_names_to_wrap
fsdp_transformer_layer_cls_to_wrap = _get_attr(
    "transformer_layer_cls_to_wrap", default_transformer_cls_names_to_wrap
)
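The filtering in the patch above can be expressed as a small table-driven helper. This is only a sketch of the same idea: the names `EXCLUDED_BY_FAMILY` and `filter_wrap_names` are hypothetical, and the module names passed in the usage lines are illustrative inputs rather than values read from a real checkpoint.

```python
import re

# Layer class names to drop from the FSDP wrap list, keyed by model family prefix.
# The two entries mirror the exclusions in the patch above.
EXCLUDED_BY_FAMILY = {
    "internvl": {"LlamaDecoderLayer"},
    "gemma3": {"SiglipMultiheadAttentionPoolingHead"},
}


def filter_wrap_names(model_class_name, no_split_modules):
    """Return the wrap list with family-specific exclusions removed."""
    names = list(no_split_modules or [])  # tolerate a missing _no_split_modules
    for family, excluded in EXCLUDED_BY_FAMILY.items():
        # re.match anchors at the start, so this is a case-insensitive prefix test
        if re.match(family, model_class_name, re.IGNORECASE):
            return [name for name in names if name not in excluded]
    return names


internvl_names = filter_wrap_names(
    "InternVLChatModel", ["InternVisionEncoderLayer", "LlamaDecoderLayer"]
)
gemma3_names = filter_wrap_names(
    "Gemma3ForConditionalGeneration",
    ["Gemma3DecoderLayer", "SiglipMultiheadAttentionPoolingHead"],
)
```

For a PEFT-wrapped model, the caller would pass the class name of the underlying base model, as the patch does with `module.base_model.model`.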
meta_info = {
    "eos_token_id": self.generation_config.eos_token_id
    if getattr(self.generation_config, "eos_token_id", None) is not None
    else self.tokenizer.eos_token_id,
    "pad_token_id": self.generation_config.pad_token_id
    if getattr(self.generation_config, "pad_token_id", None) is not None
    else self.tokenizer.pad_token_id,
}
Seems that
Could you provide the script you used to train the InternVL3? |
ray job submit --address=${RAY_ADDRESS} \
-- python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=[${CURRENT_PATH}/verl_data_with_gt/math_pkg_250701.json_geo3k_acc.parquet] \
data.val_files=${CURRENT_PATH}/verl_data/geo3k/test.parquet \
data.train_batch_size=${ROLLOUT_BATCH_SIZE} \
data.max_prompt_length=18432 \
data.max_response_length=32768 \
data.filter_overlong_prompts=True \
data.filter_overlong_prompts_workers=8 \
data.truncation='error' \
data.image_key=images \
data.trust_remote_code=True \
actor_rollout_ref.model.path=${CURRENT_PATH}/pretrained/InternVL3-1B-64K \
actor_rollout_ref.model.trust_remote_code=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${MICRO_TRAIN_BATCH_SIZE} \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.kl_loss_coef=0.0 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${SEQUENCE_PARALLEL} \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${MICRO_ROLLOUT_BATCH_SIZE} \
actor_rollout_ref.rollout.tensor_model_parallel_size=${TENSOR_PARALLEL} \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.free_cache_engine=False \
actor_rollout_ref.rollout.n=${N_SAMPLES_PER_PROMPT} \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${MICRO_TRAIN_BATCH_SIZE} \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${SEQUENCE_PARALLEL} \
actor_rollout_ref.actor.loss_agg_mode=token-mean \
algorithm.use_kl_in_reward=False \
algorithm.kl_ctrl.kl_coef=0.0 \
trainer.critic_warmup=0 \
trainer.default_local_dir=${OUTPUT_PATH} \
trainer.logger=['console','tensorboard'] \
trainer.project_name=${PROJECT_NAME} \
trainer.experiment_name=${TASK_NAME} \
trainer.n_gpus_per_node=${NPROC_PER_NODE} \
trainer.nnodes=${WORLD_SIZE} \
trainer.save_freq=20 \
trainer.test_freq=5000 \
trainer.val_before_train=False \
trainer.rollout_data_dir=${OUTPUT_PATH}/rollouts \
trainer.total_epochs=100 2>&1 | tee ${JOBLOG}
BTW, I encountered another error when setting
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/single_controller/ray/base.py", line 710, in func
(TaskRunner pid=5245)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/single_controller/base/decorator.py", line 549, in inner
(TaskRunner pid=5245)     return func(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/fsdp_workers.py", line 802, in compute_log_prob
(TaskRunner pid=5245)     output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/debug/performance.py", line 81, in f
(TaskRunner pid=5245)     return self.log(decorated_function, *args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/debug/performance.py", line 94, in log
(TaskRunner pid=5245)     output = func(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/actor/dp_actor.py", line 364, in compute_log_prob
(TaskRunner pid=5245)     entropy, log_probs = self._forward_micro_batch(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/actor/dp_actor.py", line 197, in _forward_micro_batch
(TaskRunner pid=5245)     log_probs = logprobs_from_logits(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 87, in logprobs_from_logits
(TaskRunner pid=5245)     output = logprobs_from_logits_flash_attn(logits, labels, inplace_backward=inplace_backward)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 97, in logprobs_from_logits_flash_attn
(TaskRunner pid=5245)     output = cross_entropy_loss(logits, labels, inplace_backward=inplace_backward)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/flash_attn/ops/triton/cross_entropy.py", line 319, in cross_entropy_loss
(TaskRunner pid=5245)     return CrossEntropyLoss.apply(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
(TaskRunner pid=5245)     return super().apply(*args, **kwargs) # type: ignore[misc]
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/flash_attn/ops/triton/cross_entropy.py", line 170, in forward
(TaskRunner pid=5245)     assert labels.shape == (n_rows,)
(TaskRunner pid=5245) AssertionError
__all__ = ["Gemma3Preprocessor"]

@PREPROCESSOR_REGISTER.register()
BTW, I am thinking of moving all model-related code into the same folder, one per model. #2338 (review)
Given the complexity of multimodal structures, I think it's worth an RFC for the overall approach and design.
Yeah, I think it is a good strategy for the Multi-modality framework.
Thanks for your feedback. I will try to resolve this problem. |
it looks like the "model_init_kwargs" isn't used
Sorry, I forgot to use it in the current version.
Hello, I have one question here. I didn't see any monkey-patch code for the InternVL model. Does that mean InternVL does not require custom code, or that sequence parallelism is not applicable to InternVL yet?
Thanks a lot!
InternVL does not have a special design that requires monkey patching. However, the vision model of InternVL does generate a high memory cost. For example, InternVL-Chat-V1.5, a 26B model, requires about 50G of memory for model parameters in BF16 format, and considering the additional overhead during training, it requires around 100-150G. The special requirement for vision encoder may need some discussion.
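The "about 50G" figure for the BF16 weights follows from simple arithmetic: 26B parameters at 2 bytes each. A minimal sketch of that calculation is below. The helper name `param_memory_gb` is hypothetical, and it covers only the weights, not gradients, optimizer state, or activations.

```python
def param_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# BF16 stores each parameter in 2 bytes.
bf16_weights_gb = param_memory_gb(26e9)
print(f"{bf16_weights_gb:.0f} GB for weights")
```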
Has anyone successfully merged the trained FSDP model into a Hugging Face model? I tried using
Yeah, some code modifications are necessary to provide support. |
I have fixed the merge problem; we need to modify the merge
class BaseModelMerger(ABC):
    ......
    elif "ForConditionalGeneration" in self.model_config.architectures[0]:
        return AutoModelForVision2Seq
+   elif "InternVLChatModel" in self.model_config.architectures[0]:
+       return AutoModel
    raise NotImplementedError(f"Unknown architecture {self.model_config.architectures}")
Besides, I also found that with the latest transformers, we need to modify the tokenizer in the
tokenizer.context_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.context_image_token)  # for transformers >= 4.52.2
tokenizer.start_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.start_image_token)  # for transformers >= 4.53.2
tokenizer.end_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.end_image_token)  # for transformers >= 4.53.2
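The architecture check in the merger patch above amounts to a substring dispatch on the first entry of `architectures`. Here is a minimal sketch that returns the auto-class name as a string, so it runs without transformers installed. The function name `select_auto_class_name` is hypothetical, and it includes only the two branches visible in the patch; the branches hidden behind `......` are deliberately left out.

```python
def select_auto_class_name(architectures):
    """Map a config's first architecture name to the auto class used for loading."""
    arch = architectures[0]
    if "ForConditionalGeneration" in arch:
        return "AutoModelForVision2Seq"
    if "InternVLChatModel" in arch:
        # Branch added by the patch: InternVL is loaded through the generic AutoModel.
        return "AutoModel"
    raise NotImplementedError(f"Unknown architecture {architectures}")


internvl_loader = select_auto_class_name(["InternVLChatModel"])
gemma3_loader = select_auto_class_name(["Gemma3ForConditionalGeneration"])
```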
Thank you for your work!
Thank you very much. |
I encountered an error while running internvl3: "Only support config type of {'deepseek_v3', 'minicpmo', 'qwen2_5_vl', 'qwen3_moe', 'qwen3', 'minicpmv', 'llama', 'qwen2', 'qwen2_vl'}, but got internvl_chat. MFU will always be zero." Could you please provide me with some guidance? |
I think this is only a warning about printing some information such as "MFU"; in my case, it works well.
How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it. |
This is the error I encountered: |
You could add the following code before |
Thank you for your reply. After I merged the weights using this method, I encountered an error when loading the model using vLLM: ValueError: There is no module or parameter named 'lm_head' in Gemma3ForConditionalGeneration
vllm=0.8.2 transformers=4.52.2
Below is my loading script: if __name__ == '__main__':
rich-junwang left a comment
Thanks for the PR. It would be nice to add some unit tests for the data preprocessing part.
}
"""

PREPROCESSOR_REGISTER.register()
def process_image(self, image, **kwargs):
    if isinstance(image, Image.Image):
        image_obj = image
    elif image.startswith("http://") or image.startswith("https://"):
        # fix memory leak issue while using BytesIO
        with requests.get(image, stream=True) as response:
            response.raise_for_status()
            with BytesIO(response.content) as bio:
                image_obj = copy.deepcopy(Image.open(bio))
    elif image.startswith("file://"):
        image_obj = Image.open(image[7:])
    elif image.startswith("data:image"):
        if "base64," in image:
            _, base64_data = image.split("base64,", 1)
            data = base64.b64decode(base64_data)
            # fix memory leak issue while using BytesIO
            with BytesIO(data) as bio:
                image_obj = copy.deepcopy(Image.open(bio))
    else:
        image_obj = Image.open(image)
    return image_obj.convert("RGB")
Would it be possible to create some kind of mixin class to handle the duplicated code? For example:
class MediaProcessingMixin:
    """Mixin providing common media processing functionality"""

    def _process_image_from_source(self, image, **kwargs):
        """Shared image processing logic"""
        if isinstance(image, Image.Image):
            image_obj = image
        elif image.startswith("http://") or image.startswith("https://"):
            with requests.get(image, stream=True) as response:
                response.raise_for_status()
                with BytesIO(response.content) as bio:
                    image_obj = copy.deepcopy(Image.open(bio))
        elif image.startswith("file://"):
            image_obj = Image.open(image[7:])
        elif image.startswith("data:image"):
            if "base64," in image:
                _, base64_data = image.split("base64,", 1)
                data = base64.b64decode(base64_data)
                with BytesIO(data) as bio:
                    image_obj = copy.deepcopy(Image.open(bio))
        else:
            image_obj = Image.open(image)
        return image_obj.convert("RGB")

# Now each preprocessor can inherit from both the base class AND the mixin
class Gemma3Preprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

class InternVLPreprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

class KimiVLPreprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)
Thanks for your advice, I will solve this.
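To make the shared logic easy to unit test without network access or Pillow, the source-dispatch step can be separated from the actual image decoding. The sketch below illustrates that split under stated assumptions: `ImageSourceMixin`, `resolve_image_source`, and `DemoPreprocessor` are hypothetical names, not part of this PR. Unlike the code above, it raises on a `data:image` URI that has no base64 payload instead of leaving the result undefined.

```python
import base64


class ImageSourceMixin:
    """Hypothetical mixin: classify an image reference and extract its payload."""

    def resolve_image_source(self, image):
        # Anything that is not a string (e.g. an already-loaded image) passes through.
        if not isinstance(image, str):
            return ("object", image)
        if image.startswith(("http://", "https://")):
            return ("url", image)
        if image.startswith("file://"):
            return ("path", image[len("file://"):])
        if image.startswith("data:image"):
            if "base64," not in image:
                raise ValueError("data URI without a base64 payload is not supported")
            _, payload = image.split("base64,", 1)
            return ("bytes", base64.b64decode(payload))
        # Fall back to treating the string as a local path.
        return ("path", image)


class DemoPreprocessor(ImageSourceMixin):
    """Stand-in for a model-specific preprocessor."""


pre = DemoPreprocessor()
encoded = base64.b64encode(b"fake-image-bytes").decode("ascii")
kind, payload = pre.resolve_image_source("data:image/png;base64," + encoded)
```

A model-specific `process_image` could then open the resolved path or bytes and convert to RGB, while tests exercise the dispatch with plain strings.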
# TODO: add more optimizer args into config
if role == "actor" and optim_config is not None:
    from verl.utils.torch_functional import get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup
    optim_strategy = optim_config.get("strategy", "adamw")
This line is not used. Any particular reason we keep this? If not, it would be nice to remove it.
log_gpu_memory_usage("After critic FSDP", logger=None)

optim_strategy = config.optim.get("strategy", "adamw")
Okay, this line is used to set the optimizer strategy. However, I did not include the implementation of this in the current version.
|
Hi @Qsingle, Our team is very interested in using the InternVL3 series and would love to align with your plans. Could you kindly let us know if this PR is still expected to be merged? If so, do you have a rough timeline or schedule for the next steps? If there are no current plans to support InternVL3.5 in the near term, we may consider implementing support for it in our private repository around Q4 2025. To avoid duplicated effort, we’d really appreciate the chance to coordinate with you—if possible, we’d be happy to collaborate or contribute back to the main codebase. Thanks so much for your time and guidance! |
I'm converting this PR to a Draft. After reviewing the initial implementation, I've identified some necessary adjustments to better integrate Multimodal Large-Language Models (MLLMs). I will push updates once the new approach is ready. In the meantime, I welcome any thoughts or discussion on the best way to add MLLM support to the VeRL framework. |
Checklist Before Starting
What does this PR do?
High-Level Design
Specific Changes
API
Usage Example
Test
The training curve for InternVL2.5-1B

The training curve for InternVL3-1B

Additional Info.
Checklist Before Submitting
[BREAKING]to the PR title if it breaks any API.