
[2/3] Refactor InternVL-based processors#37324

Merged
Isotr0py merged 13 commits into vllm-project:main from DarkLight1337:refactor-internvl
Mar 18, 2026

Conversation

@DarkLight1337 DarkLight1337 commented Mar 17, 2026

Purpose

Follow-up to #37289

  • Split the processing logic into *ImageProcessor and *VideoProcessor classes.
  • Remove the unnecessary processor definitions for Eagle2.5 and SkyworkR1V.
  • Initialize the processor directly instead of via self.ctx.init_processor, so that extra kwargs cannot cause errors.

Note: Nemotron Parse and Nano Nemotron VL's processor will be handled in a separate PR.
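
As a rough illustration of the split and the direct-initialization point above (all class names, method bodies, and return shapes here are hypothetical stand-ins, not vLLM's actual API):

```python
# Hypothetical sketch only: mirrors the *ImageProcessor / *VideoProcessor
# split described in this PR, but the bodies are illustrative stand-ins.

class InternVLImageProcessor:
    def __call__(self, images: list[str]) -> dict:
        # Real code would tile each image and emit pixel values.
        return {"pixel_values": [f"tiles({img})" for img in images]}

class InternVLVideoProcessor:
    def __call__(self, videos: list[str]) -> dict:
        # Real code would sample frames and emit per-frame pixel values.
        return {"pixel_values_videos": [f"frames({vid})" for vid in videos]}

class InternVLProcessor:
    """Shared entry point that delegates to the modality-specific helpers."""

    def __init__(self):
        # Construct the sub-processors directly (rather than through a
        # generic self.ctx.init_processor), so unrelated extra kwargs
        # cannot be forwarded to them and raise errors.
        self.image_processor = InternVLImageProcessor()
        self.video_processor = InternVLVideoProcessor()

    def __call__(self, images=None, videos=None) -> dict:
        out: dict = {}
        if images:
            out.update(self.image_processor(images))
        if videos:
            out.update(self.video_processor(videos))
        return out
```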

Test Plan

  • Checked python examples/offline_inference/vision_language.py (except for NVLM_D and Skywork which are too big to load in memory)
  • Checked python examples/offline_inference/vision_language_multi_image.py for InternVL

Test Result




@gemini-code-assist (bot) left a comment:


Code Review

This pull request introduces a significant and well-executed refactoring of the InternVL-based processors. By centralizing common logic into a new InternVLProcessor and its related helper classes, the changes greatly improve code maintainability and reduce duplication. The new structure is more modular and easier to follow. I've identified one minor issue where a method is called redundantly, which I've commented on. Overall, this is a high-quality contribution.

@DarkLight1337 (Member, Author):

/gemini review

@gemini-code-assist (bot) left a comment:

Code Review

This pull request refactors the InternVL-based processors by splitting the logic into separate *ImageProcessor and *VideoProcessor classes. This is a good architectural improvement that centralizes common logic and removes redundant processor definitions for models like Eagle2.5 and SkyworkR1V. The changes are extensive but consistent across multiple model files. I've found one critical bug in the new InternVLProcessor that could lead to a runtime error.

@mergify bot added the "qwen" (Related to Qwen models) label on Mar 18, 2026
@DarkLight1337 added the "ready" (ONLY add when PR is ready to merge/full CI is needed) label on Mar 18, 2026
@DarkLight1337 DarkLight1337 marked this pull request as ready for review March 18, 2026 04:24
@DarkLight1337 DarkLight1337 requested a review from Isotr0py March 18, 2026 04:24
@DarkLight1337 (Member, Author):
/gemini review

@DarkLight1337 DarkLight1337 changed the title [2/2] Refactor InternVL-based processors [2/3] Refactor InternVL-based processors Mar 18, 2026
@gemini-code-assist (bot) left a comment:

Code Review

This pull request refactors the processor logic for several InternVL-based models, moving from model-specific processors to a more generic and reusable InternVLProcessor framework. This is a significant improvement for code maintainability and consistency. The changes are well-executed across multiple models. I've found one critical bug in the NVLMProcessor logic that could lead to an incorrect number of image tokens being generated.

@mergify bot added the "multi-modality" (Related to multi-modality, #4194) label on Mar 18, 2026
if video_token not in self.get_tokenizer().get_vocab():
return None

return video_token
@Isotr0py (Member) commented:

Is this still used? I think we have unified all video tokens to use video_token = "<video>" in the new processor?

@DarkLight1337 (Member, Author) replied Mar 18, 2026:

Actually, this should effectively be ctx_video_token (the token after replacement); let me rename it. It is not to be confused with the <video> placeholder (before replacement).


while "<placeholder>" in new_prompt:
replace_str = replace_strings.pop(0)
new_prompt = new_prompt.replace("<placeholder>", replace_str, 1)
@Isotr0py (Member) commented:

Since image_token and video_token are both different from ctx_image_token, I think we can directly replace the image/video token instead of using "<placeholder>" as an intermediary.

@DarkLight1337 (Member, Author) replied:

Nemotron VL uses <image> as ctx_image_token so we need to use a different one.
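
A toy example of why the neutral intermediary is needed when a replacement string itself contains the token being replaced, as with Nemotron VL using <image> as its context token (function names and strings here are illustrative, not the PR's actual code):

```python
def expand_direct(prompt: str, token: str, reps: list[str]) -> str:
    # Naive approach: replace the token one occurrence at a time.
    # If an inserted string also contains `token`, the next iteration
    # matches the freshly inserted text instead of the next original token.
    for r in reps:
        prompt = prompt.replace(token, r, 1)
    return prompt

def expand_with_placeholder(prompt: str, token: str, reps: list[str]) -> str:
    # First swap every user-facing token for a neutral marker, then fill
    # the markers in order. Inserted text can never be matched again.
    prompt = prompt.replace(token, "<placeholder>")
    reps = list(reps)
    while "<placeholder>" in prompt:
        prompt = prompt.replace("<placeholder>", reps.pop(0), 1)
    return prompt

prompt = "<image> <image>"
reps = ["A<image>", "B"]
# Direct replacement clobbers the first expansion's inserted token:
assert expand_direct(prompt, "<image>", reps) == "AB <image>"
# The placeholder intermediary expands each original token exactly once:
assert expand_with_placeholder(prompt, "<image>", reps) == "A<image> B"
```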

@Isotr0py (Member) commented:

> Checked python examples/offline_inference/vision_language.py (except for NVLM_D and Skywork which are too big to load in memory)

I can help check NVLM_D and Skywork tonight.

@Isotr0py (Member) left a comment:


Both NVLM_D and Skywork work, LGTM.

$ python examples/offline_inference/vision_language.py -m NVLM_D
INFO 03-18 14:06:53 [utils.py:233] non-default args: {'trust_remote_code': True, 'max_model_len': 4096, 'tensor_parallel_size': 4, 'limit_mm_per_prompt': {'image': 1, 'video': 0, 'audio': 0, 'vision_chunk': 0}, 'model': 'nvidia/NVLM-D-72B'}
INFO 03-18 14:06:54 [model.py:533] Resolved architecture: NVLM_D
INFO 03-18 14:06:54 [model.py:1582] Using max model len 4096
INFO 03-18 14:06:54 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 03-18 14:06:54 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=2436208) INFO 03-18 14:06:58 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev60+g17c47fb86) with config: model='nvidia/NVLM-D-72B', speculative_config=None, tokenizer='nvidia/NVLM-D-72B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/NVLM-D-72B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
...
Rendering prompts: 100%|██████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 15.86it/s]
Processed prompts: 100%|█████████████████| 4/4 [00:01<00:00,  2.06it/s, est. speed input: 3841.81 toks/s, output: 131.62 toks/s]
--------------------------------------------------
The image portrays a cherry blossom tree in full bloom, with numerous pink flowers adorning its branches. The tree is positioned in front of a tall, white tower, which serves as a backdrop. The sky is clear and blue, providing a vibrant contrast to the pink blossoms and the tower. The cherry blossom tree is
--------------------------------------------------
The image features a tall, white tower with a distinctive design, surrounded by cherry blossom trees in full bloom. The cherry blossoms, with their pink flowers, create a beautiful contrast against the blue sky. The tower, known as the Tokyo Tower, is a famous landmark in Japan, often associated with the cherry blossom season
--------------------------------------------------
The image depicts a tall, white tower with a lattice structure, partially obscured by cherry blossom trees in full bloom. The cherry blossoms are pink and cover most of the frame, with the blue sky serving as the background. The tower, which is the focal point of the image, is framed by the cherry blossom branches
--------------------------------------------------
The image depicts a cherry blossom tree in full bloom, with numerous pink flowers adorning its branches. The blossoms are in various stages of blooming, creating a vibrant and picturesque scene. The tree's branches extend across the frame, with some reaching towards the top of the image. The sky in the background is a
--------------------------------------------------
$ python examples/offline_inference/vision_language.py -m skywork_chat
INFO 03-18 14:14:45 [model.py:533] Resolved architecture: SkyworkR1VChatModel
INFO 03-18 14:14:45 [model.py:1582] Using max model len 4096
INFO 03-18 14:14:45 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 03-18 14:14:45 [vllm.py:754] Asynchronous scheduling is enabled.
generation_config.json: 100%|██████████████████████████████████████████████████████████████████| 181/181 [00:00<00:00, 2.59MB/s]
(EngineCore pid=2441252) INFO 03-18 14:14:47 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev60+g17c47fb86) with config: model='Skywork/Skywork-R1V-38B', speculative_config=None, tokenizer='Skywork/Skywork-R1V-38B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Skywork/Skywork-R1V-38B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
...
--------------------------------------------------
Alright, let's take a look at this image. It seems to be a beautiful scene with cherry blossoms in the foreground and a tall tower in the background. The cherry blossoms are in full bloom, with their delicate pink flowers creating a soft, ethereal atmosphere. The tower appears to be a well-known landmark
--------------------------------------------------
Alright, so I'm looking at this image, and I want to figure out what's going on here. Let me start by breaking it down. The image is a close-up of cherry blossoms, which are in full bloom. The flowers are pink and delicate, and they're spread out across the branches of a
--------------------------------------------------
Alright, so I'm looking at this image, and I want to figure out what it's showing. Let me start by breaking it down. The first thing I notice is the abundance of pink flowers. They seem to be cherry blossoms, which are pretty common in springtime, especially in places like Japan. The
--------------------------------------------------
Alright, so I'm looking at this image, and I need to figure out what's going on. Let me start by breaking it down. The image is a close-up of a cherry blossom tree with lots of pink flowers. The branches are in the foreground, and through them, I can see a tall, cylindrical
--------------------------------------------------

@Isotr0py Isotr0py merged commit 99267c2 into vllm-project:main Mar 18, 2026
57 checks passed
@DarkLight1337 DarkLight1337 deleted the refactor-internvl branch March 18, 2026 14:42
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026

Labels

multi-modality (Related to multi-modality, #4194), qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding