[2/3] Refactor InternVL-based processors #37324
Conversation
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Code Review
This pull request introduces a significant and well-executed refactoring of the InternVL-based processors. By centralizing common logic into a new InternVLProcessor and its related helper classes, the changes greatly improve code maintainability and reduce duplication. The new structure is more modular and easier to follow. I've identified one minor issue where a method is called redundantly, which I've commented on. Overall, this is a high-quality contribution.
/gemini review
Code Review
This pull request refactors the InternVL-based processors by splitting the logic into separate *ImageProcessor and *VideoProcessor classes. This is a good architectural improvement that centralizes common logic and removes redundant processor definitions for models like Eagle2.5 and SkyworkR1V. The changes are extensive but consistent across multiple model files. I've found one critical bug in the new InternVLProcessor that could lead to a runtime error.
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
/gemini review
Code Review
This pull request refactors the processor logic for several InternVL-based models, moving from model-specific processors to a more generic and reusable InternVLProcessor framework. This is a significant improvement for code maintainability and consistency. The changes are well-executed across multiple models. I've found one critical bug in the NVLMProcessor logic that could lead to an incorrect number of image tokens being generated.
    if video_token not in self.get_tokenizer().get_vocab():
        return None

    return video_token
Is this still used? I think we have unified all video tokens to use video_token = "<video>" in new processor?
Actually this should be effectively ctx_video_token (the token after replacement), let me rename it. It is not to be confused with the <video> placeholder (before replacement).
    while "<placeholder>" in new_prompt:
        replace_str = replace_strings.pop(0)
        new_prompt = new_prompt.replace("<placeholder>", replace_str, 1)
Since image_token and video_token are both different from ctx_image_token, I think we can directly replace the image/video token instead of using "<placeholder>" as an intermediary.
Nemotron VL uses <image> as ctx_image_token so we need to use a different one.
I can help check NVLM_D and Skywork tonight.
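The point of the discussion above can be demonstrated in isolation. The sketch below (a hypothetical helper, not the exact vLLM implementation) shows the neutral-intermediary scheme: because some models such as Nemotron VL use "<image>" itself as the context image token, replacing "<image>" directly could re-match text that was just substituted in, whereas a "<placeholder>" marker that never appears in any replacement string is always safe.

```python
# Sketch of the replacement scheme under discussion (illustrative only).
# "<placeholder>" is chosen so it cannot collide with any model's
# ctx_image_token / ctx_video_token, unlike "<image>" or "<video>".
def fill_placeholders(prompt, replace_strings):
    new_prompt = prompt
    queue = list(replace_strings)  # copy so the caller's list is untouched
    while "<placeholder>" in new_prompt:
        # consume one replacement per placeholder, left to right
        new_prompt = new_prompt.replace("<placeholder>", queue.pop(0), 1)
    return new_prompt
```

For example, `fill_placeholders("a <placeholder> b <placeholder>", ["X", "Y"])` yields `"a X b Y"`.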
Isotr0py
left a comment
Both NVLM_D and Skywork work, LGTM.
$ python examples/offline_inference/vision_language.py -m NVLM_D
INFO 03-18 14:06:53 [utils.py:233] non-default args: {'trust_remote_code': True, 'max_model_len': 4096, 'tensor_parallel_size': 4, 'limit_mm_per_prompt': {'image': 1, 'video': 0, 'audio': 0, 'vision_chunk': 0}, 'model': 'nvidia/NVLM-D-72B'}
INFO 03-18 14:06:54 [model.py:533] Resolved architecture: NVLM_D
INFO 03-18 14:06:54 [model.py:1582] Using max model len 4096
INFO 03-18 14:06:54 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 03-18 14:06:54 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=2436208) INFO 03-18 14:06:58 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev60+g17c47fb86) with config: model='nvidia/NVLM-D-72B', speculative_config=None, tokenizer='nvidia/NVLM-D-72B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/NVLM-D-72B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
...
Rendering prompts: 100%|██████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 15.86it/s]
Processed prompts: 100%|█████████████████| 4/4 [00:01<00:00, 2.06it/s, est. speed input: 3841.81 toks/s, output: 131.62 toks/s]
--------------------------------------------------
The image portrays a cherry blossom tree in full bloom, with numerous pink flowers adorning its branches. The tree is positioned in front of a tall, white tower, which serves as a backdrop. The sky is clear and blue, providing a vibrant contrast to the pink blossoms and the tower. The cherry blossom tree is
--------------------------------------------------
The image features a tall, white tower with a distinctive design, surrounded by cherry blossom trees in full bloom. The cherry blossoms, with their pink flowers, create a beautiful contrast against the blue sky. The tower, known as the Tokyo Tower, is a famous landmark in Japan, often associated with the cherry blossom season
--------------------------------------------------
The image depicts a tall, white tower with a lattice structure, partially obscured by cherry blossom trees in full bloom. The cherry blossoms are pink and cover most of the frame, with the blue sky serving as the background. The tower, which is the focal point of the image, is framed by the cherry blossom branches
--------------------------------------------------
The image depicts a cherry blossom tree in full bloom, with numerous pink flowers adorning its branches. The blossoms are in various stages of blooming, creating a vibrant and picturesque scene. The tree's branches extend across the frame, with some reaching towards the top of the image. The sky in the background is a
--------------------------------------------------
$ python examples/offline_inference/vision_language.py -m skywork_chat
INFO 03-18 14:14:45 [model.py:533] Resolved architecture: SkyworkR1VChatModel
INFO 03-18 14:14:45 [model.py:1582] Using max model len 4096
INFO 03-18 14:14:45 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 03-18 14:14:45 [vllm.py:754] Asynchronous scheduling is enabled.
generation_config.json: 100%|██████████████████████████████████████████████████████████████████| 181/181 [00:00<00:00, 2.59MB/s]
(EngineCore pid=2441252) INFO 03-18 14:14:47 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev60+g17c47fb86) with config: model='Skywork/Skywork-R1V-38B', speculative_config=None, tokenizer='Skywork/Skywork-R1V-38B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Skywork/Skywork-R1V-38B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
...
--------------------------------------------------
Alright, let's take a look at this image. It seems to be a beautiful scene with cherry blossoms in the foreground and a tall tower in the background. The cherry blossoms are in full bloom, with their delicate pink flowers creating a soft, ethereal atmosphere. The tower appears to be a well-known landmark
--------------------------------------------------
Alright, so I'm looking at this image, and I want to figure out what's going on here. Let me start by breaking it down. The image is a close-up of cherry blossoms, which are in full bloom. The flowers are pink and delicate, and they're spread out across the branches of a
--------------------------------------------------
Alright, so I'm looking at this image, and I want to figure out what it's showing. Let me start by breaking it down. The first thing I notice is the abundance of pink flowers. They seem to be cherry blossoms, which are pretty common in springtime, especially in places like Japan. The
--------------------------------------------------
Alright, so I'm looking at this image, and I need to figure out what's going on. Let me start by breaking it down. The image is a close-up of a cherry blossom tree with lots of pink flowers. The branches are in the foreground, and through them, I can see a tall, cylindrical
--------------------------------------------------
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
Follow-up to #37289
- Split the processors into *ImageProcessor and *VideoProcessor.
- Initialize processors via self.ctx.init_processor, in order to avoid extra kwargs causing errors.

Note: Nemotron Parse and Nano Nemotron VL's processor will be handled in a separate PR.
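The "avoid extra kwargs causing errors" point can be illustrated with a small sketch. The helper name and behavior below are assumptions for illustration, not vLLM's actual self.ctx.init_processor: the idea is simply to filter a kwargs dict down to the parameters a processor's constructor accepts before instantiating it, so unrelated keys do not raise a TypeError.

```python
import inspect

# Hypothetical illustration (not the vLLM implementation): construct a
# processor class, silently dropping kwargs its __init__ does not accept.
def init_processor(processor_cls, **kwargs):
    # signature(cls) reflects __init__ with `self` already excluded
    params = inspect.signature(processor_cls).parameters
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return processor_cls(**accepted)


class DemoProcessor:
    """Toy processor with a single accepted kwarg, for demonstration."""

    def __init__(self, size=448):
        self.size = size
```

With this, `init_processor(DemoProcessor, size=336, unknown_kwarg=True)` succeeds and sets `size` to 336, while a direct `DemoProcessor(unknown_kwarg=True)` would raise.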
Test Plan
- python examples/offline_inference/vision_language.py (except for NVLM_D and Skywork, which are too big to load in memory)
- python examples/offline_inference/vision_language_multi_image.py for InternVL

Test Result
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.