[Bagel] Fused gate_proj and up_proj #2546

Merged
gcanlin merged 1 commit into vllm-project:main from princepride:fused-gate-proj-up-proj
Apr 7, 2026

Conversation

@princepride
Collaborator


Purpose

While working on LoRA support yesterday, I noticed that the Bagel diffusion MLP does not fuse gate_proj and up_proj. I think we should align it with the other models' implementations.
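For context, here is a minimal sketch of what fusing the two projections means, not the actual vllm-omni code: gate_proj and up_proj are stored as one weight so the SwiGLU MLP runs a single GEMM instead of two. Class and attribute names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedGateUpMLP(nn.Module):
    """SwiGLU MLP with gate_proj and up_proj fused into one matmul (sketch)."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        # One weight of shape (2 * intermediate_size, hidden_size): the first
        # half of the output rows plays the role of gate_proj, the second half
        # the role of up_proj.
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the single projection back into its gate and up halves.
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(F.silu(gate) * up)
```

At load time the checkpoint's separate gate_proj and up_proj weights would be stacked along the output dimension into the fused tensor, which is why the output of the fused layer is numerically identical to running the two projections separately.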

Test Plan

pytest -s -v tests/e2e/offline_inference/test_bagel_img2img.py -m "advanced_model" --run-level "advanced_model"

Test Result

======================================== test session starts ========================================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3.12
cachedir: .pytest_cache
rootdir: /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni
configfile: pyproject.toml
plugins: forked-1.6.0, typeguard-4.5.1, timeout-2.4.0, hydra-core-1.3.2, asyncio-1.3.0, rerunfailures-16.1, shard-0.1.2, anyio-4.12.1
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                    
Running 1 items in this shard: tests/e2e/offline_inference/test_bagel_img2img.py::test_bagel_img2img_shared_memory_connector

tests/e2e/offline_inference/test_bagel_img2img.py::test_bagel_img2img_shared_memory_connector 
=== PRE-TEST GPU CLEANUP ===
GPU cleanup disabled
INFO 04-07 08:32:33 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-07 08:32:33 [vllm.py:790] Asynchronous scheduling is enabled.
--- Running test: test_bagel_img2img_shared_memory_connector
INFO 04-07 08:32:33 [weight_utils.py:50] Using model weights format ['*']
INFO 04-07 08:32:33 [omni_base.py:93] [Omni] Initializing with model ByteDance-Seed/BAGEL-7B-MoT
INFO 04-07 08:32:33 [async_omni_engine.py:264] [AsyncOmniEngine] Initializing with model ByteDance-Seed/BAGEL-7B-MoT
INFO 04-07 08:32:33 [async_omni_engine.py:296] [AsyncOmniEngine] Launching Orchestrator thread with 2 stages
INFO 04-07 08:32:33 [initialization.py:233] Auto-configuring SharedMemoryConnector for edge ('0', '1')
INFO 04-07 08:32:33 [initialization.py:270] Loaded OmniTransferConfig with 1 connector configurations
INFO 04-07 08:32:33 [async_omni_engine.py:514] [AsyncOmniEngine] Initializing stage 0
INFO 04-07 08:32:33 [stage_init_utils.py:229] [stage_init] Stage-0 set runtime devices: 0
INFO 04-07 08:32:33 [async_omni_engine.py:514] [AsyncOmniEngine] Initializing stage 1
WARNING 04-07 08:32:34 [config.py:347] Config format `mistral` is already registered, and will be overwritten by the new parser class `<class 'vllm_omni.model_executor.models.voxtral_tts.configuration_voxtral_tts.VoxtralTTSConfigParser'>`.
INFO 04-07 08:32:34 [config.py:358] Registered config parser `<class 'vllm_omni.model_executor.models.voxtral_tts.configuration_voxtral_tts.VoxtralTTSConfigParser'>` with config format `mistral`
INFO 04-07 08:32:34 [model.py:549] Resolved architecture: OmniBagelForConditionalGeneration
INFO 04-07 08:32:34 [model.py:1678] Using max model len 32768
INFO 04-07 08:32:34 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=32768.
WARNING 04-07 08:32:34 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 04-07 08:32:34 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 04-07 08:32:34 [vllm.py:1025] Cudagraph is disabled under eager mode
WARNING 04-07 08:32:34 [cuda.py:199] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
INFO 04-07 08:32:34 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
INFO 04-07 08:32:34 [async_omni_engine.py:404] [AsyncOmniEngine] Stage 0 engine launch started
INFO 04-07 08:32:34 [stage_init_utils.py:229] [stage_init] Stage-1 set runtime devices: 0
(StageEngineCoreProc pid=502561) INFO 04-07 08:32:41 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='ByteDance-Seed/BAGEL-7B-MoT', speculative_config=None, tokenizer='ByteDance-Seed/BAGEL-7B-MoT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ByteDance-Seed/BAGEL-7B-MoT, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 
'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(StageEngineCoreProc pid=502561) WARNING 04-07 08:32:41 [multiproc_executor.py:1014] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(StageEngineCoreProc pid=502561) INFO 04-07 08:32:41 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.186.186 (local), world_size=1, local_world_size=1
INFO 04-07 08:32:44 [multiproc_executor.py:105] Starting server...
(Worker pid=503073) INFO 04-07 08:32:49 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:38029 backend=nccl
[W407 08:32:49.117971690 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(Worker pid=503073) INFO 04-07 08:32:49 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
INFO 04-07 08:32:52 [diffusion_worker.py:400] Worker 0 created result MessageQueue
INFO 04-07 08:32:52 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-07 08:32:52 [vllm.py:790] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-07 08:32:52 [diffusion_worker.py:133] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-07 08:32:52 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-07 08:32:52 [parallel_state.py:630] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-07 08:32:53 [weight_utils.py:50] Using model weights format ['*']
INFO 04-07 08:32:54 [weight_utils.py:625] No diffusion_pytorch_model.safetensors.index.json found in remote.
Multi-thread loading shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Multi-thread loading shards:  50% Completed | 1/2 [00:00<00:00,  1.79it/s]
(Worker pid=503073) /usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py:51: UserWarning: The following named arguments are not valid for `SiglipImageProcessor.preprocess` and were ignored: 'truncation'
(Worker pid=503073)   return self.preprocess(images, **kwargs)
(Worker pid=503073) INFO 04-07 08:32:58 [gpu_model_runner.py:4735] Starting to load model ByteDance-Seed/BAGEL-7B-MoT...
(Worker pid=503073) INFO 04-07 08:32:59 [vllm.py:790] Asynchronous scheduling is enabled.
(Worker pid=503073) WARNING 04-07 08:32:59 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(Worker pid=503073) WARNING 04-07 08:32:59 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker pid=503073) INFO 04-07 08:32:59 [vllm.py:1025] Cudagraph is disabled under eager mode
(Worker pid=503073) INFO 04-07 08:32:59 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(Worker pid=503073) INFO 04-07 08:32:59 [cuda.py:334] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=503073) WARNING 04-07 08:32:59 [bagel.py:391] Overriding vit_config.num_hidden_layers from 27 to 26 to match the Bagel model checkpoint.
(Worker pid=503073) WARNING 04-07 08:32:59 [bagel.py:397] Setting vit_config.vision_use_head to False as it is not present in the Bagel model checkpoint.
(Worker pid=503073) INFO 04-07 08:32:59 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker pid=503073) INFO 04-07 08:32:59 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
Multi-thread loading shards: 100% Completed | 2/2 [00:05<00:00,  3.00s/it]
Multi-thread loading shards: 100% Completed | 2/2 [00:05<00:00,  2.63s/it]

INFO 04-07 08:32:59 [pipeline_bagel.py:779] BagelPipeline weight filter kept 1466/1467 tensors (shape mismatches seen: 0)
INFO 04-07 08:33:00 [diffusers_loader.py:324] Loading weights took 5.98 seconds
INFO 04-07 08:33:00 [diffusion_model_runner.py:141] Model loading took 26.4738 GiB and 8.454511 seconds
INFO 04-07 08:33:00 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-07 08:33:00 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-07 08:33:00 [diffusion_worker.py:163] Worker 0: Process-scoped GPU memory after model loading: 27.20 GiB.
INFO 04-07 08:33:00 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-07 08:33:00 [diffusion_worker.py:98] Worker 0: Initialization complete.
INFO 04-07 08:33:00 [diffusion_worker.py:538] Worker 0: Scheduler loop started.
INFO 04-07 08:33:00 [diffusion_worker.py:461] Worker 0 ready to receive requests via shared memory
INFO 04-07 08:33:00 [diffusion_engine.py:402] dummy run to warm up the model
INFO 04-07 08:33:00 [kv_transfer_manager.py:143] Initializing OmniConnector with config: {'type': 'SharedMemoryConnector', 'shm_threshold_bytes': 65536, 'role': 'receiver'}
INFO 04-07 08:33:00 [factory.py:46] Created connector: SharedMemoryConnector
INFO 04-07 08:33:00 [kv_transfer_manager.py:397] Wait for KV cache for request dummy_req_id from stage 0 to 1...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 76.65it/s]
(Worker pid=503073) 
(Worker pid=503073) INFO 04-07 08:33:06 [default_loader.py:384] Loading weights took 5.71 seconds
(Worker pid=503073) INFO 04-07 08:33:07 [gpu_model_runner.py:4820] Model loading took 27.37 GiB memory and 7.608572 seconds
(Worker pid=503073) INFO 04-07 08:33:07 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 1 img2img items of the maximum feature size.
(Worker pid=503073) /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/diffusion/models/bagel/bagel_transformer.py:1060: UserWarning: Using a non-tuple sequence for multidimensional indexing is deprecated and will be changed in pytorch 2.9; use x[tuple(seq)] instead of x[seq]. In pytorch 2.9 this will be interpreted as tensor index, x[torch.tensor(seq)], which will result either in an error or a different result (Triggered internally at /pytorch/torch/csrc/autograd/python_variable_indexing.cpp:347.)
(Worker pid=503073)   return self.pos_embed[position_ids]
(Worker pid=503073) INFO 04-07 08:33:10 [base.py:129] Available KV cache memory: 34.44 GiB (process-scoped)
(StageEngineCoreProc pid=502561) INFO 04-07 08:33:10 [kv_cache_utils.py:1319] GPU KV cache size: 644,944 tokens
(StageEngineCoreProc pid=502561) INFO 04-07 08:33:10 [kv_cache_utils.py:1324] Maximum concurrency for 32,768 tokens per request: 19.68x
(Worker pid=503073) 2026-04-07 08:33:10,263 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=503073) 2026-04-07 08:33:11,880 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(StageEngineCoreProc pid=502561) INFO 04-07 08:33:11 [core.py:283] init engine (profile, create kv cache, warmup model) took 4.35 seconds
(StageEngineCoreProc pid=502561) WARNING 04-07 08:33:12 [scheduler.py:180] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(StageEngineCoreProc pid=502561) /usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py:51: UserWarning: The following named arguments are not valid for `SiglipImageProcessor.preprocess` and were ignored: 'truncation'
(StageEngineCoreProc pid=502561)   return self.preprocess(images, **kwargs)
(StageEngineCoreProc pid=502561) INFO 04-07 08:33:21 [vllm.py:790] Asynchronous scheduling is enabled.
(StageEngineCoreProc pid=502561) WARNING 04-07 08:33:21 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(StageEngineCoreProc pid=502561) WARNING 04-07 08:33:21 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(StageEngineCoreProc pid=502561) INFO 04-07 08:33:21 [vllm.py:1025] Cudagraph is disabled under eager mode
(StageEngineCoreProc pid=502561) INFO 04-07 08:33:21 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
INFO 04-07 08:33:21 [async_omni_engine.py:406] [AsyncOmniEngine] Stage 0 engine startup completed
ERROR 04-07 08:33:30 [kv_transfer_manager.py:426] Timeout waiting for KV cache for request dummy_req_id after 30.0s
INFO 04-07 08:33:31 [diffusion_model_runner.py:212] Peak GPU memory (this request): 27.67 GB reserved, 27.04 GB allocated, 0.63 GB pool overhead (2.3%)
INFO 04-07 08:33:31 [stage_diffusion_proc.py:66] StageDiffusionProc initialized with model: ByteDance-Seed/BAGEL-7B-MoT
INFO 04-07 08:33:31 [stage_diffusion_client.py:84] [StageDiffusionClient] Stage-1 initialized (batch_size=1)
INFO 04-07 08:33:31 [async_omni_engine.py:544] [AsyncOmniEngine] Stage 1 initialized (diffusion, batch_size=1)
INFO 04-07 08:33:31 [stage_engine_core_client.py:80] [StageEngineCoreClient] Stage-0 initializing EngineCore
INFO 04-07 08:33:31 [stage_engine_core_client.py:107] [StageEngineCoreClient] Stage-0 EngineCore running
INFO 04-07 08:33:41 [async_omni_engine.py:482] [AsyncOmniEngine] Stage 0 initialized
INFO 04-07 08:33:41 [orchestrator.py:158] [Orchestrator] Starting event loop
INFO 04-07 08:33:41 [async_omni_engine.py:338] [AsyncOmniEngine] Orchestrator ready with 2 stages
INFO 04-07 08:33:41 [omni_base.py:106] [Omni] AsyncOmniEngine initialized in 68.21 seconds
INFO 04-07 08:33:41 [omni_base.py:121] [Omni] Initialized with 2 stages for model ByteDance-Seed/BAGEL-7B-MoT
WARNING 04-07 08:33:41 [utils.py:485] Invalid output modality: img2img, ignoring it
WARNING 04-07 08:33:41 [input_processor.py:235] Passing raw prompts to InputProcessor is deprecated and will be removed in v0.18. You should instead pass the outputs of Renderer.render_cmpl() or Renderer.render_chat().
INFO 04-07 08:33:45 [orchestrator.py:621] [Orchestrator] _handle_add_request: stage=0 req=0_407a950d-8839-403b-a630-c12ff7b7e6a7 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
INFO 04-07 08:33:45 [stage_engine_core_client.py:116] [StageEngineCoreClient] Stage-0 adding request: 0_407a950d-8839-403b-a630-c12ff7b7e6a7
(Worker pid=503073) WARNING 04-07 08:33:45 [gpu_model_runner.py:369] additional_information on request data is deprecated, use model_intermediate_buffer
INFO 04-07 08:33:46 [stage_engine_core_client.py:116] [StageEngineCoreClient] Stage-0 adding request: 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_text
INFO 04-07 08:33:46 [async_omni_engine.py:829] [AsyncOmniEngine] CFG expansion for req 0_407a950d-8839-403b-a630-c12ff7b7e6a7: 2 companions
INFO 04-07 08:33:46 [orchestrator.py:785] [Orchestrator] CFG companion submitted: 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_text (role=cfg_text, parent=0_407a950d-8839-403b-a630-c12ff7b7e6a7)
INFO 04-07 08:33:46 [stage_engine_core_client.py:116] [StageEngineCoreClient] Stage-0 adding request: 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_img
INFO 04-07 08:33:46 [orchestrator.py:785] [Orchestrator] CFG companion submitted: 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_img (role=cfg_img, parent=0_407a950d-8839-403b-a630-c12ff7b7e6a7)
Processed prompts:   0%|                                                       | 0/1 [00:00<?, ?it/s]
(Worker pid=503073) /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/diffusion/models/bagel/bagel_transformer.py:1060: UserWarning: Using a non-tuple sequence for multidimensional indexing is deprecated and will be changed in pytorch 2.9; use x[tuple(seq)] instead of x[seq]. In pytorch 2.9 this will be interpreted as tensor index, x[torch.tensor(seq)], which will result either in an error or a different result (Triggered internally at /pytorch/torch/csrc/autograd/python_variable_indexing.cpp:347.)
(Worker pid=503073)   return self.pos_embed[position_ids]
(Worker pid=503073) INFO 04-07 08:33:46 [kv_transfer_manager.py:143] Initializing OmniConnector with config: {'type': 'SharedMemoryConnector', 'shm_threshold_bytes': 65536, 'role': 'sender'}
(Worker pid=503073) INFO 04-07 08:33:46 [factory.py:46] Created connector: SharedMemoryConnector
(Worker pid=503073) INFO 04-07 08:33:47 [kv_transfer_manager.py:321] KV transfer OK: 0_407a950d-8839-403b-a630-c12ff7b7e6a7, 435875836 bytes
(Worker pid=503073) INFO 04-07 08:33:49 [kv_transfer_manager.py:321] KV transfer OK: 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_text, 435532161 bytes
INFO 04-07 08:33:49 [orchestrator.py:519] [Orchestrator] Attaching cfg_kv_request_ids={'cfg_text': '0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_text', 'cfg_img': '0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_img'} to req 0_407a950d-8839-403b-a630-c12ff7b7e6a7
(Worker pid=503073) INFO 04-07 08:33:49 [kv_transfer_manager.py:321] KV transfer OK: 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_img, 461563 bytes
INFO 04-07 08:33:49 [kv_transfer_manager.py:397] Wait for KV cache for request 0_407a950d-8839-403b-a630-c12ff7b7e6a7 from stage 0 to 1...
INFO 04-07 08:33:49 [kv_transfer_manager.py:410] Successfully received KV cache for 0_407a950d-8839-403b-a630-c12ff7b7e6a7, 435875836 bytes
INFO 04-07 08:33:49 [kv_transfer_manager.py:397] Wait for KV cache for request 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_text from stage 0 to 1...
INFO 04-07 08:33:50 [kv_transfer_manager.py:410] Successfully received KV cache for 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_text, 435532161 bytes
INFO 04-07 08:33:50 [bagel.py:261] Collected CFG KV cache for role=cfg_text, rid=0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_text, size=435532161 bytes
INFO 04-07 08:33:50 [kv_transfer_manager.py:397] Wait for KV cache for request 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_img from stage 0 to 1...
INFO 04-07 08:33:50 [kv_transfer_manager.py:410] Successfully received KV cache for 0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_img, 461563 bytes
INFO 04-07 08:33:50 [bagel.py:261] Collected CFG KV cache for role=cfg_img, rid=0_407a950d-8839-403b-a630-c12ff7b7e6a7__cfg_img, size=461563 bytes
INFO 04-07 08:33:50 [kv_transfer_manager.py:529] Applied CFG KV caches: ['cfg_text_past_key_values', 'cfg_text_kv_metadata', 'cfg_img_past_key_values', 'cfg_img_kv_metadata']
INFO 04-07 08:33:50 [pipeline_bagel.py:353] Using injected KV Cache (direct)
INFO 04-07 08:33:50 [pipeline_bagel.py:367] CFG enabled with multi-KV: using injected cfg_text KV Cache
INFO 04-07 08:33:54 [diffusion_model_runner.py:212] Peak GPU memory (this request): 31.32 GB reserved, 29.62 GB allocated, 1.70 GB pool overhead (5.4%)
INFO 04-07 08:33:54 [diffusion_engine.py:119] Generation completed successfully.
INFO 04-07 08:33:54 [diffusion_engine.py:152] Post-processing completed in 0.0000 seconds
INFO 04-07 08:33:54 [diffusion_engine.py:155] DiffusionEngine.step breakdown: preprocess=0.00 ms, add_req_and_wait=5308.16 ms, postprocess=0.00 ms, total=5308.32 ms
Processed prompts: 100%|███████████████████████████████████████████████| 1/1 [00:08<00:00,  8.56s/it]
INFO 04-07 08:33:54 [omni_base.py:162] [Summary] {}
Processed prompts: 100%|███████████████████████████████████████████████| 1/1 [00:08<00:00,  8.56s/it]
INFO 04-07 08:33:54 [omni_base.py:290] [Omni] Shutting down
INFO 04-07 08:33:54 [async_omni_engine.py:1288] [AsyncOmniEngine] Shutting down Orchestrator
INFO 04-07 08:33:54 [orchestrator.py:212] [Orchestrator] Received shutdown signal
INFO 04-07 08:33:54 [orchestrator.py:885] [Orchestrator] Shutting down all stages
(Worker pid=503073) INFO 04-07 08:33:54 [multiproc_executor.py:764] Parent process exited, terminating worker queues
(Worker pid=503073) INFO 04-07 08:33:54 [multiproc_executor.py:859] WorkerProc shutting down.
INFO 04-07 08:33:56 [orchestrator.py:889] [Orchestrator] Stage 0 shut down
INFO 04-07 08:33:56 [diffusion_worker.py:490] Worker 0: Received shutdown message
INFO 04-07 08:33:56 [diffusion_worker.py:511] event loop terminated.
INFO 04-07 08:33:56 [diffusion_worker.py:546] Worker 0: Shutdown complete.
INFO 04-07 08:33:59 [orchestrator.py:889] [Orchestrator] Stage 1 shut down
PASSED
GPU cleanup disabled


========================================= warnings summary ==========================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../../../../usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/e2e/offline_inference/test_bagel_img2img.py::test_bagel_img2img_shared_memory_connector
  /usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py:51: UserWarning: The following named arguments are not valid for `SiglipImageProcessor.preprocess` and were ignored: 'truncation'
    return self.preprocess(images, **kwargs)

tests/e2e/offline_inference/test_bagel_img2img.py::test_bagel_img2img_shared_memory_connector
tests/e2e/offline_inference/test_bagel_img2img.py::test_bagel_img2img_shared_memory_connector
  /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/distributed/omni_connectors/utils/serialization.py:290: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
    return Image.fromarray(arr, mode=mode)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================= 1 passed, 19 warnings in 86.36s (0:01:26) =============================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute


Signed-off-by: princepride <wangzhipeng628@gmail.com>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Collaborator

@gcanlin gcanlin left a comment


LGTM

@gcanlin gcanlin added the `ready label to trigger buildkite CI` label Apr 7, 2026
@gcanlin gcanlin enabled auto-merge (squash) April 7, 2026 09:41
@gcanlin gcanlin merged commit 408365f into vllm-project:main Apr 7, 2026
7 of 8 checks passed
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
bob-021206 pushed a commit to jasonlee-1024/vllm-omni that referenced this pull request Apr 21, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: bob-021206 <binyan_github@163.com>