Closed · 194 commits
c9388c7
[V0 Deprecation][Models] Remove all V0 condition for mm embeddings me…
Isotr0py Sep 29, 2025
86502dc
[Misc] Remove more `get_input_embeddings_v0` (#25857)
DarkLight1337 Sep 29, 2025
219bc0b
refactor - pass tokens_per_frame and num_frames to compute_retained_t…
tomeras91 Sep 29, 2025
23a205f
WIP - commit with all commented code
tomeras91 Sep 30, 2025
e8fd68a
Revert "WIP - commit with all commented code"
tomeras91 Sep 30, 2025
859e9f1
Manually deal with video prompt replacement instead of relying on vLL…
tomeras91 Sep 30, 2025
69ea5b8
support multiple videos in a batch (and better typehints)
tomeras91 Sep 30, 2025
0adec4b
Add EVS TODOs
tomeras91 Sep 30, 2025
d1a4d41
access tokenizer only when needed instead of saving it as attribute o…
tomeras91 Sep 30, 2025
a7417d0
Fix issue with using top-left tile instead of thumbnail tile
BloodAxe Oct 3, 2025
20fbfd7
Seemingly working version of Nano 2 with EVS
BloodAxe Oct 3, 2025
46a2847
remove debug script
tomeras91 Oct 5, 2025
f9bf392
update to latest deepgemm for dsv3.2 (#25871)
youkaichao Sep 29, 2025
8003828
[Bugfix] Fix requirements paths in install instructions (#25827)
yingjun-mou Sep 29, 2025
d4dc907
[Model][Bugfix] Fix issues in MiDashengLM implementation for quantize…
zhoukezi Sep 29, 2025
0261d11
[torch.compile] serialize cudagraph_mode as its enum name instead of …
ZJY0516 Sep 29, 2025
55327c7
[Nixl][P/D] Add cuda2cpu support (HD->DH transfer) (#24690)
chenxi-yang Sep 29, 2025
c693625
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (…
rahul-tuli Sep 29, 2025
daaf453
[CI/Build] Include Transformers backend test in nightly transformers …
Isotr0py Sep 29, 2025
671b93c
[Model] Remove MotifForCausalLM (#25866)
jeejeelee Sep 29, 2025
577110c
[Bugfix] Use correct key "ignore" for config.json non-quantized layer…
leejnau Sep 29, 2025
21face0
[BugFix][torch.compile] KV scale calculation issues with FP8 quantiza…
adabeyta Sep 29, 2025
2e38ecf
[Doc] Add documentation for vLLM continuous benchmarking and profilin…
namanlalitnyu Sep 29, 2025
776946a
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libn…
gshtras Sep 29, 2025
1476b1c
[Kernel] Chunk-aligned mamba2 (#24683)
tdoublep Sep 29, 2025
7d60078
[Doc] Polish example for torchrun dp (#25899)
zhuohan123 Sep 29, 2025
7baeed5
[NIXL] Increase default KV block eviction timeout on P (#25897)
NickLucche Sep 29, 2025
4021798
[V0 Deprecation] Remove `vllm.worker` and update according imports (#…
aarnphm Sep 29, 2025
1572926
Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT…
qthequartermasterman Sep 30, 2025
4256477
[Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (#25909)
yewentao256 Sep 30, 2025
ba97f4f
[Benchmark] Support benchmark throughput for external launcher DP (#2…
zhuohan123 Sep 30, 2025
bff1764
Move`VllmConfig` from `config/__init__.py` to `config/vllm.py` (#25271)
hmellor Sep 30, 2025
2a2c0b5
[BugFix] Fix DP/EP hang (#25906)
LucasWilkinson Sep 30, 2025
65944e5
[BugFix] Pass config_format via try_get_generation_config (#25912)
acisseJZhong Sep 30, 2025
eae25d9
[Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorr…
zhoukezi Sep 30, 2025
1fdef63
[Bugfix]: Clean up chunked prefill logging when using whisper (#25075)
simondanielsson Sep 30, 2025
d2195ab
[New Model] DeepSeek-V3.2 (Rebased to Main) (#25896)
zyongye Sep 30, 2025
733e515
[Doc] Add Cambricon MLU support (#25942)
a120092009 Sep 30, 2025
893c7f8
Updated TRL integration docs (#25684)
sergiopaniego Sep 30, 2025
1bdb001
[Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (#25936)
CSWYF3634076 Sep 30, 2025
0d64369
[Model] Move `vision_feature_select_strategy` into `resolve_visual_en…
DarkLight1337 Sep 30, 2025
fd3f60f
[perf] Use CPU tensor to reduce GPU->CPU sync (#25884)
lhtin Sep 30, 2025
6767f8c
[NIXL] Add support for MLA caches with different latent dim (#25902)
NickLucche Sep 30, 2025
f540576
[CI] Move applicable tests to CPU (#24080)
rzabarazesh Sep 30, 2025
62b3535
[Fix] Improve CPU backend compatibility for RISC-V (#25816)
ihb2032 Sep 30, 2025
23fcf23
[Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 a…
Josephasafg Sep 30, 2025
74f323b
Add Hugging Face Inference Endpoints guide to Deployment docs (#25886)
sergiopaniego Sep 30, 2025
175c835
[Bugfix][Model] Fix inference for Hunyuan dense models (#25354)
Anionex Sep 30, 2025
c058872
[Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (#2…
pavanimajety Sep 30, 2025
f960f1e
[Bugfix] Token type and position embeddings fail to be applied to `in…
DarkLight1337 Sep 30, 2025
2432e04
[bugfix][deepseek] fix flashmla kernel selection (#25956)
youkaichao Sep 30, 2025
1f9d23d
[Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute…
yewentao256 Sep 30, 2025
b85d33b
[Doc] Improve MM Pooling model documentation (#25966)
DarkLight1337 Sep 30, 2025
e6681b4
[Docs] Add moe kernel features doc (#25297)
bnellnm Sep 30, 2025
680223f
OffloadingConnector: Fix GPU block tracking bug (#25856)
orozery Sep 30, 2025
055680f
[Llama4] [multimodal] Fix misplaced dtype cast of `cos_sin_cache` in …
cjackal Sep 30, 2025
9be6890
[Bench] Add DeepSeekV32 to MoE benchmark (#25962)
jeejeelee Sep 30, 2025
369f144
[V1] [P/D] Add Support for KV Load Failure Recovery (#19330)
sdavidbd Sep 30, 2025
8e00d2e
Add explicit pooling classes for the Transformers backend (#25322)
hmellor Sep 30, 2025
8071c5a
[Docs] Remove API Reference from search index (#25949)
hmellor Sep 30, 2025
edf0b6e
[gpt-oss] use vLLM instead of openai types for streaming (#25186)
qandrew Sep 30, 2025
4ba3705
[Misc] Make EP kernels install script support uv (#25785)
LucasWilkinson Sep 30, 2025
670382a
[Model] MTP fallback to eager for DeepSeek v32 (#25982)
luccafong Oct 1, 2025
1212587
Update launch_bounds_utils.h for correct compile on Multiple Cuda Arc…
DrStone1971 Oct 1, 2025
51ee4c9
[Log] Optimize Log for FP8MOE (#25709)
yewentao256 Oct 1, 2025
73e138a
Fix INT8 quantization error on Blackwell GPUs (SM100+) (#25935)
certainly-param Oct 1, 2025
c71f8ef
[MM] Add text-only mode for Qwen3-VL (#26000)
ywang96 Oct 1, 2025
d26bae4
[Bugfix] Fix `__syncwarp` on ROCM (#25996)
zhewenl Oct 1, 2025
5dd79da
[BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (#25988)
LucasWilkinson Oct 1, 2025
5a8a8fc
Update to Transformers `v4.56.2` (#24638)
hmellor Oct 1, 2025
e313609
[Misc]allow disable pynccl (#25421)
luccafong Oct 1, 2025
54b8e41
[Doc] updating torch.compile doc link (#25989)
vnadathur Oct 1, 2025
7e71da5
[BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-t…
wwl2755 Oct 1, 2025
5d22264
[Misc] Factor out common `_apply_feature_select_strategy` (#26003)
DarkLight1337 Oct 1, 2025
2add6d5
[CI] Only capture a single CUDA graph size in CI by default (#25951)
hmellor Oct 1, 2025
4ef812a
[MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes <…
billishyahao Oct 1, 2025
45b3629
[Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type …
natoscott Oct 1, 2025
efc7a1b
[Bugfix] Apply same sampling parameters for both `n=1` and `n>1` (#26…
kmaehashi Oct 1, 2025
4b427d8
[NVIDIA] Blackwell Family (#24673)
johnnynunez Oct 1, 2025
4d7c7eb
Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_in…
hl475 Oct 1, 2025
04d85e2
[CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (#26030)
mgoin Oct 1, 2025
f882803
[BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026)
LucasWilkinson Oct 1, 2025
56c7852
[ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (#25908)
hyoon1 Oct 1, 2025
0a212d5
[Bug] Fix Negative Cuda Memory Usage (#25683)
yewentao256 Oct 1, 2025
4fff719
[BugFix] ChunkedLocalAttention is currently not CG compatible (#26034)
LucasWilkinson Oct 1, 2025
ac1dec8
Support RL online quantization with torchao (#23014)
jerryzh168 Oct 1, 2025
abc3966
[ROCm][Bugfix] Add missing parameter to ROCm backend (#26029)
gshtras Oct 2, 2025
79c8bed
[Misc] Make handling of SamplingParams clearer in n>1 case (#26032)
njhill Oct 2, 2025
ec625a7
Run:ai model streamer add GCS package support (#24909)
pwschuurman Oct 2, 2025
6137ac0
Update base image to 22.04 (jammy) (#26065)
huydhn Oct 2, 2025
ee5f2ad
Change size of single CUDA graph for CI to 4 (#26089)
tdoublep Oct 2, 2025
eeb4b15
[FA/Chore] Bump vllm-flash-attention (#25537)
LucasWilkinson Oct 2, 2025
ff8945d
[Model] Use `merge_by_field_config` for MM models (A-C) (#26073)
DarkLight1337 Oct 2, 2025
635f277
[Model] Use `merge_by_field_config` for MM models (D-F) (#26076)
DarkLight1337 Oct 2, 2025
a77f694
[Platform][CI] Added OOT platform interface e2e test that running on …
leo-pony Oct 2, 2025
37c2551
[Qwen][ROCm] Flash Attention Rotary Embeddings (#24642)
vllmellm Oct 2, 2025
2876b00
[CI] Add Blackwell DeepSeek FP8 FlashInfer MoE tests (#26040)
mgoin Oct 2, 2025
16c4ce6
[CI/Build] Replace `vllm.entrypoints.openai.api_server` entrypoint wi…
DarkLight1337 Oct 2, 2025
6379eae
[BugFix] Fix FI accuracy issue when used for MLA prefill (#26063)
LucasWilkinson Oct 2, 2025
a26a1d3
[Small] Prevent bypassing media domain restriction via HTTP redirects…
huachenheli Oct 2, 2025
3389e2a
[Deepseek v3.2] Support indexer prefill chunking (#25999)
heheda12345 Oct 2, 2025
63825a2
EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 3…
ekagra-ranjan Oct 2, 2025
cc258ed
[Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MT…
heheda12345 Oct 2, 2025
268ef21
[Perf] Fix and reapply move apply w8a8 block fp8 linear to class (#25…
ElizaWszola Oct 2, 2025
001a19c
Fix MTP with deepep_low_latency (#25904)
MatthewBonanni Oct 2, 2025
13dcdb5
[Bugfix] Disable cascade attention with FlashInfer (#26130)
mgoin Oct 2, 2025
5508cce
[Log] Optimize DeepGEMM Missing Log (#26106)
yewentao256 Oct 3, 2025
0b5de21
[Bug][Benchmark] Fix duplicate req in oversampling (#26140)
ekagra-ranjan Oct 3, 2025
298e730
[Attention] Move Backend enum into registry (#25893)
MatthewBonanni Oct 3, 2025
c2a2acd
[CI/Build] Conditionally register cutlass_fp4_group_mm to fix buildin…
mgoin Oct 3, 2025
88fb7b4
[DeepSeek] Improve performance of DS MLA cache kernel (#26132)
MatthewBonanni Oct 3, 2025
bf6ddfa
[Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (#26144)
benchislett Oct 3, 2025
3064f88
[gpt-oss] disable tool server initialization if no tool in request (#…
qandrew Oct 3, 2025
63c869d
[Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (…
tlrmchlsmth Oct 3, 2025
4e88df0
[ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (#2…
tjtanaa Oct 3, 2025
a2079d6
[Bugfix] Fix import `gemm_afp4wfp4` failure on AMD (#26068)
zhewenl Oct 3, 2025
82e112e
[Model] Use `merge_by_field_config` for MM models (G) (#26117)
DarkLight1337 Oct 3, 2025
9fba170
`FusedMoE` support for the Transformers backend (#22650)
hmellor Oct 3, 2025
0e13c0a
[BUG] Reorder model config creation (#26124)
hao-aaron Oct 3, 2025
a9e50dd
[Misc] Remove typing.List (#26150)
varun-sundar-rabindranath Oct 3, 2025
f7502c5
[Input] Remove unused `prompt` field (#26097)
DarkLight1337 Oct 3, 2025
f021439
[Perf] Optimize `reshape_and_cache` CUDA Kernel (#25955)
ZJY0516 Oct 3, 2025
09ffe07
add(v1): RequestStatesStats to RequestOutput (#24947)
huijjj Oct 3, 2025
0a3b75c
[Model] Use `merge_by_field_config` for MM models (InternVL family) (…
DarkLight1337 Oct 3, 2025
60e9d4f
[test utils] correct wrong typing (#26159)
yannicks1 Oct 3, 2025
a05aa92
[CI] Fix distributed hybrid tests in CI (#26155)
tdoublep Oct 3, 2025
48f7031
[NIXL][Misc] Expose metrics from NIXL for logging to CLI (#25388)
NickLucche Oct 3, 2025
2b4eadc
[openai] Fix missing tool usage check (system message) (#24768)
levunet Oct 3, 2025
4023428
[Multi Modal] Configurable MM Profiling (#25631)
wwl2755 Oct 3, 2025
d0b6bef
[Doc] Fixed shape description for fused_batched_moe.py (#25668)
Egor-Krivov Oct 3, 2025
7a9f450
Quick fix for IMA with the Prefix Prefill kernel during graph capture…
SageMoore Oct 3, 2025
16414a0
[Renderer] Move Processor out of AsyncLLM (#24138)
KKSK-DON Oct 3, 2025
ec37d88
[Bugfix] Re-enable prefill of max model length (#24446)
yannicks1 Oct 3, 2025
cf7f947
[backends][short_conv] CUDA graph piecewise edits (#24215)
paulpak58 Oct 3, 2025
e0ad480
[Model] Supplement to PR 24862: Pass param prefix to LLMHead (#25805)
whx-sjtu Oct 3, 2025
03386b2
[CI/Build] do not enforce precompilation on tpu ci tests (#25992)
sixiang-google Oct 3, 2025
9c3d84f
[Model] Fixed stream generator for gpt-oss + spec-decoding (#26027)
astralord Oct 3, 2025
5a0dcdf
[Renderer] Move Processor out of LLMEngine (#26165)
DarkLight1337 Oct 3, 2025
4298159
Fix undefined symbol: cutlass_moe_mm_sm100 (#26098)
jasl Oct 3, 2025
fe4577d
[BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduce…
xuechendi Oct 3, 2025
55e7e7e
Stop mergify from keeping stale PRs alive (#26169)
hmellor Oct 3, 2025
6d4463d
Avoid division by zero in cache DS MLA kernel (#26174)
MatthewBonanni Oct 3, 2025
7cf2f77
Fix V1 engine serialization error with Ray distributed executor (#26148)
nrghosh Oct 3, 2025
c9ae940
[Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix…
pavanimajety Oct 3, 2025
17546d5
[Perf] Remove hardcoded num_warps=1 (#26183)
chelsea0x3b Oct 3, 2025
3642f77
[Refactor] Optimize FP8 MOE Backend Choice and Log (#26044)
yewentao256 Oct 3, 2025
674a6cd
[responsesAPI] add better error messaging for long prompts (#25724)
qandrew Oct 3, 2025
a0862bf
[Bugfix] Relax tokenizer regex for mixtral to include 'tokenizer.mode…
BowenBao Oct 3, 2025
35ea5af
[CI] Push multiarch manifests as nightly builds (#25764)
csahithi Oct 3, 2025
0ee1039
[Misc] Add penalties sampling parameters to serve tool (#25974)
southfreebird Oct 3, 2025
f68af14
[BugFix] Fix de-functionalization pass for rotary_embedding (#23953)
angelayi Oct 3, 2025
43b6959
[CI] Fix Pre-commit Mypy Error (#26181)
yewentao256 Oct 3, 2025
07f7a9a
[GPTOSS][DP/EP][Marlin] Enable GPTOSS DP/EP using Marlin kernels (#25…
varun-sundar-rabindranath Oct 4, 2025
7fe088c
Fix issue of using only the part of video frame [Nemotron Nano] (#26186)
BloodAxe Oct 4, 2025
c47afb0
[Bugfix] Fix qwen3 vl dummy data generation with overrides (#26193)
ywang96 Oct 4, 2025
1e50901
[BugFix] Use async Mistral Tokenizer in Chat Completions (#26134)
bbrowning Oct 4, 2025
0505a94
Add batch invariant kernel override for FlashInfer backend [2/n] (#25…
bwasti Oct 4, 2025
46e1130
[cpu][perf] Accelerate unquantized-linear for AArch64 through oneDNN/…
fadara01 Oct 4, 2025
cf78827
[V1] [Hybrid] Mamba2 Automatic Prefix Caching (#25752)
s3woz Oct 4, 2025
505ce80
Support expert parallel in Transformers backend (#26162)
hmellor Oct 4, 2025
151293b
[Model] Support nested structures for TensorSchema (#26212)
DarkLight1337 Oct 4, 2025
85b632d
[Misc] Require `merge_by_field_config` argument (#26214)
DarkLight1337 Oct 4, 2025
371651b
[Misc] Remove unused `executor.apply_model` (#26215)
DarkLight1337 Oct 4, 2025
abe8a61
[CI Failure] fix_test_auto_prefix_cache_support (#26053)
hl475 Oct 4, 2025
030ccbf
Revert "Add batch invariant kernel override for FlashInfer backend [2…
DarkLight1337 Oct 4, 2025
4aa7dd6
Add Olmo 3 reasoning parser (#26054)
soldni Oct 4, 2025
2bbd103
[Core] Enable decode of context length equal to max model length (#26…
yannicks1 Oct 4, 2025
516f106
[Bugfix] Fix `_reqs_to_process` leak on abort (#26012)
NickLucche Oct 4, 2025
a63a36a
[Model] CLIP Embedding Support (#26010)
DarkLight1337 Oct 4, 2025
e3b1d98
Fix tensor device and dtype placement in Qwen2VL model (#26219)
yuafng Oct 4, 2025
0e8da6c
[V1] [Hybrid] Remove code to override default CUDA graph configuratio…
tdoublep Oct 4, 2025
f82a350
[CPU] Refine batch reorder of CPU attention backend (#26096)
bigPYJ1151 Oct 4, 2025
334ca27
[Frontend] Cache chat template kwargs resolution (#26227)
Isotr0py Oct 4, 2025
5bdc29b
[Renderer] Clean up renderer code (#26216)
DarkLight1337 Oct 4, 2025
70d9843
[Model] Use `merge_by_field_config` for MM models (H-L) (#26230)
DarkLight1337 Oct 5, 2025
b950e54
[Easy] Add str repr for IterationStats (#26232)
22quinn Oct 5, 2025
e33893e
[Bugfix] Allow `--skip-tokenizer-init` with `echo and return_token_id…
DarkLight1337 Oct 5, 2025
d7ccd65
Add documentation for granite 4 tool calling (#26175)
maxdebayser Oct 5, 2025
668ba11
[Perf][Easy] Early stop in request_block_hasher (#26112)
Jialin Oct 5, 2025
2aa85d7
[Bugfix]: Assertion error when using FlashInfer backend (#25933)
simondanielsson Oct 5, 2025
bad8d59
[Bugfix] Always apply MM processor even when no MM items are passed (…
DarkLight1337 Oct 5, 2025
318f3eb
[Bugfix][Hardware][RISC-V] Limit supported dtypes to float32 to avoid…
ihb2032 Oct 5, 2025
652a359
[Platform][Kernel] platform-specific kernel loading (#25823)
ILikeIneine Oct 5, 2025
1b2424f
Convert formatting to use `ruff` instead of `yapf` + `isort` (#26247)
hmellor Oct 5, 2025
29a4b3d
Remove all references to `yapf` as it's no longer used (#26251)
hmellor Oct 5, 2025
d6e4f05
Remove all cases of `fmt: on/off` (#26253)
hmellor Oct 5, 2025
e32e5e3
fix(tests): Resolve late binding of loop variable in assert message l…
ihb2032 Oct 5, 2025
c63b1fe
Fix per file ruff ignores related to typing (#26254)
hmellor Oct 5, 2025
3a8bfdb
Update `ruff` pre-commit hooks version (#26255)
hmellor Oct 5, 2025
bea94fb
[CI] fix mamba kernel test (#26250)
ZJY0516 Oct 5, 2025
dc12348
(1) Add video_pruning_rate to NanoNemotronVLProcessor and (2) bring b…
tomeras91 Oct 5, 2025
bf02a1e
consolideate all cases (non-EVS, EVS dummy, EVS real) to a single get…
tomeras91 Oct 5, 2025
9d589a0
reuse _merge_multimodal_embeddings instead of writing the same logic …
tomeras91 Oct 6, 2025
4ba5334
Merge branch 'main' into evs-nano-nemotron-vl
tomeras91 Oct 6, 2025
221 changes: 203 additions & 18 deletions vllm/model_executor/models/nano_nemotron_vl.py
@@ -44,6 +44,10 @@
maybe_prefix,
)
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.evs import (
compute_retained_tokens_count,
compute_retention_mask,
)
from vllm.multimodal.inputs import (
MultiModalDataDict,
MultiModalFieldConfig,
@@ -62,13 +66,20 @@
PromptReplacement,
PromptUpdate,
PromptUpdateDetails,
_seq2tokens,
)
from vllm.multimodal.profiling import BaseDummyInputsBuilder
from vllm.sequence import IntermediateTensors
from vllm.transformers_utils.configs.radio import RadioConfig
from vllm.transformers_utils.tokenizer import AnyTokenizer
from vllm.transformers_utils.tokenizer import (
AnyTokenizer,
cached_tokenizer_from_config,
encode_tokens,
)
from vllm.utils.tensor_schema import TensorSchema, TensorShape

from .utils import _merge_multimodal_embeddings

# Configure PIL to handle large images without warnings
# This prevents DecompressionBombWarning for legitimate large images
Image.MAX_IMAGE_PIXELS = None # Disable the limit entirely
@@ -382,6 +393,7 @@ def __init__(
max_dynamic_patch: Optional[int] = None,
dynamic_image_size: Optional[bool] = None,
video_token: Optional[str] = None,
video_pruning_rate: Optional[float] = None,
) -> None:
super().__init__(
config=config,
@@ -392,6 +404,7 @@
)
# add extra video token for video processing
self.video_token = video_token
self.video_pruning_rate = video_pruning_rate

@property
def supports_video(self) -> bool:
@@ -446,12 +459,38 @@ def _preprocess_video(
),
}

image_size: int = self.config.force_image_size
patch_size: int = self.config.patch_size
downsample_ratio = self.config.downsample_ratio
tokens_per_frame = int(
(image_size * image_size // patch_size**2) * (downsample_ratio**2)
)

for pixel_values in pixel_values_lst_video:
num_patches = pixel_values.shape[0]
num_frames = pixel_values.shape[0]

if (
self.video_pruning_rate is not None
and self.video_pruning_rate > 0.0
):
# Start of EVS-specific code
num_tokens = compute_retained_tokens_count(
tokens_per_frame=tokens_per_frame,
num_frames=num_frames,
q=self.video_pruning_rate,
)

# Here we just need placeholders that won't actually be replaced -
# we just need to make sure the total number of tokens is correct
# assign all tokens to the first frame
tokens_per_frame = [num_tokens] + [0] * (num_frames - 1)

# End of EVS-specific code
else:
tokens_per_frame = [tokens_per_frame] * num_frames
Comment on lines +486 to +490

P1: Avoid mutating tokens_per_frame across multiple videos

Inside _preprocess_video the variable tokens_per_frame is initialized once outside the loop as an integer per frame, but the loop overwrites it with a list ([num_tokens] + [0] * … or [tokens_per_frame] * num_frames). On the next iteration the same variable is passed back into compute_retained_tokens_count, which now receives a list instead of an int and will raise or generate malformed placeholders whenever more than one video is present. Multi-video batches with EVS enabled will crash, and those without EVS will produce nested lists and incorrect token counts.
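The failure mode this comment describes is easy to reproduce in isolation. The sketch below uses simplified stand-ins (the real compute_retained_tokens_count lives in vllm.multimodal.evs and its exact formula may differ from this stub); it shows the fix of keeping the per-frame constant separate from the per-video list:

```python
from typing import Optional

def compute_retained_tokens_count(tokens_per_frame: int, num_frames: int, q: float) -> int:
    # Stand-in for the EVS helper (assumed behavior): retain roughly a
    # (1 - q) fraction of all video tokens, never less than one frame's worth.
    return max(tokens_per_frame, int(tokens_per_frame * num_frames * (1.0 - q)))

def plan_placeholders(
    num_frames_per_video: list[int],
    base_tokens_per_frame: int,
    pruning_rate: Optional[float],
) -> list[list[int]]:
    plans = []
    for num_frames in num_frames_per_video:
        if pruning_rate is not None and pruning_rate > 0.0:
            num_tokens = compute_retained_tokens_count(
                tokens_per_frame=base_tokens_per_frame,  # stays an int on every iteration
                num_frames=num_frames,
                q=pruning_rate,
            )
            # Fresh list per video: all placeholder tokens assigned to frame 1.
            per_frame = [num_tokens] + [0] * (num_frames - 1)
        else:
            per_frame = [base_tokens_per_frame] * num_frames
        plans.append(per_frame)
    return plans

print(plan_placeholders([2, 3], 256, 0.5))  # [[256, 0], [384, 0, 0]]
```

Because the scalar constant is never reassigned, a second (or tenth) video in the batch sees the same int the first one did.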


video_repl = self.get_video_repl(tokens_per_frame, self.video_token)

video_repl = self.get_video_repl(
self.num_image_token, num_patches, self.video_token
)
text = [t.replace("<video>", video_repl.full, 1) for t in text]
return text, video_inputs
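The tokens-per-frame arithmetic in this hunk can be checked with assumed example values (512-pixel tiles, 16-pixel patches, 0.5 downsample ratio — illustrative numbers, not any particular checkpoint):

```python
# Same formula as in _preprocess_video, with made-up config values.
image_size = 512        # config.force_image_size
patch_size = 16         # config.patch_size
downsample_ratio = 0.5  # config.downsample_ratio

tokens_per_frame = int(
    (image_size * image_size // patch_size**2) * (downsample_ratio**2)
)
print(tokens_per_frame)  # (262144 // 256) * 0.25 = 256
```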

@@ -501,20 +540,40 @@ def get_image_repl(

return PromptUpdateDetails.select_text(repl_full, IMG_CONTEXT)

@classmethod
def get_video_repl(
self,
feature_size: int,
num_patches: Optional[int] = None,
cls,
tokens_per_frame: list[int],
video_context_token: str = IMG_CONTEXT,
) -> PromptUpdateDetails[str]:
repl_features = video_context_token * self.num_image_token
repl_features_with_sep = IMG_START + repl_features + IMG_END
# num_patches is equal to num_frames
"""
Build prompt replacement for a video.
The replacement returned is not actually used to replace the placeholder
tokens - it's just used to make sure we allocate the correct number
of tokens.
Actual replacement is done in get_multimodal_embeddings of
NemotronH_Nano_VL_V2
(specifically in _process_video_input -> _create_final_video_embeddings).
There, we create the final embeddings with text embeddings for indicator tokens
and video embeddings for video tokens.
    This is a single function that handles all cases - non-EVS, EVS dummy, EVS real.
    The differentiation is done via the tokens_per_frame parameter:
    - non-EVS case - the same constant value across all frames
    - EVS dummy - it doesn't matter how tokens are distributed between frames -
    just make sure the total number of tokens is correct.
    - EVS real (called from get_real_video_repl_for_evs) - a different value per frame
Args:
tokens_per_frame (list[int]): number of tokens per frame
video_context_token (str): the token to use for the video context
"""
repl_full = "".join(
[f"Frame{i + 1}: {repl_features_with_sep}" for i in range(num_patches)]
[
f"Frame{i + 1}: {IMG_START}{video_context_token * num_tokens}{IMG_END}"
for i, num_tokens in enumerate(tokens_per_frame)
]
)

return PromptUpdateDetails.select_text(repl_full, video_context_token)
return PromptUpdateDetails.select_text(repl_full, repl_full)
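The replacement string built by get_video_repl can be sketched standalone; the marker strings below are placeholders for illustration, not the model's actual special tokens:

```python
# Simplified reproduction of the per-frame replacement layout above.
IMG_START, IMG_END, IMG_CONTEXT = "<img>", "</img>", "<IMG_CONTEXT>"

def video_repl(tokens_per_frame: list[int], video_context_token: str = IMG_CONTEXT) -> str:
    return "".join(
        f"Frame{i + 1}: {IMG_START}{video_context_token * num_tokens}{IMG_END}"
        for i, num_tokens in enumerate(tokens_per_frame)
    )

# EVS-style plan: all retained tokens attributed to the first frame.
print(video_repl([3, 0]))
# Frame1: <img><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT></img>Frame2: <img></img>
```

A frame with zero retained tokens still emits its "FrameN:" framing, which is why the total token count — not the per-frame distribution — is what the placeholder stage must get right.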


class BaseNanoNemotronVLProcessingInfo(BaseProcessingInfo):
@@ -605,6 +664,9 @@ def get_supported_mm_limits(self):
def get_video_token(self) -> Optional[str]:
return IMG_CONTEXT

def get_video_pruning_rate(self) -> Optional[float]:
return self.ctx.get_mm_config().video_pruning_rate

def get_num_frames_with_most_features(
self,
seq_len: int,
@@ -628,6 +690,7 @@ def get_hf_processor(self, **kwargs: object) -> NanoNemotronVLProcessor:
config=self.get_hf_config(),
tokenizer=self.get_tokenizer(),
video_token=self.get_video_token(),
video_pruning_rate=self.get_video_pruning_rate(),
**kwargs,
)

@@ -805,8 +868,26 @@ def get_video_replacement_internvl(item_idx: int):
if num_patches is not None:
assert isinstance(num_patches, int)

video_pruning_rate = self.info.ctx.get_mm_config().video_pruning_rate
if video_pruning_rate is not None and video_pruning_rate > 0.0:
# Start of EVS-specific code
num_tokens = compute_retained_tokens_count(
tokens_per_frame=feature_size,
num_frames=num_patches,
q=video_pruning_rate,
)
# Here we just need placeholders that won't actually be replaced -
# we just need to make sure the total number of tokens is correct
# assign all tokens to the first frame
tokens_per_frame = [num_tokens] + [0] * (num_patches - 1)

# End of EVS-specific code
else:
tokens_per_frame = [feature_size] * num_patches

return hf_processor.get_video_repl(
feature_size, num_patches, video_context_token=hf_processor.video_token
tokens_per_frame,
video_context_token=hf_processor.video_token,
)

if self.info.supports_video:
@@ -913,7 +994,7 @@ def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
config = vllm_config.model_config.hf_config

multimodal_config = vllm_config.model_config.multimodal_config
image_size = config.force_image_size
patch_size = config.patch_size
self.patch_size = patch_size
@@ -924,7 +1005,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
self.downsample_ratio = config.downsample_ratio
self.ps_version = config.ps_version
self.image_tag_type = config.image_tag_type

self.video_pruning_rate = multimodal_config.video_pruning_rate
self.language_model = init_vllm_registered_model(
vllm_config=vllm_config,
hf_config=config.text_config,
@@ -957,6 +1038,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
self.img_context_token_id = None
self.video_context_token_id = None
self.config = config
self.model_config = vllm_config.model_config

def pixel_shuffle(self, x, scale_factor=0.5):
n, w, h, c = x.size()
@@ -1049,7 +1131,7 @@ def _parse_and_validate_image_input(

def _process_image_input(
self, image_input: NanoNemotronVLImageInputs
) -> torch.Tensor:
) -> tuple[torch.Tensor, ...]:
if image_input["type"] == "image_embeds":
return image_input["data"]

@@ -1071,6 +1153,109 @@ def _process_image_input(
]
return image_embeds.split(image_feature_sizes)

def _process_video_input(
self, video_input: NanoNemotronVLVideoPixelInputs
) -> tuple[torch.Tensor, ...]:
"""Process video input and create final embeddings with video content
and indicator tokens."""
# Get video embeddings using the same processing as images
video_embeddings = self._process_image_input(video_input)

final_video_embeddings: tuple[torch.Tensor, ...] = ()

image_rows = image_cols = self.config.force_image_size
downsample_ratio = self.config.downsample_ratio
patch_size = self.config.patch_size
rows = int(image_rows * downsample_ratio // patch_size)
cols = int(image_cols * downsample_ratio // patch_size)
video_pruning_rate = self.video_pruning_rate

# Calculate video feature dimensions (number of frames and
# their feature size (AKA tokens per frame))
# TODO: Maybe this can be optimized to avoid the loop?
for i, single_video_embeddings in enumerate(video_embeddings):
num_frames = video_input["num_patches"][i].item()
assert single_video_embeddings.shape[0] % num_frames == 0
Contributor comment (severity: high):

Using assert for input validation in production code can be risky. Assertions can be disabled with Python's -O flag, and they raise a generic AssertionError. It's better to use an explicit if check and raise a ValueError with a descriptive message. This makes the code more robust against unexpected or malformed inputs and prevents potential server crashes.

            if single_video_embeddings.shape[0] % num_frames != 0:
                raise ValueError(
                    f"The number of video embeddings ({single_video_embeddings.shape[0]}) "
                    f"is not divisible by the number of frames ({num_frames})."
                )


if video_pruning_rate is not None and video_pruning_rate > 0.0:
# Start of EVS-specific code
retention_mask = compute_retention_mask(
single_video_embeddings,
video_size_thw=torch.tensor([num_frames, rows, cols]),
spatial_merge_size=1,
q=video_pruning_rate,
)

# apply retention mask
single_video_embeddings = single_video_embeddings[retention_mask]

# calculate the actual number of retained tokens per frame
retention_mask_thw = retention_mask.reshape(num_frames, rows, cols)
num_tokens_per_frame = (
retention_mask_thw.sum(dim=(1, 2)).long().tolist()
)
# End of EVS-specific code
else:
feature_size = single_video_embeddings.shape[0] // num_frames
num_tokens_per_frame = [feature_size] * num_frames

final_video_embeddings += (
self._create_final_video_embeddings(
single_video_embeddings,
num_tokens_per_frame,
),
)

return final_video_embeddings
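The per-frame bookkeeping in the EVS branch above — reshaping the flat retention mask to (frames, rows, cols) and summing per frame — can be exercised with toy shapes (assuming PyTorch; all values here are made up):

```python
import torch

# A flat retention mask over num_frames * rows * cols tokens; True means
# the token survives EVS pruning. Shapes are illustrative only.
num_frames, rows, cols = 2, 2, 2
retention_mask = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0], dtype=torch.bool)

# Same reshape-and-sum as in _process_video_input.
retention_mask_thw = retention_mask.reshape(num_frames, rows, cols)
num_tokens_per_frame = retention_mask_thw.sum(dim=(1, 2)).long().tolist()
print(num_tokens_per_frame)  # [3, 1]
```

The resulting list feeds get_video_repl so the replacement string's per-frame token counts match the rows actually kept in the embedding tensor.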

def _create_final_video_embeddings(
self,
video_embeddings: torch.Tensor,
num_tokens_per_frame: list[int],
) -> torch.Tensor:
"""Create final embeddings that combine video embeddings with
text embeddings of indicator tokens.

These final embeddings contain:
- Actual video embeddings in positions corresponding to video content
- Text embeddings for indicator tokens (<img>, </img>, and
frame separation text) in their respective positions

These embeddings will replace the placeholder embeddings to create
input_embeds for the LLM.
"""
device = video_embeddings.device

# Generate video replacement text and convert to token IDs
video_repl_text = NanoNemotronVLProcessor.get_video_repl(
num_tokens_per_frame,
IMG_CONTEXT,
).full

tokenizer = cached_tokenizer_from_config(self.model_config)
repl_token_ids = torch.tensor(
_seq2tokens(tokenizer, video_repl_text), device=device
)

# Get embedding token IDs for image context
embed_token_ids = torch.tensor(
encode_tokens(tokenizer, IMG_CONTEXT), device=device
)

# Create mask for video embedding positions
is_video_embed = torch.isin(repl_token_ids, embed_token_ids)

# Create final video embeddings, merging text embeddings for indicator
# tokens with video embeddings
text_embeddings = self.get_language_model().get_input_embeddings(repl_token_ids)
final_video_embeddings = _merge_multimodal_embeddings(
inputs_embeds=text_embeddings,
multimodal_embeddings=video_embeddings,
is_multimodal=is_video_embed,
)

return final_video_embeddings
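The merge performed by _merge_multimodal_embeddings can be approximated with a boolean-mask scatter; this is a simplified sketch with invented token IDs and dimensions, not vLLM's actual helper:

```python
import torch

# Positions whose replacement token ID equals the context token receive
# video embeddings; all other positions keep their text embeddings.
hidden = 4
repl_token_ids = torch.tensor([10, 99, 99, 11])  # e.g. <img>, ctx, ctx, </img>
embed_token_ids = torch.tensor([99])             # assumed IMG_CONTEXT id
is_video_embed = torch.isin(repl_token_ids, embed_token_ids)

text_embeddings = torch.zeros(4, hidden)         # stand-in for LM embeddings
video_embeddings = torch.ones(2, hidden)         # one row per retained token

merged = text_embeddings.clone()
merged[is_video_embed] = video_embeddings        # scatter video rows into place
print(merged.sum().item())  # 8.0 -> exactly the two video rows of ones
```

The number of True entries in the mask must equal the number of video embedding rows — which is exactly the invariant the per-frame token counts above are maintaining.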

def _parse_and_validate_video_input(
self, **kwargs: object
) -> Optional[NanoNemotronVLVideoPixelInputs]:
@@ -1152,7 +1337,7 @@ def get_multimodal_embeddings(self, **kwargs: object) -> MultiModalEmbeddings:
multimodal_embeddings += vision_embeddings
if modality == "videos":
video_input = modalities["videos"]
video_embeddings = self._process_image_input(video_input)
video_embeddings = self._process_video_input(video_input)
multimodal_embeddings += video_embeddings

return multimodal_embeddings
8 changes: 6 additions & 2 deletions vllm/model_executor/models/qwen2_5_vl.py
@@ -1017,9 +1017,13 @@ def get_replacement_qwen2vl(item_idx: int, modality: str):
and video_pruning_rate is not None
and video_pruning_rate > 0.0
):
T, H, W = map(int, grid_thw)
tokens_per_frame = (H // image_processor.merge_size) * (
W // image_processor.merge_size
)
num_tokens = compute_retained_tokens_count(
grid_thw,
image_processor.merge_size,
tokens_per_frame,
T,
video_pruning_rate,
)
# End of EVS-specific code
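The per-frame token count for Qwen2.5-VL above follows directly from the vision grid and the merge size; a worked check with assumed values (not from a real checkpoint):

```python
# grid_thw is (T, H, W) in vision patches; merge_size collapses patches
# into merged tokens along each spatial axis. Values are illustrative.
grid_thw = (4, 28, 28)
merge_size = 2  # image_processor.merge_size

T, H, W = map(int, grid_thw)
tokens_per_frame = (H // merge_size) * (W // merge_size)
print(T, tokens_per_frame)  # 4 196
```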