Merged

137 commits
a0119e3
Improve fused MoE‑LoRA kernel indexing and memory access
cwazai Jan 21, 2026
f98d2eb
Remove redundant mask in expert_id_i32 load
cwazai Jan 21, 2026
9816085
Modifications to remove irrelevant optimizations
cwazai Jan 22, 2026
2f4df5b
Kernel (_fused_moe_lora_kernel):
cwazai Jan 23, 2026
2cfbe19
Remove split_k
cwazai Jan 23, 2026
c83d089
1. Rename MAX_LORAS_TOTAL to max_loras for clarity
cwazai Jan 24, 2026
502f11b
resolve conflict
cwazai Jan 25, 2026
8f57d8e
[Bugfix] Force using spawn multiprocess method when it's the WSL plat…
jasonyanwenl Jan 21, 2026
7666091
[Model] Add Eagle2.5-8B Vision-Language Model support (#32456)
George-Polya Jan 21, 2026
4da8f6f
[bugfix] Aria model (#32727)
divakar-amd Jan 21, 2026
7899196
[MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414)
robertgshaw2-redhat Jan 21, 2026
abe6851
[Quantization][Deprecation] Remove `DeepSpeedFp8` (#32679)
robertgshaw2-redhat Jan 21, 2026
d1d5acf
[Quantization][Deprecation] Deprecate HQQ (#32681)
robertgshaw2-redhat Jan 21, 2026
1bc92b2
[ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp (#29287)
ganyi1996ppo Jan 21, 2026
20f38fc
[Quantization][Deprecation] Remove RTN (#32697)
robertgshaw2-redhat Jan 21, 2026
b982de0
[PluggableLayer][1/N] Define PluggableLayer (Fix ci) (#32744)
whx-sjtu Jan 21, 2026
24866b6
Bump Flashinfer to v0.6.1 (#30993)
elvischenv Jan 21, 2026
ac2c273
[Misc] Omit "disable NCCL for DP sync" startup log when not applicabl…
njhill Jan 21, 2026
b6bbb68
[Model Runner V2] Minor refactor for `compute_slot_mappings` (#32794)
WoosukKwon Jan 21, 2026
a72e7d0
Add missing import of fused_topk to benchmark_moe (#32784)
danisereb Jan 21, 2026
8b207bd
[ROCm] fix import for on_gfx9 (#32783)
divakar-amd Jan 21, 2026
3edc1c8
[ModelRunner V2] Don't pin reused flashinfer tensors (#32799)
njhill Jan 21, 2026
c0a8ab1
[Misc] Add Helion version check to collect_env (#32797)
gmagogsfm Jan 21, 2026
af07f6f
[Kernel] Add topk_sigmoid kernel (#31246)
xyang16 Jan 21, 2026
9a46d4d
[Model Runner V2] Refactor Prompt Logprobs (#32811)
WoosukKwon Jan 21, 2026
5cab5f6
[Model Runner V2] Do not error on attention backends (#32820)
WoosukKwon Jan 22, 2026
80e6fcd
[Deprecation] Remove deprecated environment variables (#32812)
yewentao256 Jan 22, 2026
6b078f5
[Bugfix] Fix potential EAGLE spec decode segfault during graph captur…
mawong-amd Jan 22, 2026
42b8d02
[EC Connector] Optimize remote cache check in scheduler (#32585)
knlnguyen1802 Jan 22, 2026
e01a674
Cleanup some huggingface_hub-related stuff (#32788)
Wauplin Jan 22, 2026
4adb5ec
[Docs] Remove outdated async_scheduling limitation with speculative d…
ikaadil Jan 22, 2026
7ab692f
[ROCm][CI] fix get_valid_backends (#32787)
divakar-amd Jan 22, 2026
6a7567c
[FlashMLA] Update FlashMLA to expose new arguments (#32810)
LucasWilkinson Jan 22, 2026
db01ec1
[Llama.py -> mistral.py] Extract mistral-only relevant code into sepa…
patrickvonplaten Jan 22, 2026
78750a7
Upgrade transformers-4.57.5 (#32287)
huydhn Jan 22, 2026
c8b56b7
[ROCm][CI] Lower Acceptance Len Threshold For test_draft_model_quanti…
micah-wil Jan 22, 2026
ba8a3aa
[ROCm][CI] Fix AITER test flakiness by using explicit attention backe…
AndreasKaratzas Jan 22, 2026
c8b2723
[ROCm][CI][Docs] Add comment explaining TRITON_ATTN fallback for ROCm…
AndreasKaratzas Jan 22, 2026
25b5600
[bench] add start_times field to vllm bench serve json result (#32667)
kebe7jun Jan 22, 2026
9dfb58d
[Model] Extend `collect_children` and `no_init_weights` contexts (#32…
DarkLight1337 Jan 22, 2026
6e96888
[AMD][ROCm] MoRI EP: a high-performance all2all backend (#28664)
alexsun07 Jan 22, 2026
830f060
[Misc] Replace urllib's `urlparse` with urllib3's `parse_url` (#32746)
Isotr0py Jan 22, 2026
c0d0a6d
[Benchmark] Don't default to `temperature==0` in `vllm bench serve` (…
njhill Jan 22, 2026
04dc150
Enable Cross layers KV cache layout at NIXL Connector (#30207)
liranschour Jan 22, 2026
b24f914
[Frontend][2/n] Make pooling entrypoints request schema consensus | C…
noooop Jan 22, 2026
d6518d7
[Bugfix] Fix Whisper/encoder-decoder GPU memory leak (#32789)
NickLucche Jan 22, 2026
5ca439b
[CI] refactor release pipeline config into groups (#32833)
Harry-Chen Jan 22, 2026
77467d1
[Frontend] add prompt_cache_key for openresponses (#32824)
chaunceyjiang Jan 22, 2026
7c712dd
OffloadingConnector: Support kernel_block_size != block_size (#30692)
orozery Jan 22, 2026
135099e
[Frontend] Introduce Renderer for processing chat messages (using `Mo…
DarkLight1337 Jan 22, 2026
1e8de71
[Misc][BE] Turn on strict type coverage for vllm/compilation (#31756)
Lucaskabela Jan 22, 2026
c482308
[torch.compile] Improve Cold Start for MoEs (#32805)
zou3519 Jan 22, 2026
bc077a8
[Cleanup] Move scheduler `get_routed_experts` logic to separate metho…
njhill Jan 22, 2026
31d068b
[Misc] Bump opencv-python dependency version to 4.13 (#32668)
Isotr0py Jan 22, 2026
2b541c5
Support bge-m3 sparse embeddings and colbert embeddings (#14526)
maxdebayser Jan 22, 2026
f2b6c80
[Bugfix] ModelScope is supported when downloading LORA models. (#32844)
AuYang261 Jan 22, 2026
9ca38c1
[Hardware][AMD][CI][Bugfix] Fix regressions from deprecated env vars …
mawong-amd Jan 22, 2026
7a4b69c
[MISC] Add .cursor to .gitignore (#32868)
vadiklyutiy Jan 22, 2026
505dc0e
[UX] Default api_server_count to dp_size if not specified (#32525)
tlrmchlsmth Jan 22, 2026
8ddd79e
Support custom URI schemes and trace handlers for profiler (#32393)
diviramon Jan 22, 2026
f3617c8
[Feature] Add --ssl-ciphers CLI argument for TLS cipher control (#30937)
ricky-chaoju Jan 22, 2026
eda9a3d
[CI][Attention] Add more CI dependencies for attention tests (#32487)
MatthewBonanni Jan 22, 2026
04460ed
[CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inferen…
fadara01 Jan 22, 2026
56eba0b
[Bugfix][Attention] Explicitly report support for kv_cache_dtype bflo…
MatthewBonanni Jan 22, 2026
35bd5e9
Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) (…
eldarkurtic Jan 22, 2026
b2d44ef
[Refactor] Remove unused tpu files (#32610)
yewentao256 Jan 22, 2026
72b043f
[Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper (…
xyang16 Jan 22, 2026
0f4fd95
[BugFix] Fix invalid flashinfer_fused_moe_blockscale_fp8 op registrat…
fadara01 Jan 22, 2026
5a17bba
[MoE Refactor] Move `select_experts` from `FusedMoEQuantMethod` -> `F…
bnellnm Jan 22, 2026
468c45a
[Misc] Log vLLM logo when starting server (#32796)
njhill Jan 23, 2026
4983e12
[BugFix] deepseek_v32_encoding: Replace asserts with proper exception…
RishabhSaini Jan 23, 2026
b6af384
[torch.compile] Compile `CustomOp.forward_native` for `SiluAndMul` an…
ProExpertProg Jan 23, 2026
08e06fa
[CI] Fix mypy for `vllm/v1/structured_output` (#32722)
yewentao256 Jan 23, 2026
325b7d7
[Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set (#3…
karanb192 Jan 23, 2026
1f3bd57
[CI][Models] Add VLM Support for Sequence Classification Conversion (…
AndreasKaratzas Jan 23, 2026
d7fd61a
[Misc] Add `get_name` to missing AttentionBackends (#32698)
NickLucche Jan 23, 2026
4980d0b
[CI/Build][CPU] Fix failed pooling tests and macos smoke test (#32907)
bigPYJ1151 Jan 23, 2026
c85d817
[Voxtral] Add new streaming arch (#32861)
patrickvonplaten Jan 23, 2026
35a2c9a
[Frontend][3/n] Make pooling entrypoints request schema consensus | E…
noooop Jan 23, 2026
6855d37
[CPU Backend][BugFix] Fix failing CPU MoE test (#32876)
fadara01 Jan 23, 2026
3aca787
[Benchmark][Bugfix] Fix race condition when starting server for sweep …
Isotr0py Jan 23, 2026
8f46305
[CPU][Feat] Update PyTorch to v2.10 for CPU Backend (#32869)
fadara01 Jan 23, 2026
518c8c1
[Feature]: Remove DtoH Copy for lfm2_vl On Default Stream (#32815)
tianshu-Michael-yu Jan 23, 2026
aae07c0
[Bugfix] Fix getting vision features in Transformer Multimodal backen…
zucchini-nlp Jan 23, 2026
6dfd19f
[Bugfix] Disable tma_aligned_scales in test_fusions_e2e (#32916)
xyang16 Jan 23, 2026
69acfe5
[Misc] Postpone torch_profiler deprecation (#32867)
NickLucche Jan 23, 2026
8fa0a36
[Bugfix] Fix FP8 MoE EP Weight Loading for ModelOpt Llama4 (#32886)
baonudesifeizhai Jan 23, 2026
6804f18
[ROCm][PD] Remove unused moriio connector proxy code (#32939)
markmc Jan 23, 2026
04c168d
[Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test (#32904)
mawong-amd Jan 23, 2026
2b48388
[Frontend] add logprob, compression_rate to 'verbose_json' features (…
sangbumlikeagod Jan 23, 2026
76fde85
[torch.compile][CI] Add back attn fusion on hopper/ada (#32940)
ProExpertProg Jan 23, 2026
70ca1f0
[Model] Enable LoRA support for internvl2 (#32397)
MatteoFari Jan 23, 2026
2c86342
[V1][Hybrid] Mamba Prefix Caching with align mode (#30877)
peakcrosser7 Jan 23, 2026
187c02a
[CI][torch nightlies] Use main Dockerfile with flags for nightly torc…
orionr Jan 23, 2026
f1ff77b
[Bugfix][CI] Fix pre-commit (#32956)
MatthewBonanni Jan 23, 2026
d6612ac
[Model Runner V2] Add KV Connector support (#32742)
njhill Jan 23, 2026
6740169
[cudagraphs] Refactor cudagraph capture loop (#32946)
LucasWilkinson Jan 23, 2026
a1cdaf9
fix: Add glm4_moe_lite to MLA detection (#32614)
marksverdhei Jan 23, 2026
e4b1fc1
[Bug] Fix benchmark script `moe_permute_unpermute` (#32949)
yewentao256 Jan 23, 2026
ffa2efc
[CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to…
rasmith Jan 23, 2026
8000ab5
[Refactor] Rename `gptq_marlin` to `marlin` to match MoE (#32952)
mgoin Jan 23, 2026
0108f4a
[Refactor] Clean up unused variables & func (#32692)
yewentao256 Jan 23, 2026
75311f3
[Bugfix] Fix missing is_layer_skipped check for FusedMoE in AWQConfig…
joninco Jan 23, 2026
3baef7f
[CI] fix version comparison and exclusion patterns in upload-release-…
Harry-Chen Jan 23, 2026
fc5ed11
[fix] add VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to compile factors (#32…
dolpm Jan 23, 2026
0b063f2
[Performance] Split FlashAttn attention and cache update (#25954)
ElizaWszola Jan 24, 2026
b4263fb
[Core][Bugfix] allow graceful worker termination (#32965)
joerunde Jan 24, 2026
03c481d
[ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 (#32…
monajafi-amd Jan 24, 2026
ec28238
Auth_token added in documentation as it is required (#32988)
ruizcrp Jan 24, 2026
086239b
[Dev UX] Add auto-detection for VLLM_PRECOMPILED_WHEEL_VARIANT during…
mgoin Jan 24, 2026
9223b58
[Perf] Cache xpu_get_mem_info() result to avoid duplicate calls (#32983)
sjhddh Jan 24, 2026
039d53c
[Tests] Clarify pytest skip reasons with actionable context (#32981)
sjhddh Jan 24, 2026
5975c3a
[Tests] Standardize RNG seed utility across test files (#32982)
sjhddh Jan 24, 2026
6dcd724
[docs] Update governance process links (#32995)
esmeetu Jan 24, 2026
e54514e
[Models] Add `SharedFusedMoE` support to Qwen3MoE (#32082)
Isotr0py Jan 24, 2026
77935f8
[Doc] Ignore typo check on doc (#32999)
ywang96 Jan 24, 2026
44e22be
feat(benchmark): add encoder forward pass benchmarking to mm-processo…
reaganjlee Jan 24, 2026
73d908e
[UX] Deduplicate sampling parameter startup logs (#32953)
DarkLight1337 Jan 24, 2026
fdb1cfb
[Perf] Cache exc.errors() result in validation exception handler (#32…
sjhddh Jan 24, 2026
1af9143
[Bugfix] Fix E2E latency calculation and add warmup support in mm_pro…
HirokenOvo Jan 24, 2026
bc2a887
[Models]: Make Multimodal config implicit in ViT implementation (#31972)
Isotr0py Jan 24, 2026
89d4644
feat: Complete LoRA support for MiniMaxM2 Fixes #32736 (#32763)
Chenhao-Guan Jan 24, 2026
173291f
[EncoderCacheManager] Remove unnecessary copy (#32800)
lgeiger Jan 24, 2026
eb596f7
[Bugfix]: resolve torch.compile cache conflict between mm_encoder_tp_…
HirokenOvo Jan 24, 2026
4c3cdc1
Update CPU doc according to feedback (#32963)
louie-tsai Jan 24, 2026
bb70650
[MLA] Fuse cat and quant for fp8 kv-cache (#32950)
LucasWilkinson Jan 24, 2026
7203206
[Tests] Replace flaky sleep with polling in test_background_cancel (#…
sjhddh Jan 24, 2026
94cc8d7
[CPU Backend][BugFix] Fix failing Darwin pipelines (#33002)
fadara01 Jan 24, 2026
0f21271
[CPU] Improve CPU Docker build (#30953)
maryamtahhan Jan 24, 2026
3cdaa87
[Feature] add session based streaming input support to v1 (#28973)
joshuadeng Jan 24, 2026
196f8b8
[DOC] [ROCm] Update doc for v0.14.1 (#32998)
tjtanaa Jan 25, 2026
b3deb77
[Perf][Kernel] Optimize FP4 quantization kernels (SM100F) (#32520)
LopezCastroRoberto Jan 25, 2026
2aac4da
[Docs] Fix Apple silicon include path in CPU installation docs (#32977)
sjhddh Jan 25, 2026
2776a97
[Bugfix] fix encoder cache hang in Qwen3VL (#32684)
JJJYmmm Jan 25, 2026
a3c086a
[BugFix] Add env variable to control PDL in LoRA (#32836)
jeejeelee Jan 25, 2026
67fce76
Merge branch 'main' into pr1/moe-lora-kernel-opt
cwazai Jan 25, 2026
a680781
Merge branch 'main' into pr1/moe-lora-kernel-opt
jeejeelee Jan 26, 2026
30 changes: 23 additions & 7 deletions vllm/lora/ops/triton_ops/fused_moe_lora_op.py
@@ -62,6 +62,7 @@ def _fused_moe_lora_kernel(
num_experts,
lora_ids,
adapter_enabled,
+ max_loras,  # <<< PR2: rename, used for masks when grid axis-2 != max_loras
# The stride variables represent how much to increase the ptr by when
# moving by 1 element in a particular dimension. E.g. `stride_am` is
# how much to increase `a_ptr` by to get the element one row down
@@ -83,6 +84,7 @@ def _fused_moe_lora_kernel(
num_slice_c: tl.constexpr,
top_k: tl.constexpr,
MUL_ROUTED_WEIGHT: tl.constexpr,
+ USE_B_L2_CACHE: tl.constexpr,  # new, enable .ca load for B
BLOCK_SIZE_M: tl.constexpr,
BLOCK_SIZE_N: tl.constexpr,
BLOCK_SIZE_K: tl.constexpr,
@@ -104,10 +106,13 @@ def _fused_moe_lora_kernel(
if moe_enabled == 0:
# Early exit for the no moe lora case.
return
- # The grid size on axis 2 is (max_loras + 1) to handle the no-lora case
- # (lora_id == -1), but sorted_token_ids and expert_ids are allocated with
- # shape (max_loras, ...). Use (num_programs - 1) for correct bounds checking.
- max_loras = tl.num_programs(axis=2) - 1
+ # The grid's axis-2 dimension is max_loras + 1 to accommodate the -1 sentinel.
+ # This guard ensures we don't access sorted_token_ids / expert_ids /
+ # num_tokens_post_padded beyond their allocated bounds if an invalid
+ # lora_id somehow appears. Although the caller should pass correct
+ # max_loras, defensive programming prevents accidental out-of-bounds.
+ if lora_id >= max_loras:
Review comment (Contributor): Add a comment on why we need to add this case. Technically, since you are passing it correctly, it shouldn't go in here, but it is fine to have this check.
+ return
grid_k = tl.cdiv(K, BLOCK_SIZE_K * SPLIT_K)

# calculate pid_m,pid_n
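For orientation, the axis-2 launch scheme described in the comments above can be emulated on the host in plain Python. This is an illustrative sketch only; the helper name `emulate_axis2_grid` and the driver loop are invented here and are not vLLM code:

```python
def emulate_axis2_grid(lora_ids, max_loras):
    """Emulate the kernel's axis-2 launch grid of size max_loras + 1.

    Each program reads its lora_id; the -1 sentinel marks a no-LoRA
    slot, and any id >= max_loras is skipped, mirroring the defensive
    bounds check in _fused_moe_lora_kernel.
    """
    executed = []
    for pid_l in range(max_loras + 1):
        lora_id = lora_ids[pid_l]
        if lora_id == -1:          # no-lora sentinel: early exit
            continue
        if lora_id >= max_loras:   # defensive out-of-bounds guard
            continue
        executed.append(lora_id)
    return executed

# Two active adapters (ids 0 and 2), padded with -1 sentinels; max_loras = 3.
print(emulate_axis2_grid([0, 2, -1, -1], 3))  # -> [0, 2]
```

The sketch shows why the guard is cheap insurance: even if an out-of-range id were ever passed, the program exits before indexing `sorted_token_ids` or `expert_ids`.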
@@ -136,10 +141,11 @@ def _fused_moe_lora_kernel(
cur_b_ptr = tl.load(b_ptr + slice_id).to(tl.pointer_type(c_ptr.dtype.element_ty))
cur_c_ptr = c_ptr + (slice_id % num_slice_c) * slice_c_size

- offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N).to(tl.int64)) % N
+ # remove modulo wrap-around
+ offs_bn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N).to(tl.int32)
offs_k = pid_sk * BLOCK_SIZE_K + tl.arange(0, BLOCK_SIZE_K)

- offs_token_id = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M).to(tl.int64)
+ offs_token_id = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M).to(tl.int32)
token_ind = stride_tl * lora_id + offs_token_id
offs_token = tl.load(
sorted_token_ids_ptr + token_ind,
@@ -176,7 +182,13 @@ def _fused_moe_lora_kernel(
# GDC wait waits for ALL programs in the prior kernel to complete
# before continuing.
# pre-fetch lora weight
Review comment (jeejeelee, Collaborator, Jan 22, 2026): Why delete these comments?

Reply (Author): Thank you for pointing this out. This was an oversight on my part; I originally intended to finalize these adjustments in the final step once everything was confirmed ready to merge. I'll restore the comments right away. Thanks again for the careful review.
- b = tl.load(b_ptrs, mask=offs_k[:, None] < k_remaining, other=0.0)
+ # add (offs_bn < N) mask; optional .ca for B
+ b_mask = (offs_k[:, None] < k_remaining) & (offs_bn[None, :] < N)
+ if USE_B_L2_CACHE:
+     b = tl.load(b_ptrs, mask=b_mask, other=0.0, cache_modifier=".ca")
+ else:
+     b = tl.load(b_ptrs, mask=b_mask, other=0.0)

if USE_GDC and not IS_PRIMARY:
tl.extra.cuda.gdc_wait()
a = tl.load(
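The hunk above replaces the old `% N` wrap-around indexing with an explicit `(offs_bn < N)` mask on the B load. A small NumPy emulation (illustrative only; `tl.load` masking is approximated with `np.where`) shows the behavioral difference for a block that extends past the last column:

```python
import numpy as np

N = 5                     # logical number of columns
BLOCK_SIZE_N = 8          # block is wider than N, so the tail is out of bounds
B = np.arange(10, 10 + N, dtype=np.float32)   # one row of the B matrix

# Old scheme: modulo wrap-around silently rereads columns 0, 1, 2.
offs_bn_wrap = (0 * BLOCK_SIZE_N + np.arange(BLOCK_SIZE_N)) % N
b_wrapped = B[offs_bn_wrap]

# New scheme: out-of-bounds lanes are masked off and filled with `other=0.0`.
offs_bn = 0 * BLOCK_SIZE_N + np.arange(BLOCK_SIZE_N)
mask = offs_bn < N
b_masked = np.where(mask, B[np.minimum(offs_bn, N - 1)], 0.0)

print(b_wrapped)  # [10. 11. 12. 13. 14. 10. 11. 12.]
print(b_masked)   # [10. 11. 12. 13. 14.  0.  0.  0.]
```

With the mask, the tail lanes contribute zeros to the accumulator instead of aliased reads of valid columns, and the compiler no longer has to materialize the modulo.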
@@ -276,6 +288,7 @@ def _fused_moe_lora_shrink(
num_experts,
lora_ids,
adapter_enabled,
+ lora_a_stacked[0].shape[0],
qcurr_hidden_states.stride(0),
qcurr_hidden_states.stride(1),
w1_lora_a_stacked.stride(0),
@@ -292,6 +305,7 @@ def _fused_moe_lora_shrink(
num_slice_c=num_slices,
top_k=1 if mul_routed_weight else top_k_num,
MUL_ROUTED_WEIGHT=False,
+ USE_B_L2_CACHE=True,  # new
IS_PRIMARY=True,
**shrink_config,
)
@@ -377,6 +391,7 @@ def _fused_moe_lora_expand(
num_experts,
lora_ids,
adapter_enabled,
+ lora_b_stacked[0].shape[0],
a_intermediate_cache1.stride(0),
a_intermediate_cache1.stride(1),
w1_lora_b_stacked.stride(0),
@@ -393,6 +408,7 @@ def _fused_moe_lora_expand(
num_slice_c=num_slices,
top_k=1,
MUL_ROUTED_WEIGHT=mul_routed_weight,
+ USE_B_L2_CACHE=True,  # new
IS_PRIMARY=False,
**expand_config,
)
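On the host side, the kernel now receives `max_loras` explicitly as `lora_a_stacked[0].shape[0]` (and `lora_b_stacked[0].shape[0]` for expand) instead of deriving it inside the kernel from `tl.num_programs(axis=2) - 1`. A hedged sketch of how such a launch might size its grid follows; the helper, tensor shapes, and grid layout here are assumptions for illustration, not the actual vLLM launch code:

```python
import numpy as np

def launch_grid(num_m_blocks, num_n_blocks, lora_a_stacked):
    """Sketch of grid sizing: max_loras is read from the stacked LoRA
    weights and also passed to the kernel as an argument, so the kernel
    no longer needs to reverse-engineer it from the grid."""
    max_loras = lora_a_stacked[0].shape[0]
    # Axis 2 gets one extra slot for the -1 (no-LoRA) sentinel.
    grid = (num_m_blocks * num_n_blocks, 1, max_loras + 1)
    return grid, max_loras

# e.g. 4 stacked adapters of assumed shape (max_loras, num_experts, rank, hidden)
lora_a_stacked = [np.zeros((4, 8, 16, 256), dtype=np.float16)]
grid, max_loras = launch_grid(3, 2, lora_a_stacked)
print(grid, max_loras)  # (6, 1, 5) 4
```

Passing `max_loras` explicitly keeps the bounds check valid even if a future caller sizes the grid differently from `max_loras + 1`, which the `# <<< PR2` comment in the kernel signature alludes to.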