Conversation
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Code Review
This pull request implements support for Expert Parallelism (EP) in Fused MoE LoRA layers, refactoring the implementation from a decorator-based approach to a more robust design built on MoELoRAContext and LoRAExpertsMixin. The changes enable native LoRA handling in the Triton and Marlin expert kernels and update the Punica wrapper to handle rank-local token mappings and expert slicing. Feedback highlights a bug in the token mapping logic when sequence parallelism is active, and a potential TypeError in non-gated MoE models, where some LoRA weights may be None.
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```python
self._lora_context = None

def set_lora_context(self, ctx) -> None:
    self._lora_context = ctx
```
I would think it's cleaner if the PrepareFinalize doesn't have to take the entire context. With regard to LoRA, mk.FusedMoEPrepareAndFinalizeModular is only concerned with taking a lora_id in prepare and getting out the local_lora_id. That is easy to unit test.
If it takes the entire LoRA context, it now has to be concerned with constructing that context and its punica wrapper, and with using the punica wrapper correctly so that lora_ctx.punica_wrapper.token_mapping_meta.token_lora_mapping is correct. Four attribute hops before we get to the input we want seems a bit much.
For now, we want to capture the full LoRA context upfront for better extensibility, but we can consider passing only lora_id in the future.
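For illustration, here is a minimal sketch of the narrower interface the comment proposes: prepare() consumes only a global lora_id and returns the rank-local id, so the class never touches the full MoELoRAContext or the punica wrapper. All names below are hypothetical stand-ins, not vLLM's actual API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PreparedInput:
    # Hypothetical container, not vLLM's actual type.
    hidden_states: Any
    local_lora_id: int


class LoRAIdPrepareFinalize:
    """Hypothetical sketch: prepare() maps a global lora_id to a
    rank-local one, keeping the unit-test surface to a single mapping."""

    def __init__(self, global_to_local: dict[int, int]):
        # Assumed to be rebuilt whenever the active adapter set changes.
        self.global_to_local = global_to_local

    def prepare(self, hidden_states: Any, lora_id: int) -> PreparedInput:
        # -1 marks "no adapter resident on this rank" by convention here.
        local_lora_id = self.global_to_local.get(lora_id, -1)
        return PreparedInput(hidden_states, local_lora_id)
```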
```python
local_token_lora_mapping = (
    lora_ctx.punica_wrapper.token_mapping_meta.token_lora_mapping[
        : a1.shape[0]
    ]
)
```
I'm not sure why this slicing is needed. The punica wrapper already slices token_lora_mapping to the sequence length, right? I think it was necessary before #39107 because MoE DP chunking would chunk the input without the punica wrapper knowing about it, but I'd think it isn't necessary now that DP chunking is no longer supported.
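To illustrate the scenario the comment describes (a sketch with made-up shapes, not the actual vLLM code path): under DP chunking, a1 was only a chunk of the batch while the wrapper's mapping covered the full sequence length, so the slice re-aligned the two; without chunking it is a no-op.

```python
import torch

token_lora_mapping = torch.zeros(8, dtype=torch.long)  # built for the full 8-token batch
a1 = torch.randn(3, 16)                                # a 3-token DP chunk of that batch

# The slice trims the full-batch mapping down to this chunk's tokens.
local_token_lora_mapping = token_lora_mapping[: a1.shape[0]]
assert local_token_lora_mapping.shape[0] == a1.shape[0]
```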
```python
# EP on the expert dim, fully_sharded on the LoRA rank dim — with
# mutually contradictory assumptions about which rank holds which
# expert's rank-shard.
assert not (self.base_layer.use_ep and lora_config.fully_sharded_loras), (
```
Out of curiosity, do you know of anyone using this fully_sharded_loras feature? At Prime, we hit some weird bugs with it, so we never use it, and I'd think the problem it addresses is basically solved by expert parallelism: you'd never want to be using this feature.
@HollowMan6 I know your team tried fully_sharded_loras, right?
Yes. It generally works okay, except for this bug: #35077 (comment). But once LoRA + EP is supported, I don't think we need to support enabling both at the same time.
```python
# Under EP the adapter tensors carry all global experts; slice this
# rank's owned range so downstream shapes line up with local buffers.
global_num_experts = self.base_layer.global_num_experts
ep_rank = self.base_layer.ep_rank
if (
    w1_lora_a.shape[0] == global_num_experts
    and num_experts != global_num_experts
):
    expert_start = ep_rank * num_experts
    expert_end = expert_start + num_experts
    w1_lora_a = w1_lora_a[expert_start:expert_end]
    w2_lora_a = w2_lora_a[expert_start:expert_end]
    w3_lora_a = w3_lora_a[expert_start:expert_end]
    w1_lora_b = w1_lora_b[expert_start:expert_end]
    w2_lora_b = w2_lora_b[expert_start:expert_end]
    w3_lora_b = w3_lora_b[expert_start:expert_end]
```
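Note that the Code Review summary above flags a potential TypeError on this path for non-gated MoE models, where some of these LoRA tensors can be None. A hedged sketch of one possible guard (the helper name is made up):

```python
def _slice_local_experts(t, expert_start, expert_end):
    # Hypothetical helper: non-gated MoE models may leave e.g. the w3
    # LoRA tensors as None, and None[expert_start:expert_end] raises a
    # TypeError, so skip None entries.
    return t if t is None else t[expert_start:expert_end]

w1_lora_a = _slice_local_experts(w1_lora_a, expert_start, expert_end)
w3_lora_b = _slice_local_experts(w3_lora_b, expert_start, expert_end)
```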
Should this slicing be moved to load instead? If it happens here in set (which does the CPU -> GPU copy), that means the CPU LoRAModels that LoRAModelManager holds still have all the LoRAs? If it's moved to load, then everything is "pre-sliced" at load time.
```python
if module.__class__.__name__ == "FusedMoEWithLoRA":
    replacements = replacements[
        : len(module.lora_a_stacked) // self.lora_slots
    ]
```
I'm actually kind of lost as to what is happening here 😓 I will read it in detail later. But a quick question out of curiosity: why do we do this packing at add time? Can we pack at load time and make add and set simple?
One benefit of moving it is that it makes loading more efficient: we don't need to allocate all the small 2D MoE tensors at load time and then pack them into 3D at add time. Instead, we can allocate in 3D and load the 2D slices into it with local expert subsetting!
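A minimal sketch of that idea, with made-up shapes and a stand-in loader function (none of these names are vLLM's):

```python
import torch

num_local_experts, max_rank, hidden = 4, 8, 1024
ep_rank = 1  # this rank's index in the EP group


def load_expert_lora_a(global_eid: int) -> torch.Tensor:
    # Stand-in for reading one expert's 2D LoRA-A weight from a checkpoint.
    return torch.zeros(max_rank, hidden)


# Allocate the stacked 3D tensor once, then copy each local expert's 2D
# slice straight into it: no per-expert 2D tensors, no pack step at add time.
lora_a_stacked = torch.empty(num_local_experts, max_rank, hidden)
for local_eid in range(num_local_experts):
    global_eid = ep_rank * num_local_experts + local_eid  # local expert subsetting
    lora_a_stacked[local_eid].copy_(load_expert_lora_a(global_eid))
```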


Purpose
Depends on #40338
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.