
Add LFM2 MoE model architecture #17997

Merged
ispobock merged 13 commits into sgl-project:main from tugot17:feature/lfm2-moe-clean
Feb 11, 2026

Conversation

@tugot17
Contributor

@tugot17 tugot17 commented Jan 30, 2026

This PR introduces Liquid Foundation Model Mixture of Experts architecture.

Example model using this architecture: LFM2-8B-A1B

How to run

sglang serve --model-path LiquidAI/LFM2-8B-A1B --tool-call-parser lfm2  

Benchmarks

GPQA Diamond: 34.04 vs. 29.29 reported
IFBench: 26.53 vs. 25.85 reported

Integration test for function calling:

pytest test/registered/openai_server/function_call/test_tool_choice.py::TestToolChoiceLfm2Moe -v -s 

================================ 12 passed, 2 skipped, 2 warnings in 69.84s (0:01:09) ================================

We skip 2 tests related to this issue: #17998

Numerics

{
  "prompt": "<|startoftext|><|im_start|>user\nThe capital of the United Kingdom is<|im_end|>\n<|im_start|>assistant\n",
  "hf_output": "The capital of the United Kingdom is **London**. While the UK does not have a single \"capital city\" in the traditional sense\u2014since it is",
  "sglang_output": "The capital of the United Kingdom is **London**. While the UK does not have a single \"capital city\" in the traditional sense\u2014since it is",
  "rouge_l": 1.0,
  "prefill_max_diff": 0.6650424003601074,
  "prefill_mean_diff": 0.1140863299369812,
  "decode_max_diff": 0.2501058578491211,
  "decode_mean_diff": 0.07756583392620087
},
{
  "prompt": "<|startoftext|><|im_start|>user\nToday is a sunny day and I like<|im_end|>\n<|im_start|>assistant\n",
  "hf_output": "Today is a sunny day and I like nothing more than stepping outside to feel the warm sun on my skin. There\u2019s something magical about the golden light filtering through",
  "sglang_output": "Today is a sunny day and I like nothing more than stepping outside to feel the warm sun on my skin. There\u2019s something magical about the golden light filtering through",
  "rouge_l": 1.0,
  "prefill_max_diff": 0.05985307693481445,
  "prefill_mean_diff": 0.01562848500907421,
  "decode_max_diff": 0.031732797622680664,
  "decode_mean_diff": 0.013126413337886333
},
{
  "prompt": "<|startoftext|><|im_start|>user\nAI is a field of computer science focused on<|im_end|>\n<|im_start|>assistant\n",
  "hf_output": "AI (Artificial Intelligence) is a field of computer science focused on creating systems or machines that can perform tasks requiring human-like intelligence. These tasks include learning from",
  "sglang_output": "AI (Artificial Intelligence) is a field of computer science focused on creating systems or machines that can perform tasks requiring human-like intelligence. These tasks include learning from",
  "rouge_l": 1.0,
  "prefill_max_diff": 0.4699575901031494,
  "prefill_mean_diff": 0.06591875851154327,
  "decode_max_diff": 0.7499990463256836,
  "decode_mean_diff": 0.09959111362695694
}
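
For reference, the max/mean diff fields above come from an element-wise comparison of the two backends' logits. A minimal sketch in plain Python (the flattened-list input and the function name are illustrative; the actual comparison script is not part of this PR):

```python
def logit_diffs(hf_logits, sglang_logits):
    """Element-wise absolute differences between two flat lists of logits."""
    diffs = [abs(a - b) for a, b in zip(hf_logits, sglang_logits)]
    return {"max_diff": max(diffs), "mean_diff": sum(diffs) / len(diffs)}

# Example: two nearly identical logit vectors.
print(logit_diffs([1.0, 2.0], [1.5, 2.0]))  # {'max_diff': 0.5, 'mean_diff': 0.25}
```

In practice the tensors would be compared separately per prefill and decode step, which is how the prefill_*/decode_* fields above are split.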

We also tested on TP2:

Capturing batches (bs=1 avail_mem=10.40 GB): 100%|███████████████████████████████████████| 36/36 [00:03<00:00, 10.33it/s]
[2026-01-29 19:56:23 TP0] Registering 1764 cuda graph addresses
[2026-01-29 19:56:24 TP0] Capture cuda graph end. Time elapsed: 4.04 s. mem usage=0.40 GB. avail mem=10.40 GB.
[2026-01-29 19:56:24 TP1] Capture cuda graph end. Time elapsed: 4.07 s. mem usage=0.40 GB. avail mem=10.40 GB.
[2026-01-29 19:56:24 TP0] max_total_num_tokens=5225878, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=10.40 GB
[2026-01-29 19:56:24] INFO:     Started server process [12519]
[2026-01-29 19:56:24] INFO:     Waiting for application startup.

Server works!

Note: This PR depends on tensor-parallelism (TP) support for Liquid Foundation Models, which is introduced in #17777.

We can merge both at the same time, or land the TP fix first and then these changes.

@gemini-code-assist
Contributor

Summary of Changes

Hello @tugot17, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the Liquid Foundation Model Mixture of Experts (LFM2-MoE) architecture, a hybrid model combining attention, ShortConv, and MoE layers. It introduces a dedicated configuration, implements the model's various components including a dynamic decoder layer that switches between dense MLP and sparse MoE blocks, and ensures compatibility with tensor parallelism. The changes also include refined weight loading mechanisms and initial integration tests to validate the new model's functionality.

Highlights

  • LFM2 MoE Architecture: Introduced the Liquid Foundation Model Mixture of Experts (LFM2-MoE) architecture, enabling support for models like LFM2-8B-A1B.
  • New Configuration: Added a dedicated Lfm2MoeConfig class to define the specific parameters for the LFM2-MoE model, including MoE-specific settings like num_dense_layers, num_experts, and routing mechanisms.
  • Hybrid Layer Implementation: Implemented Lfm2MoeDecoderLayer which dynamically uses either a dense MLP or a Sparse MoE block based on the layer ID, combining attention and ShortConv layers.
  • Tensor Parallelism for ShortConv: Enhanced Lfm2ShortConv to support tensor parallelism, sharding hidden dimensions and utilizing MergedColumnParallelLinear for efficient weight handling.
  • Optimized MoE Handling: Integrated FusedMoE and TopK for efficient batched expert computation and sigmoid routing in Lfm2MoeSparseMoeBlock.
  • Weight Loading Improvements: Updated weight loading logic across lfm2.py and lfm2_moe.py to correctly handle stacked parameters, sharded conv weights, and FusedMoE expert formats.
  • Integration Testing: Added a new test class TestToolChoiceLfm2Moe to validate tool choice functionality with the LFM2-MoE model, with specific tests skipped due to known issues.
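
The layer-ID-based switch between a dense MLP and a sparse MoE block described in the highlights can be sketched roughly as follows. All class names and the `num_dense_layers` cutoff semantics here are illustrative stand-ins, not the actual Lfm2MoeDecoderLayer code from this PR:

```python
# Toy sketch of the dense-vs-MoE feed-forward selection described above.
# See lfm2_moe.py in this PR for the real implementation.

class DenseMLP:
    kind = "dense"

class SparseMoeBlock:
    kind = "moe"

def build_feed_forward(layer_id: int, num_dense_layers: int):
    """The first num_dense_layers layers use a dense MLP; the rest use MoE."""
    if layer_id < num_dense_layers:
        return DenseMLP()
    return SparseMoeBlock()

layers = [build_feed_forward(i, num_dense_layers=2) for i in range(4)]
print([l.kind for l in layers])  # ['dense', 'dense', 'moe', 'moe']
```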



@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the LFM2-MoE model architecture, including a new configuration and model implementation. It also refactors the existing LFM2 model to add tensor parallelism support for its convolution layers, a change that is leveraged in the new MoE version. The changes are well-structured and include necessary updates to model loading, configuration, and testing. I've identified a potential bug in the weight loading logic for convolution layers that appears in both the updated lfm2.py and the new lfm2_moe.py files, which I've detailed in the comments.

Comment on lines +534 to +538
if ".conv.conv.weight" in name:
    name = name.replace(".conv.conv.weight", ".conv.conv_weight")
    loaded_weight = loaded_weight.squeeze(1)  # (D, 1, K) -> (D, K)
if ".conv.conv.bias" in name:
    name = name.replace(".conv.conv.bias", ".conv.conv_bias")
high

The replace calls for .conv.conv.weight and .conv.conv.bias are no-ops as they replace the string with itself. If the intention is to rename the parameter from the Hugging Face model to match the name in this implementation, the replacement string should be different. For example, if the SGLang parameter is named ...conv.conv_weight, the replacement should reflect that. Given that the parameter names seem to be identical between the HF model and this implementation (...conv.conv.weight), this block might be unnecessary. Please clarify the intent or correct the replacement logic.
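
To illustrate the reviewer's point: str.replace with an identical old and new substring returns the string unchanged, while a real rename requires a different target string (the parameter name below is a made-up example):

```python
name = "model.layers.0.conv.conv.weight"

# No-op: replacing a substring with itself leaves the string unchanged.
assert name.replace(".conv.conv.weight", ".conv.conv.weight") == name

# An actual rename must use a different replacement string.
renamed = name.replace(".conv.conv.weight", ".conv.conv_weight")
print(renamed)  # model.layers.0.conv.conv_weight
```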

Comment on lines +601 to +605
if ".conv.conv.weight" in name:
    name = name.replace(".conv.conv.weight", ".conv.conv_weight")
    loaded_weight = loaded_weight.squeeze(1)  # (D, 1, K) -> (D, K)
if ".conv.conv.bias" in name:
    name = name.replace(".conv.conv.bias", ".conv.conv_bias")
high

Similar to the lfm2.py model, the replace calls for .conv.conv.weight and .conv.conv.bias are no-ops because they replace the string with itself. If a renaming is needed to load the Hugging Face model weights, the replacement string should be corrected. If the names are already aligned, this block of code can be removed.

@tugot17 tugot17 requested a review from Kangyan-Zhou as a code owner January 30, 2026 15:44
@ispobock
Collaborator

ispobock commented Feb 8, 2026

@tugot17 could you fix the lint first?

@@ -0,0 +1,24 @@
name: Internal Release Lint
Collaborator

Why update the lint & pr-test workflows here?

Contributor Author

Yes, sorry, I mixed this in from my internal branch; will remove.

@tugot17 tugot17 force-pushed the feature/lfm2-moe-clean branch from 27f7135 to cf2aec1 Compare February 8, 2026 20:08
@tugot17 tugot17 force-pushed the feature/lfm2-moe-clean branch from cf2aec1 to 929a12c Compare February 8, 2026 20:19
@tugot17
Contributor Author

tugot17 commented Feb 8, 2026

@ispobock fixed, sorry, went one commit too far :)

Comment on lines +887 to +922
class TestToolChoiceLfm2Moe(TestToolChoiceLlama32):
    """Test tool_choice functionality with LiquidAI LFM2-MoE model"""

    @classmethod
    def setUpClass(cls):
        cls.flaky_tests = {
            "test_multi_tool_scenario_auto",
            "test_multi_tool_scenario_required",
        }

        cls.model = "LiquidAI/LFM2-8B-A1B"
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.api_key = "sk-123456"

        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            api_key=cls.api_key,
            other_args=[
                "--tool-call-parser",
                "lfm2",
            ],
        )
        cls.base_url += "/v1"
        cls.tokenizer = get_tokenizer(cls.model)

    @unittest.skip("maxItems:1 bug causes whitespace stall")
    def test_tool_choice_required_non_streaming(self):
        pass

    @unittest.skip("maxItems:1 bug causes whitespace stall")
    def test_tool_choice_specific_function_non_streaming(self):
        pass

Collaborator

Since we've already tested the LFM2 parser functionality above and the 8B model is relatively heavy for this CI, I think we should remove it.

Contributor Author


It is the only MoE model in the test; it might be useful for catching regressions in the FusedMoE kernel, and as far as I know this might be the smallest MoE model in SGLang (correct me if I'm wrong).

Collaborator

Ah then maybe it's better to add another gsm8k test elsewhere appropriate, as this CI is only for tool call test?

@JustinTong0323
Collaborator

Tested with gsm8k:

Accuracy: 0.830
Invalid: 0.000
Latency: 6.183 s
Output throughput: 2892.703 token/s

And mmlu:

Total latency: 58.788
Average accuracy: 0.648

Matches the official number, thanks for the support!

@ispobock
Collaborator

ispobock commented Feb 9, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Feb 9, 2026
        return [i for i, lt in enumerate(self.layer_types) if lt == "full_attention"]

    @property
    def linear_layer_ids(self) -> List[int]:
@ChangyiYang commented Feb 10, 2026

I understand this is intended to reuse the mamba2 cache, but the naming here feels a bit odd. It might be better to rename this property to something like linear_att_layer_ids in the future. (Not blocking, just noticed while reading.)
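
For context, the layer-ID filtering pattern discussed here can be sketched with a toy config. The class name and the "conv" layer-type value are assumptions made for illustration; only the list-comprehension pattern mirrors the diff:

```python
from typing import List

class HybridLayerConfig:
    """Toy config mirroring the layer_types filtering pattern in the diff."""

    def __init__(self, layer_types: List[str]):
        self.layer_types = layer_types

    @property
    def full_attention_layer_ids(self) -> List[int]:
        return [i for i, lt in enumerate(self.layer_types) if lt == "full_attention"]

    @property
    def linear_layer_ids(self) -> List[int]:
        # Conv layers reuse the linear-attention (mamba2) cache, hence the name.
        return [i for i, lt in enumerate(self.layer_types) if lt == "conv"]

cfg = HybridLayerConfig(["conv", "full_attention", "conv", "full_attention"])
print(cfg.full_attention_layer_ids)  # [1, 3]
print(cfg.linear_layer_ids)          # [0, 2]
```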

@ispobock ispobock merged commit ded068a into sgl-project:main Feb 11, 2026
141 of 156 checks passed
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Feb 11, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
