
Add LFM2 MoE model architecture #17997

Merged
ispobock merged 13 commits into sgl-project:main from tugot17:feature/lfm2-moe-clean
Feb 11, 2026

Conversation

@tugot17
Contributor

@tugot17 tugot17 commented Jan 30, 2026

This PR introduces Liquid Foundation Model Mixture of Experts architecture.

Example model using this architecture: LFM2-8B-A1B

How to run

sglang serve --model-path LiquidAI/LFM2-8B-A1B --tool-call-parser lfm2  

Benchmarks

GPQA Diamond: 34.04 vs. 29.29 reported
IFBench: 26.53 vs. 25.85 reported

Integration test for function calling:

pytest test/registered/openai_server/function_call/test_tool_choice.py::TestToolChoiceLfm2Moe -v -s 

================================ 12 passed, 2 skipped, 2 warnings in 69.84s (0:01:09) ================================

We skip 2 tests related to this issue: #17998

Numerics

{
  "prompt": "<|startoftext|><|im_start|>user\nThe capital of the United Kingdom is<|im_end|>\n<|im_start|>assistant\n",
  "hf_output": "The capital of the United Kingdom is **London**. While the UK does not have a single \"capital city\" in the traditional sense\u2014since it is",
  "sglang_output": "The capital of the United Kingdom is **London**. While the UK does not have a single \"capital city\" in the traditional sense\u2014since it is",
  "rouge_l": 1.0,
  "prefill_max_diff": 0.6650424003601074,
  "prefill_mean_diff": 0.1140863299369812,
  "decode_max_diff": 0.2501058578491211,
  "decode_mean_diff": 0.07756583392620087
},
{
  "prompt": "<|startoftext|><|im_start|>user\nToday is a sunny day and I like<|im_end|>\n<|im_start|>assistant\n",
  "hf_output": "Today is a sunny day and I like nothing more than stepping outside to feel the warm sun on my skin. There\u2019s something magical about the golden light filtering through",
  "sglang_output": "Today is a sunny day and I like nothing more than stepping outside to feel the warm sun on my skin. There\u2019s something magical about the golden light filtering through",
  "rouge_l": 1.0,
  "prefill_max_diff": 0.05985307693481445,
  "prefill_mean_diff": 0.01562848500907421,
  "decode_max_diff": 0.031732797622680664,
  "decode_mean_diff": 0.013126413337886333
},
{
  "prompt": "<|startoftext|><|im_start|>user\nAI is a field of computer science focused on<|im_end|>\n<|im_start|>assistant\n",
  "hf_output": "AI (Artificial Intelligence) is a field of computer science focused on creating systems or machines that can perform tasks requiring human-like intelligence. These tasks include learning from",
  "sglang_output": "AI (Artificial Intelligence) is a field of computer science focused on creating systems or machines that can perform tasks requiring human-like intelligence. These tasks include learning from",
  "rouge_l": 1.0,
  "prefill_max_diff": 0.4699575901031494,
  "prefill_mean_diff": 0.06591875851154327,
  "decode_max_diff": 0.7499990463256836,
  "decode_mean_diff": 0.09959111362695694
}
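
For reference, the max/mean diff fields above come from an element-wise comparison of the two backends' logits. A minimal sketch in plain Python (the flattened-list input and the function name are illustrative; the actual comparison script is not part of this PR):

```python
def logit_diffs(hf_logits, sglang_logits):
    """Element-wise absolute differences between two flat lists of logits."""
    diffs = [abs(a - b) for a, b in zip(hf_logits, sglang_logits)]
    return {"max_diff": max(diffs), "mean_diff": sum(diffs) / len(diffs)}

# Example: two nearly identical logit vectors.
print(logit_diffs([1.0, 2.0], [1.5, 2.0]))  # {'max_diff': 0.5, 'mean_diff': 0.25}
```

In practice the tensors would be compared separately per prefill and decode step, which is how the prefill_*/decode_* fields above are split.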

We also tested on TP2:

Capturing batches (bs=1 avail_mem=10.40 GB): 100%|███████████████████████████████████████| 36/36 [00:03<00:00, 10.33it/s]
[2026-01-29 19:56:23 TP0] Registering 1764 cuda graph addresses
[2026-01-29 19:56:24 TP0] Capture cuda graph end. Time elapsed: 4.04 s. mem usage=0.40 GB. avail mem=10.40 GB.
[2026-01-29 19:56:24 TP1] Capture cuda graph end. Time elapsed: 4.07 s. mem usage=0.40 GB. avail mem=10.40 GB.
[2026-01-29 19:56:24 TP0] max_total_num_tokens=5225878, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=10.40 GB
[2026-01-29 19:56:24] INFO:     Started server process [12519]
[2026-01-29 19:56:24] INFO:     Waiting for application startup.

Server works!

Note: This PR depends on tensor-parallelism (TP) support for Liquid Foundation Models, which is introduced in #17777.

We can merge both at the same time, or land the TP fix first and then these changes.

@gemini-code-assist
Contributor

Summary of Changes

Hello @tugot17, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the Liquid Foundation Model Mixture of Experts (LFM2-MoE) architecture, a hybrid model combining attention, ShortConv, and MoE layers. It introduces a dedicated configuration, implements the model's various components including a dynamic decoder layer that switches between dense MLP and sparse MoE blocks, and ensures compatibility with tensor parallelism. The changes also include refined weight loading mechanisms and initial integration tests to validate the new model's functionality.

Highlights

  • LFM2 MoE Architecture: Introduced the Liquid Foundation Model Mixture of Experts (LFM2-MoE) architecture, enabling support for models like LFM2-8B-A1B.
  • New Configuration: Added a dedicated Lfm2MoeConfig class to define the specific parameters for the LFM2-MoE model, including MoE-specific settings like num_dense_layers, num_experts, and routing mechanisms.
  • Hybrid Layer Implementation: Implemented Lfm2MoeDecoderLayer which dynamically uses either a dense MLP or a Sparse MoE block based on the layer ID, combining attention and ShortConv layers.
  • Tensor Parallelism for ShortConv: Enhanced Lfm2ShortConv to support tensor parallelism, sharding hidden dimensions and utilizing MergedColumnParallelLinear for efficient weight handling.
  • Optimized MoE Handling: Integrated FusedMoE and TopK for efficient batched expert computation and sigmoid routing in Lfm2MoeSparseMoeBlock.
  • Weight Loading Improvements: Updated weight loading logic across lfm2.py and lfm2_moe.py to correctly handle stacked parameters, sharded conv weights, and FusedMoE expert formats.
  • Integration Testing: Added a new test class TestToolChoiceLfm2Moe to validate tool choice functionality with the LFM2-MoE model, with specific tests skipped due to known issues.
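
The layer-ID-based switch between a dense MLP and a sparse MoE block described in the highlights can be sketched roughly as follows. All class names and the `num_dense_layers` cutoff semantics here are illustrative stand-ins, not the actual Lfm2MoeDecoderLayer code from this PR:

```python
# Toy sketch of the dense-vs-MoE feed-forward selection described above.
# See lfm2_moe.py in this PR for the real implementation.

class DenseMLP:
    kind = "dense"

class SparseMoeBlock:
    kind = "moe"

def build_feed_forward(layer_id: int, num_dense_layers: int):
    """The first num_dense_layers layers use a dense MLP; the rest use MoE."""
    if layer_id < num_dense_layers:
        return DenseMLP()
    return SparseMoeBlock()

layers = [build_feed_forward(i, num_dense_layers=2) for i in range(4)]
print([l.kind for l in layers])  # ['dense', 'dense', 'moe', 'moe']
```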



@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the LFM2-MoE model architecture, including a new configuration and model implementation. It also refactors the existing LFM2 model to add tensor parallelism support for its convolution layers, a change that is leveraged in the new MoE version. The changes are well-structured and include necessary updates to model loading, configuration, and testing. I've identified a potential bug in the weight loading logic for convolution layers that appears in both the updated lfm2.py and the new lfm2_moe.py files, which I've detailed in the comments.

Comment on lines +534 to +538
if ".conv.conv.weight" in name:
    name = name.replace(".conv.conv.weight", ".conv.conv_weight")
    loaded_weight = loaded_weight.squeeze(1)  # (D, 1, K) -> (D, K)
if ".conv.conv.bias" in name:
    name = name.replace(".conv.conv.bias", ".conv.conv_bias")
high

The replace calls for .conv.conv.weight and .conv.conv.bias are no-ops as they replace the string with itself. If the intention is to rename the parameter from the Hugging Face model to match the name in this implementation, the replacement string should be different. For example, if the SGLang parameter is named ...conv.conv_weight, the replacement should reflect that. Given that the parameter names seem to be identical between the HF model and this implementation (...conv.conv.weight), this block might be unnecessary. Please clarify the intent or correct the replacement logic.
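
To illustrate the reviewer's point: str.replace with an identical old and new substring returns the string unchanged, while a real rename requires a different target string (the parameter name below is a made-up example):

```python
name = "model.layers.0.conv.conv.weight"

# No-op: replacing a substring with itself leaves the string unchanged.
assert name.replace(".conv.conv.weight", ".conv.conv.weight") == name

# An actual rename must use a different replacement string.
renamed = name.replace(".conv.conv.weight", ".conv.conv_weight")
print(renamed)  # model.layers.0.conv.conv_weight
```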

Comment on lines +601 to +605
if ".conv.conv.weight" in name:
    name = name.replace(".conv.conv.weight", ".conv.conv_weight")
    loaded_weight = loaded_weight.squeeze(1)  # (D, 1, K) -> (D, K)
if ".conv.conv.bias" in name:
    name = name.replace(".conv.conv.bias", ".conv.conv_bias")
high

Similar to the lfm2.py model, the replace calls for .conv.conv.weight and .conv.conv.bias are no-ops because they replace the string with itself. If a renaming is needed to load the Hugging Face model weights, the replacement string should be corrected. If the names are already aligned, this block of code can be removed.

@tugot17 tugot17 requested a review from Kangyan-Zhou as a code owner January 30, 2026 15:44
@ispobock
Collaborator

ispobock commented Feb 8, 2026

@tugot17 could you fix the lint first?

@@ -0,0 +1,24 @@
name: Internal Release Lint
Collaborator

Why update the lint & pr-test workflows here?

Contributor Author

Yes, sorry, I mixed this in from my internal branch; will remove.

@tugot17 tugot17 force-pushed the feature/lfm2-moe-clean branch from 27f7135 to cf2aec1 Compare February 8, 2026 20:08
@tugot17 tugot17 force-pushed the feature/lfm2-moe-clean branch from cf2aec1 to 929a12c Compare February 8, 2026 20:19
@tugot17
Contributor Author

tugot17 commented Feb 8, 2026

@ispobock fixed, sorry, went one commit too far :)

Comment on lines +887 to +922
class TestToolChoiceLfm2Moe(TestToolChoiceLlama32):
    """Test tool_choice functionality with LiquidAI LFM2-MoE model"""

    @classmethod
    def setUpClass(cls):
        cls.flaky_tests = {
            "test_multi_tool_scenario_auto",
            "test_multi_tool_scenario_required",
        }

        cls.model = "LiquidAI/LFM2-8B-A1B"
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.api_key = "sk-123456"

        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            api_key=cls.api_key,
            other_args=[
                "--tool-call-parser",
                "lfm2",
            ],
        )
        cls.base_url += "/v1"
        cls.tokenizer = get_tokenizer(cls.model)

    @unittest.skip("maxItems:1 bug causes whitespace stall")
    def test_tool_choice_required_non_streaming(self):
        pass

    @unittest.skip("maxItems:1 bug causes whitespace stall")
    def test_tool_choice_specific_function_non_streaming(self):
        pass

Collaborator

Since we've already tested the LFM2 parser functionality above and the 8B model is relatively heavy for this CI, I think we should remove it.

Contributor Author


It is the only MoE model in the test; it might be useful for catching regressions in the FusedMoE kernel, and as far as I know this might be the smallest MoE model in SGLang (correct me if I'm wrong).

Collaborator

Ah then maybe it's better to add another gsm8k test elsewhere appropriate, as this CI is only for tool call test?

@JustinTong0323
Collaborator

Tested with gsm8k:

Accuracy: 0.830
Invalid: 0.000
Latency: 6.183 s
Output throughput: 2892.703 token/s

And mmlu:

Total latency: 58.788
Average accuracy: 0.648

Matches the official number, thanks for the support!

@ispobock
Collaborator

ispobock commented Feb 9, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Feb 9, 2026
        return [i for i, lt in enumerate(self.layer_types) if lt == "full_attention"]

    @property
    def linear_layer_ids(self) -> List[int]:
@ChangyiYang commented Feb 10, 2026

I understand this is intended to reuse the mamba2 cache, but the naming here feels a bit odd. It might be better to rename this property to something like linear_att_layer_ids in the future. (Not blocking, just noticed while reading.)
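
For context, the layer-ID filtering pattern discussed here can be sketched with a toy config. The class name and the "conv" layer-type value are assumptions made for illustration; only the list-comprehension pattern mirrors the diff:

```python
from typing import List

class HybridLayerConfig:
    """Toy config mirroring the layer_types filtering pattern in the diff."""

    def __init__(self, layer_types: List[str]):
        self.layer_types = layer_types

    @property
    def full_attention_layer_ids(self) -> List[int]:
        return [i for i, lt in enumerate(self.layer_types) if lt == "full_attention"]

    @property
    def linear_layer_ids(self) -> List[int]:
        # Conv layers reuse the linear-attention (mamba2) cache, hence the name.
        return [i for i, lt in enumerate(self.layer_types) if lt == "conv"]

cfg = HybridLayerConfig(["conv", "full_attention", "conv", "full_attention"])
print(cfg.full_attention_layer_ids)  # [1, 3]
print(cfg.linear_layer_ids)          # [0, 2]
```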

@ispobock ispobock merged commit ded068a into sgl-project:main Feb 11, 2026
141 of 156 checks passed
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Feb 11, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
