Add LFM2 MoE model architecture #17997
Summary of Changes (Gemini Code Assist)
This pull request integrates the Liquid Foundation Model Mixture of Experts (LFM2-MoE) architecture, a hybrid model combining attention, ShortConv, and MoE layers. It introduces a dedicated configuration, implements the model's components, including a decoder layer that switches between dense MLP and sparse MoE blocks, and ensures compatibility with tensor parallelism. The changes also include refined weight loading mechanisms and initial integration tests to validate the new model's functionality.
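As a rough illustration of a decoder layer that "switches between dense MLP and sparse MoE blocks", the sketch below picks the feed-forward kind from the layer index. The config class, its field names, and the rule "the first N layers are dense" are assumptions made for this example; the summary does not state how the PR actually chooses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyMoeConfig:
    # Hypothetical config fields, for illustration only.
    num_hidden_layers: int
    num_dense_layers: int  # assumed rule: the first N layers use a dense MLP


def feed_forward_kinds(cfg: ToyMoeConfig) -> List[str]:
    """Return "dense" or "moe" for each decoder layer index."""
    return [
        "dense" if layer_idx < cfg.num_dense_layers else "moe"
        for layer_idx in range(cfg.num_hidden_layers)
    ]


kinds = feed_forward_kinds(ToyMoeConfig(num_hidden_layers=6, num_dense_layers=2))
```

Under this assumed rule, a decoder layer would build its dense MLP or its MoE block once at construction time, so the forward pass needs no per-token branching on the layer kind.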
Code Review
This pull request introduces support for the LFM2-MoE model architecture, including a new configuration and model implementation. It also refactors the existing LFM2 model to add tensor parallelism support for its convolution layers, a change that is leveraged in the new MoE version. The changes are well-structured and include necessary updates to model loading, configuration, and testing. I've identified a potential bug in the weight loading logic for convolution layers that appears in both the updated lfm2.py and the new lfm2_moe.py files, which I've detailed in the comments.
    if ".conv.conv.weight" in name:
        name = name.replace(".conv.conv.weight", ".conv.conv_weight")
        loaded_weight = loaded_weight.squeeze(1)  # (D, 1, K) -> (D, K)
    if ".conv.conv.bias" in name:
        name = name.replace(".conv.conv.bias", ".conv.conv_bias")
The replace calls for .conv.conv.weight and .conv.conv.bias are no-ops as they replace the string with itself. If the intention is to rename the parameter from the Hugging Face model to match the name in this implementation, the replacement string should be different. For example, if the SGLang parameter is named ...conv.conv_weight, the replacement should reflect that. Given that the parameter names seem to be identical between the HF model and this implementation (...conv.conv.weight), this block might be unnecessary. Please clarify the intent or correct the replacement logic.
    if ".conv.conv.weight" in name:
        name = name.replace(".conv.conv.weight", ".conv.conv_weight")
        loaded_weight = loaded_weight.squeeze(1)  # (D, 1, K) -> (D, K)
    if ".conv.conv.bias" in name:
        name = name.replace(".conv.conv.bias", ".conv.conv_bias")
Similar to the lfm2.py model, the replace calls for .conv.conv.weight and .conv.conv.bias are no-ops because they replace the string with itself. If a renaming is needed to load the Hugging Face model weights, the replacement string should be corrected. If the names are already aligned, this block of code can be removed.
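The renaming under discussion can be sketched without any tensor library. The example below is a minimal stand-in, assuming weights are nested Python lists rather than tensors; `remap_conv_param` and `RENAMES` are hypothetical names for illustration, and only the two suffix pairs come from the snippet shown in the review.

```python
# Suffix pairs copied from the reviewed snippet: checkpoint name -> local name.
RENAMES = {
    ".conv.conv.weight": ".conv.conv_weight",
    ".conv.conv.bias": ".conv.conv_bias",
}


def remap_conv_param(name, weight):
    """Rename a conv parameter; drop the singleton middle axis of conv weights."""
    for old, new in RENAMES.items():
        if name.endswith(old):
            if old.endswith(".weight"):
                # (D, 1, K) -> (D, K): each of the D rows wraps one length-K kernel.
                weight = [row[0] for row in weight]
            return name[: -len(old)] + new, weight
    # Any other parameter passes through unchanged.
    return name, weight


new_name, new_weight = remap_conv_param(
    "model.layers.0.conv.conv.weight",
    [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]],  # shape (2, 1, 3)
)
```

A quick way to catch the no-op case the reviewer describes is to assert that the name actually changed for parameters that are expected to be renamed.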
@tugot17 could you fix the lint first?
    @@ -0,0 +1,24 @@
    name: Internal Release Lint
Why update the lint & pr-test workflows here?
Yes, sorry, I mixed this in from my internal branch; will remove.
Force-pushed from 27f7135 to cf2aec1
- Replace nn.Linear with ColumnParallelLinear/RowParallelLinear
- Shard conv_weight and conv_bias along hidden dimension
- Add sharded_weight_loader for proper weight loading with TP
- Update forward methods to handle parallel linear tuple returns

This enables LFM2 and LFM2-MoE to run with tensor parallelism > 1.
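To make "shard conv_weight along hidden dimension" concrete, here is a minimal sketch of how each tensor-parallel rank could take its slice of a (D, K) weight. It uses plain lists instead of tensors, and `shard_rows` is a hypothetical helper; an even, contiguous split per rank is an assumption, and the actual sharded_weight_loader in the PR may differ.

```python
from typing import List


def shard_rows(weight: List[List[float]], tp_rank: int, tp_size: int) -> List[List[float]]:
    """Return the contiguous block of rows (hidden channels) owned by one rank."""
    hidden = len(weight)
    if hidden % tp_size != 0:
        raise ValueError("hidden size must be divisible by tp_size")
    per_rank = hidden // tp_size
    start = tp_rank * per_rank
    return weight[start : start + per_rank]


# Toy (D=4, K=3) weight; row d holds the kernel for hidden channel d.
full = [[float(d * 10 + k) for k in range(3)] for d in range(4)]
shards = [shard_rows(full, rank, tp_size=2) for rank in range(2)]
```

Because a depthwise short convolution treats each hidden channel independently, slicing by rows like this would let every rank convolve its own channels without communicating with other ranks.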
Force-pushed from cf2aec1 to 929a12c
@ispobock fixed, sorry, went one commit too far :)
    class TestToolChoiceLfm2Moe(TestToolChoiceLlama32):
        """Test tool_choice functionality with LiquidAI LFM2-MoE model"""

        @classmethod
        def setUpClass(cls):
            cls.flaky_tests = {
                "test_multi_tool_scenario_auto",
                "test_multi_tool_scenario_required",
            }

            cls.model = "LiquidAI/LFM2-8B-A1B"
            cls.base_url = DEFAULT_URL_FOR_TEST
            cls.api_key = "sk-123456"

            cls.process = popen_launch_server(
                cls.model,
                cls.base_url,
                timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
                api_key=cls.api_key,
                other_args=[
                    "--tool-call-parser",
                    "lfm2",
                ],
            )
            cls.base_url += "/v1"
            cls.tokenizer = get_tokenizer(cls.model)

        @unittest.skip("maxItems:1 bug causes whitespace stall")
        def test_tool_choice_required_non_streaming(self):
            pass

        @unittest.skip("maxItems:1 bug causes whitespace stall")
        def test_tool_choice_specific_function_non_streaming(self):
            pass
Since we've already tested the LFM2 parser functionality above and the 8B model is relatively heavy for this CI, I think we should remove it.
It is the only MoE model in the tests, so it might be useful for checking that the FusedMoE kernel has not regressed. As far as I know, this might be the smallest MoE model in SGLang (correct me if I'm wrong).
Ah, then maybe it's better to add another gsm8k test somewhere more appropriate, since this CI job is only for tool-call tests?
Tested with gsm8k: And mmlu: Matches the official number, thanks for the support!
/tag-and-rerun-ci
            return [i for i, lt in enumerate(self.layer_types) if lt == "full_attention"]

        @property
        def linear_layer_ids(self) -> List[int]:
I understand this is intended to reuse the mamba2 cache, but the naming here feels a bit odd. It might be better to rename this property to something like linear_att_layer_ids in the future. (Not blocking, just noticed when reading.)
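For readers without the diff at hand, the two properties can be sketched as follows. This is a toy config, not the PR's class: the name `full_attention_layer_ids` is a guess at the property whose body is shown above, and the "conv" label for non-attention layers is an assumption; only the "full_attention" comparison and the `linear_layer_ids` name come from the snippet.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyHybridConfig:
    # One entry per decoder layer, e.g. "conv" or "full_attention".
    layer_types: List[str]

    @property
    def full_attention_layer_ids(self) -> List[int]:
        # Hypothetical property name; the filter matches the reviewed snippet.
        return [i for i, lt in enumerate(self.layer_types) if lt == "full_attention"]

    @property
    def linear_layer_ids(self) -> List[int]:
        # Assumption: the non-attention (ShortConv) layers are labelled "conv".
        return [i for i, lt in enumerate(self.layer_types) if lt == "conv"]


cfg = ToyHybridConfig(layer_types=["conv", "conv", "full_attention", "conv"])
```

Splitting the indices this way would let a cache manager allocate a KV cache only for the attention layers and a separate state cache for the remaining ones, which is the reuse the reviewer alludes to.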
This PR introduces the Liquid Foundation Model Mixture of Experts (LFM2-MoE) architecture.
Example model using this architecture: LFM2-8B-A1B
How to run
Benchmarks
GPQA Diamond: 34.04 vs. 29.29 reported
IFBench: 26.53 vs. 25.85 reported
Integration test for function calling:
    pytest test/registered/openai_server/function_call/test_tool_choice.py::TestToolChoiceLfm2Moe -v -s
    ================================ 12 passed, 2 skipped, 2 warnings in 69.84s (0:01:09) ================================

We skip 2 tests related to this issue: #17998
Numerics
    {
      "prompt": "<|startoftext|><|im_start|>user\nThe capital of the United Kingdom is<|im_end|>\n<|im_start|>assistant\n",
      "hf_output": "The capital of the United Kingdom is **London**. While the UK does not have a single \"capital city\" in the traditional sense\u2014since it is",
      "sglang_output": "The capital of the United Kingdom is **London**. While the UK does not have a single \"capital city\" in the traditional sense\u2014since it is",
      "rouge_l": 1.0,
      "prefill_max_diff": 0.6650424003601074,
      "prefill_mean_diff": 0.1140863299369812,
      "decode_max_diff": 0.2501058578491211,
      "decode_mean_diff": 0.07756583392620087
    },
    {
      "prompt": "<|startoftext|><|im_start|>user\nToday is a sunny day and I like<|im_end|>\n<|im_start|>assistant\n",
      "hf_output": "Today is a sunny day and I like nothing more than stepping outside to feel the warm sun on my skin. There\u2019s something magical about the golden light filtering through",
      "sglang_output": "Today is a sunny day and I like nothing more than stepping outside to feel the warm sun on my skin. There\u2019s something magical about the golden light filtering through",
      "rouge_l": 1.0,
      "prefill_max_diff": 0.05985307693481445,
      "prefill_mean_diff": 0.01562848500907421,
      "decode_max_diff": 0.031732797622680664,
      "decode_mean_diff": 0.013126413337886333
    },
    {
      "prompt": "<|startoftext|><|im_start|>user\nAI is a field of computer science focused on<|im_end|>\n<|im_start|>assistant\n",
      "hf_output": "AI (Artificial Intelligence) is a field of computer science focused on creating systems or machines that can perform tasks requiring human-like intelligence. These tasks include learning from",
      "sglang_output": "AI (Artificial Intelligence) is a field of computer science focused on creating systems or machines that can perform tasks requiring human-like intelligence. These tasks include learning from",
      "rouge_l": 1.0,
      "prefill_max_diff": 0.4699575901031494,
      "prefill_mean_diff": 0.06591875851154327,
      "decode_max_diff": 0.7499990463256836,
      "decode_mean_diff": 0.09959111362695694
    }

We also tested on TP2:
Server works!
Note: This PR relies on tensor parallelism (TP) support for Liquid Foundation Models, which is introduced in this PR:
#17777
We can merge both at the same time, or merge the TP fix first and then these changes.
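As a reading aid for the Numerics section, the sketch below shows one plausible way fields such as prefill_max_diff and prefill_mean_diff could be computed: as the maximum and mean absolute difference between two equally long sequences of per-token values. That the compared values are log-probabilities is an assumption (the description does not say), and the function name and sample numbers are made up for illustration.

```python
from typing import List, Tuple


def max_and_mean_abs_diff(a: List[float], b: List[float]) -> Tuple[float, float]:
    """Return (max, mean) of the element-wise absolute differences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return max(diffs), sum(diffs) / len(diffs)


# Made-up per-token values standing in for the two backends' outputs.
hf_logprobs = [-0.5, -1.25, -2.0]
sglang_logprobs = [-0.25, -1.25, -2.5]
max_diff, mean_diff = max_and_mean_abs_diff(hf_logprobs, sglang_logprobs)
```

Reporting both statistics is useful because a single outlier token can dominate the maximum while the mean stays small, which is the pattern visible in the first and third samples above.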