
Add Liquid Foundation Model (LFM2)#16890

Merged
ispobock merged 16 commits into sgl-project:main from tugot17:add-LFM2
Jan 22, 2026

Conversation

@tugot17
Contributor

@tugot17 tugot17 commented Jan 11, 2026

Summary

  • Add support for LiquidAI's LFM2 hybrid architecture (attention + ShortConv layers)
  • LFM2 uses gated 1D causal convolution (kernel=3) instead of attention in some layers, requiring SGLang's hybrid caching system
  • Tested with LiquidAI/LFM2.5-1.2B-Instruct and LiquidAI/LFM2-2.6B-Exp

Changes

  • srt/models/lfm2.py: Full model implementation with ShortConv layers
  • srt/configs/lfm2.py: LFM2 config with mamba2_cache_params
  • srt/configs/mamba_utils.py: Dynamic conv dtype selection (fixes CUDA graph capture)
  • srt/model_executor/model_runner.py: Add LFM2 to hybrid model detection
  • srt/function_call/lfm2_detector.py: Streaming parser for LFM2 tool call format
  • srt/function_call/function_call_parser.py: Register the lfm2 parser
  • test/srt/models/test_generation_models.py: Add LFM2 to the generation test suite
  • test/registered/function_call/test_function_call_parser.py: Unit tests for Lfm2Detector
  • test/registered/openai_server/function_call/test_tool_choice.py: Integration tests (TestToolChoiceLfm2)

Key Technical Details

ShortConv Architecture

  • ShortConv layers use a fixed-size cache (2 tokens), unlike attention's growing KV cache
  • Uses HybridReqToTokenPool + MambaPool for hybrid caching
  • The actual convolution is performed by the Triton kernel causal_conv1d_fn()
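For intuition, the fixed-size state follows directly from the kernel width: with kernel=3, producing the next output only ever needs the previous 2 inputs. Below is a dependency-free sketch of one gated decode step with simplified scalar math; the weights, gating values, and function name are made up for illustration, and the real per-channel computation is the fused Triton kernel causal_conv1d_fn().

```python
# Illustrative sketch of a gated short-causal-conv decode step (kernel size 3).
# KERNEL and the scalar math are stand-ins; SGLang's implementation operates on
# per-channel tensors inside a fused Triton kernel.

KERNEL = [0.2, 0.3, 0.5]  # hypothetical conv weights, oldest tap first

def shortconv_step(x_t, gate_t, state):
    """One decode step: state holds the previous kernel-1 = 2 inputs."""
    window = state + [x_t]                       # [x_{t-2}, x_{t-1}, x_t]
    y = sum(w * x for w, x in zip(KERNEL, window))
    new_state = window[1:]                       # fixed size: always 2 entries
    return gate_t * y, new_state

state = [0.0, 0.0]                               # zero-initialized conv cache
outputs = []
for x, g in [(1.0, 1.0), (2.0, 1.0), (3.0, 0.5)]:
    y, state = shortconv_step(x, g, state)
    outputs.append(y)

# The state never grows, unlike an attention KV cache.
assert len(state) == 2
```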

Conv State Dtype Fix

The causal_conv1d_update kernel requires conv state dtype to match input dtype exactly. We fixed a dtype mismatch that caused CUDA graph capture to fail:

  • Problem: mamba_utils.py hardcoded CONV_DTYPE = torch.bfloat16, but tests run models in torch.float16
  • Root cause: Server path sets SGLANG_MAMBA_SSM_DTYPE env var via ServerArgs, but test path uses Engine directly (bypassing this)
  • Solution:
    • Added get_conv_dtype() function that dynamically gets dtype from torch.get_default_dtype()
    • Added default fallback for SGLANG_MAMBA_SSM_DTYPE env var
    • Wrapped init_memory_pool() in set_default_torch_dtype(self.model_config.dtype) context

This fix is safe for other Mamba-based models (NemotronH, FalconH1, Qwen3Next); server behavior is unchanged.
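The resolution order described above can be sketched as follows. This is a dependency-free illustration that uses dtype names as strings; the actual get_conv_dtype() in the PR returns torch dtypes and consults torch.get_default_dtype(), and resolve_conv_dtype/VALID are names invented for this sketch.

```python
import os

# Sketch of the env-var-with-fallback dtype resolution (illustrative names):
# an explicit env var wins; otherwise follow the runtime default dtype, so the
# conv state always matches the input dtype required by causal_conv1d_update.

VALID = {"float32", "bfloat16", "float16"}

def resolve_conv_dtype(default_dtype: str) -> str:
    env = os.environ.get("SGLANG_MAMBA_CONV_DTYPE")
    if env in VALID:
        return env
    return default_dtype  # mirrors torch.get_default_dtype() in the real code

# Server path: env var set (via ServerArgs in the real code)
os.environ["SGLANG_MAMBA_CONV_DTYPE"] = "bfloat16"
assert resolve_conv_dtype("float16") == "bfloat16"

# Test/Engine path: no env var, so the conv state follows the model dtype,
# avoiding the bf16-state/fp16-input mismatch that broke CUDA graph capture
del os.environ["SGLANG_MAMBA_CONV_DTYPE"]
assert resolve_conv_dtype("float16") == "float16"
```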

Function Calling Support

Added Lfm2Detector for parsing LFM2's tool call format with special tokens:

<|tool_call_start|>[get_weather(city="Paris")]<|tool_call_end|>

Usage: --tool-call-parser lfm2
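For illustration, a minimal parser for the non-streaming Pythonic form might look like the sketch below. This is not the actual Lfm2Detector (which also handles streaming and JSON-style calls, and should use ast for robust argument parsing); parse_tool_call is a name invented here.

```python
import re

# Minimal sketch: extract one LFM2 tool call between the special tokens shown
# above. Not the real Lfm2Detector; naive kwarg splitting for clarity only.

TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>\[(\w+)\((.*?)\)\]<\|tool_call_end\|>", re.DOTALL
)

def parse_tool_call(text: str):
    m = TOOL_CALL_RE.search(text)
    if m is None:
        return None
    name, raw_args = m.group(1), m.group(2)
    args = {}
    for part in filter(None, (p.strip() for p in raw_args.split(","))):
        k, v = part.split("=", 1)
        args[k.strip()] = v.strip().strip('"')
    return name, args

out = parse_tool_call(
    'Sure.<|tool_call_start|>[get_weather(city="Paris")]<|tool_call_end|>'
)
assert out == ("get_weather", {"city": "Paris"})
```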

Tests

Logprob accuracy test (compares SGLang vs HuggingFace):

SGLANG_MAMBA_CONV_DTYPE=float16 ONLY_RUN=LiquidAI/LFM2.5-1.2B-Instruct pytest test/registered/models/test_generation_models.py -v -s

LFM2 function call parser (unit test):

pytest test/registered/function_call/test_function_call_parser.py::TestLfm2Detector -v -s

LFM2 function calling (integration test):

pytest test/registered/openai_server/function_call/test_tool_choice.py::TestToolChoiceLfm2 -v -s

Benchmark Performance

We ran MMLU-Pro and tau2 on our internal OpenAI-server-compatible benchmarking suite, and the scores match the declared performance.

Running on 1xH100 SXM5

#!/usr/bin/env bash
set -e

MODEL="${MODEL:-LiquidAI/LFM2.5-1.2B-Instruct}"
# MODEL="${MODEL:-LiquidAI/LFM2-2.6B-Exp}"
HOST="${HOST:-127.0.0.1}"
PORT="${PORT:-30001}"

echo "Starting SGLang server: $MODEL on $HOST:$PORT"

sglang serve \
    --model-path "$MODEL" \
    --host "$HOST" \
    --port "$PORT" \
    --tool-call-parser lfm2
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  3.77
Total input tokens:                      127216
Total input text tokens:                 127216
Total input vision tokens:               0
Total generated tokens:                  63834
Total generated tokens (retokenized):    62640
Request throughput (req/s):              132.79
Input token throughput (tok/s):          33786.60
Output token throughput (tok/s):         16953.32
Peak output token throughput (tok/s):    33486.00
Peak concurrent requests:                500
Total token throughput (tok/s):          50739.92
Concurrency:                             404.12
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3043.26
Median E2E Latency (ms):                 3341.52
---------------Time to First Token----------------
Mean TTFT (ms):                          863.90
Median TTFT (ms):                        786.99
P99 TTFT (ms):                           1621.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.01
Median TPOT (ms):                        18.84
P99 TPOT (ms):                           310.85
---------------Inter-Token Latency----------------
Mean ITL (ms):                           17.24
Median ITL (ms):                         8.57
P95 ITL (ms):                            18.11
P99 ITL (ms):                            120.99
Max ITL (ms):                            996.73
==================================================

@gemini-code-assist
Contributor

Summary of Changes

Hello @tugot17, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates comprehensive support for the Liquid Foundation Model 2 (LFM2) into the SGLang framework. LFM2 is a novel hybrid architecture that leverages both traditional attention mechanisms and efficient ShortConv layers. The changes include the complete model implementation, a specialized configuration for managing its unique hybrid caching requirements, and a new function call parser tailored to LFM2's specific format. This integration significantly expands SGLang's capabilities to support advanced hybrid models, ensuring efficient inference and accurate function calling.

Highlights

  • LFM2 Model Integration: Added full support for LiquidAI's LFM2 hybrid architecture, which combines attention and ShortConv layers for efficient processing.
  • Hybrid Caching System: Implemented SGLang's hybrid caching system (HybridReqToTokenPool + MambaPool) to manage both KV cache for attention layers and fixed-size state for ShortConv layers.
  • Dynamic Conv Dtype Selection: Introduced a fix for CUDA graph capture issues by dynamically selecting the convolution state dtype based on torch.get_default_dtype(), ensuring compatibility across different test and server environments.
  • LFM2 Function Calling Support: Added a dedicated Lfm2Detector for parsing LFM2's unique tool call format, supporting both Pythonic and JSON syntaxes, along with streaming parsing capabilities.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the Liquid Foundation Model (LFM2), a hybrid architecture model. The changes are comprehensive, covering model implementation, configuration, and function call parsing. A key improvement is the dynamic selection of convolution state dtype, which resolves a CUDA graph capture issue and enhances robustness. The new Lfm2Detector correctly handles both Pythonic and JSON-based tool call formats. The addition of extensive unit and integration tests ensures the new model is well-integrated and functions as expected. Overall, this is a high-quality contribution. I have one minor suggestion to remove some unused code for better maintainability.

@JustinTong0323
Collaborator

please resolve the conflict, thanks~

# Init memory pool and attention backends
self.init_memory_pool(min_per_gpu_memory)
# Set default dtype so mamba2_cache_params picks up the correct dtype for conv state
with set_default_torch_dtype(self.model_config.dtype):
Collaborator


I think it is unnecessary

Contributor Author


yeah, there was an issue when we initialized the model with fp16, but now I just propagate the dtype.

@yizhang2077 wdyt?

@JustinTong0323
Collaborator

JustinTong0323 commented Jan 16, 2026

/tag-and-rerun-ci

@JustinTong0323
Collaborator

Got this on B200:

Scheduler hit an exception: Traceback (most recent call last):
  File "/root/xinyuan/sglang/python/sglang/srt/managers/scheduler.py", line 2850, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/root/xinyuan/sglang/python/sglang/srt/managers/scheduler.py", line 346, in __init__
    self.init_cache_with_memory_pool()
  File "/root/xinyuan/sglang/python/sglang/srt/managers/scheduler.py", line 665, in init_cache_with_memory_pool
    self.tree_cache = MambaRadixCache(params)
                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/xinyuan/sglang/python/sglang/srt/mem_cache/mamba_radix_cache.py", line 377, in __init__
    self.page_size == 1
AssertionError: Page size must be 1 for MambaRadixCache v1, got 64

@tugot17
Contributor Author

tugot17 commented Jan 17, 2026

@JustinTong0323
ok this is odd, shouldn't happen, but I tested it only on H100; I will run it tomorrow on Blackwell and get back to you

LFM2 was failing on B200/SM100 because:
1. SM100 defaults to trtllm_mha backend which forces page_size=64
2. MambaRadixCache requires page_size=1 for hybrid models
3. Triton backend doesn't work because LFM2's first layer is conv, not attention

Add Lfm2ForCausalLM to server_args.py with same handling as NemotronH:
- Use flashinfer backend on SM100 (supports page_size=1)
- Disable overlap schedule with radix cache
- Block triton backend (layer 0 is not an attention layer)
@tugot17
Contributor Author

tugot17 commented Jan 17, 2026

Fixed! The issue was that LFM2 wasn't in the SM100 special handling in server_args.py.

On B200, the default backend is trtllm_mha which forces page_size=64, but MambaRadixCache
requires page_size=1. Added Lfm2ForCausalLM with the same handling as NemotronH — uses
flashinfer on SM100 instead.

(Also had to block triton backend since LFM2's first layer is a conv layer, not attention.)
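The backend selection described here can be sketched as follows; the function name, the non-SM100 default, and the backend strings are illustrative assumptions for this sketch, not SGLang's actual server_args.py code.

```python
# Hypothetical sketch of the SM100 special-casing described above.
# HYBRID_ARCHS and pick_attention_backend are invented names; the real logic
# lives in server_args.py.

HYBRID_ARCHS = {"NemotronHForCausalLM", "Lfm2ForCausalLM"}

def pick_attention_backend(arch: str, sm_version: int) -> str:
    if arch in HYBRID_ARCHS:
        if sm_version >= 100:
            # trtllm_mha forces page_size=64, but MambaRadixCache needs page_size=1
            return "flashinfer"
        # triton is blocked for LFM2: layer 0 is a conv layer, not attention
        return "fa3"
    # non-hybrid default (assumed for illustration)
    return "trtllm_mha" if sm_version >= 100 else "fa3"

assert pick_attention_backend("Lfm2ForCausalLM", 100) == "flashinfer"
assert pick_attention_backend("Lfm2ForCausalLM", 90) == "fa3"
```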

The tests pass now on B200 as well

@tugot17
Contributor Author

tugot17 commented Jan 19, 2026

@JustinTong0323 Could we merge it now?

@ChangyiYang
Contributor

ChangyiYang commented Jan 20, 2026

I will help verify this model on gsm8k to see if it works as expected, as requested by @JustinTong0323.
Results will be updated in this comment.

@ispobock
Collaborator

@ChangyiYang Could you share the test results?

@ChangyiYang
Contributor

@ChangyiYang Could you share the test results?

Let me get back to you by today.


# Propagate runtime dtype to hf_config so that hybrid models (mamba, LFM2, etc.)
# can use it for conv state cache dtype
self.hf_config.torch_dtype = self.dtype
Collaborator

@yizhang2077 yizhang2077 Jan 21, 2026


Could we use another way to pass the dtype to the conv? I worry it may affect other models here.
Besides this, the rest looks good to me.

Collaborator


agree with that

Collaborator


@tugot17 could you address this comment?

Contributor Author

@tugot17 tugot17 Jan 21, 2026


@yizhang2077 what do you think about changing it in mamba_utils.py?

def mamba2_state_dtype() -> Mamba2StateDType:
    dtype_map = {
        "float32": torch.float32,
        "bfloat16": torch.bfloat16,
        "float16": torch.float16,
    }
    conv_dtype = dtype_map.get(
        os.environ.get("SGLANG_MAMBA_CONV_DTYPE", "bfloat16"), torch.bfloat16
    )
    ssm_dtype = dtype_map.get(
        os.environ.get("SGLANG_MAMBA_SSM_DTYPE", "float32"), torch.float32
    )
    return Mamba2StateDType(conv=conv_dtype, temporal=ssm_dtype)

This way we could set SGLANG_MAMBA_CONV_DTYPE as an environment variable, similar to how it is already done for other models. Then the tests that use fp16 should just pass.

Would this make sense? I agree the current version is a bit too hacky.

Collaborator


I think we could use this way.

Contributor Author

@tugot17 tugot17 Jan 21, 2026


@yizhang2077

I also added this to the tests so they don't require setting the flag manually:

def assert_close_logits_and_output_strs(
    self,
    prompts: List[str],
    model_case: ModelCase,
    torch_dtype: torch.dtype,
) -> None:
    model_path = model_case.model_path
    prefill_tolerance, decode_tolerance, rouge_l_tolerance = (
        model_case.prefill_tolerance,
        model_case.decode_tolerance,
        model_case.rouge_l_tolerance,
    )
    max_new_tokens = 32

    # Set conv dtype for hybrid models to match inference dtype
    dtype_str = {torch.float16: "float16", torch.bfloat16: "bfloat16"}.get(
        torch_dtype, "bfloat16"
    )
    os.environ["SGLANG_MAMBA_CONV_DTYPE"] = dtype_str

@ChangyiYang
Contributor

I ran the model with this command:

#!/usr/bin/env bash
set -e

export CUDA_VISIBLE_DEVICES=1

MODEL="${MODEL:-LiquidAI/LFM2.5-1.2B-Instruct}"
# MODEL="${MODEL:-LiquidAI/LFM2-2.6B-Exp}"
HOST="${HOST:-127.0.0.1}"
PORT="${PORT:-30001}"

echo "Starting SGLang server: $MODEL on $HOST:$PORT"

sglang serve \
    --model-path "$MODEL" \
    --host "$HOST" \
    --port "$PORT" \
    --tool-call-parser lfm2

The test command and result are:

root@2763e46d6169:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-questions 1000 --host http://127.0.0.1 --port 30001
100%|███████████████████████| 1000/1000 [00:16<00:00, 60.97it/s]
Accuracy: 0.554
Invalid: 0.001
Latency: 16.421 s
Output throughput: 8042.606 token/s

Is this the expected performance?

@JustinTong0323
Collaborator

We also got the following results for LiquidAI/LFM2.5-1.2B-Instruct using lm_eval:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6126|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.6118|±  |0.0134|
|mmlu_pro|      2|custom-extract|      |exact_match|↑  |0.3661|±  |0.0043|

@tugot17 Could you please confirm it's the evaluation mismatch or the implementation's issue? Thanks~

@JustinTong0323
Collaborator

BTW, the tool call parser LGTM

# For ShortConv layers, we use a simplified Mamba2StateShape
# LFM2 doesn't use SSM state (state_size=0), only conv state
shape = Mamba2StateShape.create(
tp_world_size=tp_size,
Collaborator

@yizhang2077 yizhang2077 Jan 21, 2026


I think we need to refactor this later, but it is OK for the current PR. Mixing ShortConv-only models with Mamba models is tricky here. cc @ispobock @hebiao064

Collaborator


Yes, we need to do some refactor later.

@ispobock
Collaborator

@tugot17 Could you address the above comments and verify the accuracy? And then we can merge it soon.

@tugot17
Contributor Author

tugot17 commented Jan 21, 2026

@ispobock

I will run the internal eval tool today and get back to you.

@tugot17
Contributor Author

tugot17 commented Jan 21, 2026

@ispobock
I ran the internal eval tool on an H100 SXM5, through the OpenAI-compatible endpoint:

  • mmlu_pro: 44.07 vs. 44.35 reported
  • aime25: 13.00 vs. 14.00 reported
  • gpqa_diamond: 38.38 vs. 38.89 reported

This is very similar (some slightly better, some slightly worse) to the numbers I get from the internal vLLM.

@ispobock
Collaborator

ispobock commented Jan 21, 2026

Hi @tugot17, thanks for the evaluation! Could you address the torch_dtype comments in #16890 (comment)? I don't think we should update hf_config.torch_dtype in the model config.

@tugot17
Contributor Author

tugot17 commented Jan 21, 2026

Hi @tugot17, thanks for the evaluation! Could you address the torch_dtype comments in #16890 (comment)? I don't think we should update hf_config.torch_dtype in the model config.

Yes, I proposed another solution; this one was a workaround for the numeric tests running in fp16, which broke CUDA graph capture.

@ispobock
Collaborator

/rerun-stage unit-test-backend-4-gpu

@github-actions
Contributor

✅ Triggered unit-test-backend-4-gpu to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@tugot17
Contributor Author

tugot17 commented Jan 21, 2026

@yizhang2077 Could you approve again? I added one commit to make the tests smoother; see the comment above.

@ispobock
Collaborator

/tag-and-rerun-ci

@ispobock ispobock merged commit d6e2b88 into sgl-project:main Jan 22, 2026
180 of 228 checks passed
@vincentzed
Contributor

B300:

python3 -m sglang.launch_server --model LiquidAI/LFM2.5-1.2B-Instruct --enable-torch-compile \
    --cuda-graph-max-bs 4 --chunked-prefill-size -1

We found that such a workload could be as little as 180ms e2e. Amazing!

python3 -m sglang.bench_serving --backend sglang-oai-chat --num-prompts 256 --max-concurrency 1 --random-input-len 1024 --random-output-len 128 --warmup-requests 128
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  49.16     
Total input tokens:                      82131     
Total input text tokens:                 82131     
Total generated tokens:                  54126     
Total generated tokens (retokenized):    54029     
Request throughput (req/s):              5.21      
Input token throughput (tok/s):          1670.54   
Output token throughput (tok/s):         1100.92   
Peak output token throughput (tok/s):    1138.00   
Peak concurrent requests:                13        
Total token throughput (tok/s):          2771.47   
Concurrency:                             1.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   191.88    
Median E2E Latency (ms):                 119.90    
P90 E2E Latency (ms):                    460.06    
P99 E2E Latency (ms):                    760.65    
---------------Time to First Token----------------
Mean TTFT (ms):                          8.79      
Median TTFT (ms):                        8.11      
P99 TTFT (ms):                           17.45     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.86      
Median TPOT (ms):                        0.86      
P99 TPOT (ms):                           0.89      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.87      
Median ITL (ms):                         0.87      
P95 ITL (ms):                            0.89      
P99 ITL (ms):                            1.00      
Max ITL (ms):                            30.81     
==================================================

6 participants