
Model: Support LiquidAI/LFM2 (Dense) (700M, 1.2B, 2.6B) #16065

Closed
blazingbhavneek wants to merge 5 commits into sgl-project:main from blazingbhavneek:feature/support-lfm2

Conversation

@blazingbhavneek
Contributor

@blazingbhavneek blazingbhavneek commented Dec 29, 2025

Motivation

Add support for the LFM2 model family (dense) by Liquid AI.

Modifications

Ported the model implementation from vLLM and Hugging Face Transformers.

Accuracy Tests

Command for SGLang Server:
python -m sglang.launch_server --model-path LiquidAI/LFM2-700M --port 30000 --attention-backend triton

Command for vLLM Server:
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model LiquidAI/LFM2-700M --disable-log-requests --port 21000
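Once the SGLang server is up, requests can be sent to its native `/generate` endpoint. A minimal sketch, assuming the server launched above on port 30000 (the prompt and sampling parameters are illustrative; no request is actually sent here):

```python
import json
import urllib.request


def build_generate_payload(prompt, max_new_tokens=64, temperature=0.0):
    # SGLang's native /generate endpoint takes a prompt string plus sampling params.
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(prompt, host="http://localhost:30000"):
    # POST the payload as JSON; the response body carries the generated text.
    req = urllib.request.Request(
        f"{host}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]


# Example payload for the server launched above (printed, not sent):
print(json.dumps(build_generate_payload("What is LFM2?"), indent=2))
```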

| Benchmark | Model | Engine | Accuracy |
|-----------|-----------|--------|----------|
| MMLU | LFM2-700M | SGLang | 0.491 |
| MMLU | LFM2-700M | vLLM | 0.495 |
| MMLU | LFM2-2.6B | SGLang | 0.641 |
| MMLU | LFM2-2.6B | vLLM | 0.641 |
| GSM8K | LFM2-700M | SGLang | 0.425 |
| GSM8K | LFM2-700M | vLLM | 0.430 |
| GSM8K | LFM2-2.6B | SGLang | 0.820 |
| GSM8K | LFM2-2.6B | vLLM | 0.790 |
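The SGLang and vLLM numbers above track each other closely; a quick sanity check of the largest engine-to-engine gap (values copied from the table):

```python
# (benchmark, model) -> (SGLang accuracy, vLLM accuracy), from the table above
results = {
    ("MMLU", "LFM2-700M"): (0.491, 0.495),
    ("MMLU", "LFM2-2.6B"): (0.641, 0.641),
    ("GSM8K", "LFM2-700M"): (0.425, 0.430),
    ("GSM8K", "LFM2-2.6B"): (0.820, 0.790),
}

gaps = {k: abs(sgl - vllm) for k, (sgl, vllm) in results.items()}
worst = max(gaps, key=gaps.get)
print(worst, round(gaps[worst], 3))  # ('GSM8K', 'LFM2-2.6B') 0.03
```

The worst case is a 0.03 gap on GSM8K for LFM2-2.6B, where SGLang scores higher.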

MMLU:

LFM2-700M

SGLang
❯ python3 bench_sglang.py --nsub 10
100%|████████████████████████████████████| 1369/1369 [00:07<00:00, 187.57it/s]
subject: abstract_algebra, #q:100, acc: 0.340
subject: anatomy, #q:135, acc: 0.444
subject: astronomy, #q:152, acc: 0.638
subject: business_ethics, #q:100, acc: 0.520
subject: clinical_knowledge, #q:265, acc: 0.570
subject: college_biology, #q:144, acc: 0.590
subject: college_chemistry, #q:100, acc: 0.330
subject: college_computer_science, #q:100, acc: 0.370
subject: college_mathematics, #q:100, acc: 0.310
subject: college_medicine, #q:173, acc: 0.532
Total latency: 7.303
Average accuracy: 0.491
vLLM
❯ python3 bench_other.py --nsub 10 --backend vllm
  0%|                                                  | 0/10 [00:00<?, ?it/s]Average accuracy 0.340, latency 1.60, #q: 100 - abstract_algebra
 10%|████▏                                     | 1/10 [00:01<00:14,  1.60s/it]Average accuracy 0.437, latency 1.46, #q: 135 - anatomy
 20%|████████▍                                 | 2/10 [00:03<00:12,  1.53s/it]Average accuracy 0.638, latency 2.83, #q: 152 - astronomy
 30%|████████████▌                             | 3/10 [00:05<00:14,  2.13s/it]Average accuracy 0.520, latency 1.87, #q: 100 - business_ethics
 40%|████████████████▊                         | 4/10 [00:07<00:12,  2.03s/it]Average accuracy 0.585, latency 3.37, #q: 265 - clinical_knowledge
 50%|█████████████████████                     | 5/10 [00:11<00:12,  2.52s/it]Average accuracy 0.597, latency 2.14, #q: 144 - college_biology
 60%|█████████████████████████▏                | 6/10 [00:13<00:09,  2.39s/it]Average accuracy 0.340, latency 1.79, #q: 100 - college_chemistry
 70%|█████████████████████████████▍            | 7/10 [00:15<00:06,  2.20s/it]Average accuracy 0.400, latency 2.76, #q: 100 - college_computer_science
 80%|█████████████████████████████████▌        | 8/10 [00:17<00:04,  2.38s/it]Average accuracy 0.290, latency 1.93, #q: 100 - college_mathematics
 90%|█████████████████████████████████████▊    | 9/10 [00:19<00:02,  2.24s/it]Average accuracy 0.526, latency 2.77, #q: 173 - college_medicine
100%|█████████████████████████████████████████| 10/10 [00:22<00:00,  2.26s/it]
Total latency: 22.521
Average accuracy: 0.495
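The reported average is the question-count-weighted mean of the per-subject accuracies. Recomputing it from the SGLang per-subject lines above (values transcribed from the log, so small rounding drift is expected):

```python
# (subject, #questions, accuracy) from the SGLang MMLU run above
subjects = [
    ("abstract_algebra", 100, 0.340),
    ("anatomy", 135, 0.444),
    ("astronomy", 152, 0.638),
    ("business_ethics", 100, 0.520),
    ("clinical_knowledge", 265, 0.570),
    ("college_biology", 144, 0.590),
    ("college_chemistry", 100, 0.330),
    ("college_computer_science", 100, 0.370),
    ("college_mathematics", 100, 0.310),
    ("college_medicine", 173, 0.532),
]

total_q = sum(n for _, n, _ in subjects)  # 1369, matching the progress bar
weighted = sum(n * acc for _, n, acc in subjects) / total_q
print(total_q, round(weighted, 3))  # 1369 0.491
```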

Benchmarking and Profiling

Benchmark

python -m sglang.bench_one_batch --model-path LiquidAI/LFM2-700M --batch 4 --input-len 2048 --output-len 1024 --attention-backend triton
WARNING:sglang.srt.server_args:Disabling overlap schedule since mamba no_buffer is not compatible with overlap schedule, try to use --disable-radix-cache if overlap schedule is necessary
[2025-12-29 18:33:02 TP0] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-29 18:33:02 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-12-29 18:33:03 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/home/blazingbhavneek/miniconda3/envs/sglang/lib/python3.11/site-packages/transformers/__init__.py)
[2025-12-29 18:33:03 TP0] Load weight begin. avail mem=15.33 GB
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.32it/s]

[2025-12-29 18:33:03 TP0] Load weight end. type=Lfm2ForCausalLM, dtype=torch.bfloat16, avail mem=13.81 GB, mem usage=1.52 GB.
[2025-12-29 18:33:03 TP0] Using KV cache dtype: torch.bfloat16
[2025-12-29 18:33:03 TP0] Mamba Cache is allocated. max_mamba_cache_size: 85292, conv_state size: 4.88GB, ssm_state size: 0.00GB 
[2025-12-29 18:33:03 TP0] KV Cache is allocated. #tokens: 473854, K size: 2.71 GB, V size: 2.71 GB
[2025-12-29 18:33:03 TP0] Memory pool end. avail mem=2.43 GB
[2025-12-29 18:33:03 TP0] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-29 18:33:03 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=2.42 GB
[2025-12-29 18:33:03 TP0] Capture cuda graph bs [1, 2, 4, 8]
Capturing batches (bs=1 avail_mem=2.39 GB): 100%|██████████| 4/4 [00:00<00:00,  7.80it/s]
[2025-12-29 18:33:04 TP0] Capture cuda graph end. Time elapsed: 1.01 s. mem usage=0.03 GB. avail mem=2.39 GB.
max_total_num_tokens=473854
Warmup ...
[2025-12-29 18:33:04 TP0] Reset HybridReqToTokenPool
Prefill. latency: 0.30282 s, throughput:  27052.23 token/s
Decode 0. Batch size: 4, latency: 0.17757 s, throughput:     22.53 token/s
Decode 1. Batch size: 4, latency: 0.00518 s, throughput:    771.52 token/s
Decode 2. Batch size: 4, latency: 0.00538 s, throughput:    743.27 token/s
Decode 3. Batch size: 4, latency: 0.00761 s, throughput:    525.59 token/s
Decode 4. Batch size: 4, latency: 0.00516 s, throughput:    775.31 token/s
Decode.  median latency: 0.00507 s, median throughput:    788.46 token/s
Total. latency:  0.636 s, throughput:  13084.74 token/s
Benchmark ...
[2025-12-29 18:33:05 TP0] Reset HybridReqToTokenPool
Prefill. latency: 0.27105 s, throughput:  30222.82 token/s
Decode 0. Batch size: 4, latency: 0.00520 s, throughput:    768.89 token/s
Decode 1. Batch size: 4, latency: 0.00498 s, throughput:    802.91 token/s
Decode 2. Batch size: 4, latency: 0.00717 s, throughput:    557.95 token/s
Decode 3. Batch size: 4, latency: 0.00514 s, throughput:    777.59 token/s
Decode 4. Batch size: 4, latency: 0.00503 s, throughput:    795.59 token/s
Decode.  median latency: 0.00482 s, median throughput:    830.67 token/s
Total. latency:  5.225 s, throughput:   2351.56 token/s
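For reference, the throughput figures in the log follow directly from batch size, sequence length, and step latency. A sketch reproducing the numbers from the second run (latencies taken from the log; they are printed rounded, so recomputed values can drift slightly):

```python
def prefill_throughput(batch_size, input_len, latency_s):
    # All input tokens across the batch are processed in one prefill pass.
    return batch_size * input_len / latency_s


def decode_throughput(batch_size, latency_s):
    # Each decode step emits one token per sequence in the batch.
    return batch_size / latency_s


# 30223.2 (log shows 30222.82; the logged latency is rounded to 5 decimals)
print(round(prefill_throughput(4, 2048, 0.27105), 1))
# 829.88 (log shows 830.67 for the median, same rounding caveat)
print(round(decode_throughput(4, 0.00482), 2))
```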

Checklist

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 29, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @blazingbhavneek, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands SGLang's model compatibility by introducing full support for the LiquidAI LFM2 model family. This integration allows users to leverage LFM2's efficient hybrid architecture, which is optimized for on-device deployment and fast inference, within the SGLang framework. The changes encompass the addition of model-specific configurations, the implementation of its unique attention and convolutional layers, and thorough validation through accuracy tests and benchmarks.

Highlights

  • New Model Support: Added comprehensive support for the LiquidAI LFM2 model family (700M, 1.2B, 2.6B), enabling their use within SGLang.
  • Hybrid Architecture Implementation: Implemented the LFM2 model's hybrid architecture, which combines attention and convolutional layers, drawing inspiration from vLLM and Hugging Face implementations.
  • Performance Validation: Included accuracy test results (MMLU, GSM8K) demonstrating comparable performance to vLLM, and provided benchmarking results for throughput.
  • System Integration: Integrated the new LFM2 model configuration and architecture into SGLang's core components, including model configuration, execution, and server arguments.
  • Documentation & Testing: Updated documentation to reflect LFM2 as a supported model and added unit tests to ensure proper functionality.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the LiquidAI/LFM2 model family. The changes include adding the model configuration, the model implementation itself, and integrating it into the existing server infrastructure and tests. The implementation appears to be a solid port from vLLM and Hugging Face Transformers. I've identified a few areas for improvement, including a potential TypeError that could occur in the model's forward pass, a couple of unused parameters, and a redundant computation. My review includes specific suggestions to address these points.

@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci

@tugot17
Contributor

tugot17 commented Jan 12, 2026

I also made a version of dense LFM2 that uses the conv1d kernel and has function calling integrated. We should probably try to merge these changes together first:

#16890

@blazingbhavneek
Contributor Author

Closing this PR at the request of LiquidAI reps.


Labels

documentation (Improvements or additions to documentation), run-ci
