
Model: Support LiquidAI/LFM2 (Dense) (700M, 1.2B, 2.6B) #16065

Closed
blazingbhavneek wants to merge 5 commits into sgl-project:main from blazingbhavneek:feature/support-lfm2

Conversation

@blazingbhavneek
Contributor

@blazingbhavneek blazingbhavneek commented Dec 29, 2025

Motivation

Add support for the LFM2 model family (dense) by Liquid AI.

Modifications

Ported the model implementation from vLLM and Hugging Face Transformers.

Accuracy Tests

Command for SGLang Server:
python -m sglang.launch_server --model-path LiquidAI/LFM2-700M --port 30000 --attention-backend triton

Command for vLLM Server:
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model LiquidAI/LFM2-700M --disable-log-requests --port 21000
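Once the SGLang server is up, requests can be sent to its native `/generate` endpoint. A minimal sketch, assuming the server launched above on port 30000 (the prompt and sampling parameters are illustrative; no request is actually sent here):

```python
import json
import urllib.request


def build_generate_payload(prompt, max_new_tokens=64, temperature=0.0):
    # SGLang's native /generate endpoint takes a prompt string plus sampling params.
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(prompt, host="http://localhost:30000"):
    # POST the payload as JSON; the response body carries the generated text.
    req = urllib.request.Request(
        f"{host}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]


# Example payload for the server launched above (printed, not sent):
print(json.dumps(build_generate_payload("What is LFM2?"), indent=2))
```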

| Benchmark | Model | Engine | Accuracy |
|-----------|-----------|--------|----------|
| MMLU | LFM2-700M | SGLang | 0.491 |
| MMLU | LFM2-700M | vLLM | 0.495 |
| MMLU | LFM2-2.6B | SGLang | 0.641 |
| MMLU | LFM2-2.6B | vLLM | 0.641 |
| GSM8K | LFM2-700M | SGLang | 0.425 |
| GSM8K | LFM2-700M | vLLM | 0.430 |
| GSM8K | LFM2-2.6B | SGLang | 0.820 |
| GSM8K | LFM2-2.6B | vLLM | 0.790 |
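The SGLang and vLLM numbers above track each other closely; a quick sanity check of the largest engine-to-engine gap (values copied from the table):

```python
# (benchmark, model) -> (SGLang accuracy, vLLM accuracy), from the table above
results = {
    ("MMLU", "LFM2-700M"): (0.491, 0.495),
    ("MMLU", "LFM2-2.6B"): (0.641, 0.641),
    ("GSM8K", "LFM2-700M"): (0.425, 0.430),
    ("GSM8K", "LFM2-2.6B"): (0.820, 0.790),
}

gaps = {k: abs(sgl - vllm) for k, (sgl, vllm) in results.items()}
worst = max(gaps, key=gaps.get)
print(worst, round(gaps[worst], 3))  # ('GSM8K', 'LFM2-2.6B') 0.03
```

The worst case is a 0.03 gap on GSM8K for LFM2-2.6B, where SGLang scores higher.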

MMLU:

LFM2-700M

SGLang
❯ python3 bench_sglang.py --nsub 10
100%|████████████████████████████████████| 1369/1369 [00:07<00:00, 187.57it/s]
subject: abstract_algebra, #q:100, acc: 0.340
subject: anatomy, #q:135, acc: 0.444
subject: astronomy, #q:152, acc: 0.638
subject: business_ethics, #q:100, acc: 0.520
subject: clinical_knowledge, #q:265, acc: 0.570
subject: college_biology, #q:144, acc: 0.590
subject: college_chemistry, #q:100, acc: 0.330
subject: college_computer_science, #q:100, acc: 0.370
subject: college_mathematics, #q:100, acc: 0.310
subject: college_medicine, #q:173, acc: 0.532
Total latency: 7.303
Average accuracy: 0.491
vLLM
❯ python3 bench_other.py --nsub 10 --backend vllm
  0%|                                                  | 0/10 [00:00<?, ?it/s]Average accuracy 0.340, latency 1.60, #q: 100 - abstract_algebra
 10%|████▏                                     | 1/10 [00:01<00:14,  1.60s/it]Average accuracy 0.437, latency 1.46, #q: 135 - anatomy
 20%|████████▍                                 | 2/10 [00:03<00:12,  1.53s/it]Average accuracy 0.638, latency 2.83, #q: 152 - astronomy
 30%|████████████▌                             | 3/10 [00:05<00:14,  2.13s/it]Average accuracy 0.520, latency 1.87, #q: 100 - business_ethics
 40%|████████████████▊                         | 4/10 [00:07<00:12,  2.03s/it]Average accuracy 0.585, latency 3.37, #q: 265 - clinical_knowledge
 50%|█████████████████████                     | 5/10 [00:11<00:12,  2.52s/it]Average accuracy 0.597, latency 2.14, #q: 144 - college_biology
 60%|█████████████████████████▏                | 6/10 [00:13<00:09,  2.39s/it]Average accuracy 0.340, latency 1.79, #q: 100 - college_chemistry
 70%|█████████████████████████████▍            | 7/10 [00:15<00:06,  2.20s/it]Average accuracy 0.400, latency 2.76, #q: 100 - college_computer_science
 80%|█████████████████████████████████▌        | 8/10 [00:17<00:04,  2.38s/it]Average accuracy 0.290, latency 1.93, #q: 100 - college_mathematics
 90%|█████████████████████████████████████▊    | 9/10 [00:19<00:02,  2.24s/it]Average accuracy 0.526, latency 2.77, #q: 173 - college_medicine
100%|█████████████████████████████████████████| 10/10 [00:22<00:00,  2.26s/it]
Total latency: 22.521
Average accuracy: 0.495
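The reported average is the question-count-weighted mean of the per-subject accuracies. Recomputing it from the SGLang per-subject lines above (values transcribed from the log, so small rounding drift is expected):

```python
# (subject, #questions, accuracy) from the SGLang MMLU run above
subjects = [
    ("abstract_algebra", 100, 0.340),
    ("anatomy", 135, 0.444),
    ("astronomy", 152, 0.638),
    ("business_ethics", 100, 0.520),
    ("clinical_knowledge", 265, 0.570),
    ("college_biology", 144, 0.590),
    ("college_chemistry", 100, 0.330),
    ("college_computer_science", 100, 0.370),
    ("college_mathematics", 100, 0.310),
    ("college_medicine", 173, 0.532),
]

total_q = sum(n for _, n, _ in subjects)  # 1369, matching the progress bar
weighted = sum(n * acc for _, n, acc in subjects) / total_q
print(total_q, round(weighted, 3))  # 1369 0.491
```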

Benchmarking and Profiling

Benchmark

python -m sglang.bench_one_batch --model-path LiquidAI/LFM2-700M --batch 4 --input-len 2048 --output-len 1024 --attention-backend triton
WARNING:sglang.srt.server_args:Disabling overlap schedule since mamba no_buffer is not compatible with overlap schedule, try to use --disable-radix-cache if overlap schedule is necessary
[2025-12-29 18:33:02 TP0] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-29 18:33:02 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-12-29 18:33:03 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/home/blazingbhavneek/miniconda3/envs/sglang/lib/python3.11/site-packages/transformers/__init__.py)
[2025-12-29 18:33:03 TP0] Load weight begin. avail mem=15.33 GB
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.32it/s]

[2025-12-29 18:33:03 TP0] Load weight end. type=Lfm2ForCausalLM, dtype=torch.bfloat16, avail mem=13.81 GB, mem usage=1.52 GB.
[2025-12-29 18:33:03 TP0] Using KV cache dtype: torch.bfloat16
[2025-12-29 18:33:03 TP0] Mamba Cache is allocated. max_mamba_cache_size: 85292, conv_state size: 4.88GB, ssm_state size: 0.00GB 
[2025-12-29 18:33:03 TP0] KV Cache is allocated. #tokens: 473854, K size: 2.71 GB, V size: 2.71 GB
[2025-12-29 18:33:03 TP0] Memory pool end. avail mem=2.43 GB
[2025-12-29 18:33:03 TP0] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-29 18:33:03 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=2.42 GB
[2025-12-29 18:33:03 TP0] Capture cuda graph bs [1, 2, 4, 8]
Capturing batches (bs=1 avail_mem=2.39 GB): 100%|██████████| 4/4 [00:00<00:00,  7.80it/s]
[2025-12-29 18:33:04 TP0] Capture cuda graph end. Time elapsed: 1.01 s. mem usage=0.03 GB. avail mem=2.39 GB.
max_total_num_tokens=473854
Warmup ...
[2025-12-29 18:33:04 TP0] Reset HybridReqToTokenPool
Prefill. latency: 0.30282 s, throughput:  27052.23 token/s
Decode 0. Batch size: 4, latency: 0.17757 s, throughput:     22.53 token/s
Decode 1. Batch size: 4, latency: 0.00518 s, throughput:    771.52 token/s
Decode 2. Batch size: 4, latency: 0.00538 s, throughput:    743.27 token/s
Decode 3. Batch size: 4, latency: 0.00761 s, throughput:    525.59 token/s
Decode 4. Batch size: 4, latency: 0.00516 s, throughput:    775.31 token/s
Decode.  median latency: 0.00507 s, median throughput:    788.46 token/s
Total. latency:  0.636 s, throughput:  13084.74 token/s
Benchmark ...
[2025-12-29 18:33:05 TP0] Reset HybridReqToTokenPool
Prefill. latency: 0.27105 s, throughput:  30222.82 token/s
Decode 0. Batch size: 4, latency: 0.00520 s, throughput:    768.89 token/s
Decode 1. Batch size: 4, latency: 0.00498 s, throughput:    802.91 token/s
Decode 2. Batch size: 4, latency: 0.00717 s, throughput:    557.95 token/s
Decode 3. Batch size: 4, latency: 0.00514 s, throughput:    777.59 token/s
Decode 4. Batch size: 4, latency: 0.00503 s, throughput:    795.59 token/s
Decode.  median latency: 0.00482 s, median throughput:    830.67 token/s
Total. latency:  5.225 s, throughput:   2351.56 token/s
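For reference, the throughput figures in the log follow directly from batch size, sequence length, and step latency. A sketch reproducing the numbers from the second run (latencies taken from the log; they are printed rounded, so recomputed values can drift slightly):

```python
def prefill_throughput(batch_size, input_len, latency_s):
    # All input tokens across the batch are processed in one prefill pass.
    return batch_size * input_len / latency_s


def decode_throughput(batch_size, latency_s):
    # Each decode step emits one token per sequence in the batch.
    return batch_size / latency_s


# 30223.2 (log shows 30222.82; the logged latency is rounded to 5 decimals)
print(round(prefill_throughput(4, 2048, 0.27105), 1))
# 829.88 (log shows 830.67 for the median, same rounding caveat)
print(round(decode_throughput(4, 0.00482), 2))
```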

Checklist

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 29, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @blazingbhavneek, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands SGLang's model compatibility by introducing full support for the LiquidAI LFM2 model family. This integration allows users to leverage LFM2's efficient hybrid architecture, which is optimized for on-device deployment and fast inference, within the SGLang framework. The changes encompass the addition of model-specific configurations, the implementation of its unique attention and convolutional layers, and thorough validation through accuracy tests and benchmarks.

Highlights

  • New Model Support: Added comprehensive support for the LiquidAI LFM2 model family (700M, 1.2B, 2.6B), enabling their use within SGLang.
  • Hybrid Architecture Implementation: Implemented the LFM2 model's hybrid architecture, which combines attention and convolutional layers, drawing inspiration from vLLM and Hugging Face implementations.
  • Performance Validation: Included accuracy test results (MMLU, GSM8K) demonstrating comparable performance to vLLM, and provided benchmarking results for throughput.
  • System Integration: Integrated the new LFM2 model configuration and architecture into SGLang's core components, including model configuration, execution, and server arguments.
  • Documentation & Testing: Updated documentation to reflect LFM2 as a supported model and added unit tests to ensure proper functionality.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the LiquidAI/LFM2 model family. The changes include adding the model configuration, the model implementation itself, and integrating it into the existing server infrastructure and tests. The implementation appears to be a solid port from vLLM and Hugging Face Transformers. I've identified a few areas for improvement, including a potential TypeError that could occur in the model's forward pass, a couple of unused parameters, and a redundant computation. My review includes specific suggestions to address these points.

@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci

@tugot17
Contributor

tugot17 commented Jan 12, 2026

I also made a version of dense LFM2 that uses the conv1d kernel and has function calling integrated. We should probably try to merge these changes together first:

#16890

@blazingbhavneek
Contributor Author

Closing this PR at the request of LiquidAI reps.


Labels

documentation (Improvements or additions to documentation), run-ci
