
Support bailing moe #8680

Merged
zhyncs merged 10 commits into sgl-project:main from ppraneth:support-bailing_moe
Aug 6, 2025

Conversation

@ppraneth
Contributor

@ppraneth ppraneth commented Aug 1, 2025

SGLang vs. vLLM: Verified MMLU Benchmark on Ling-lite Model

Model: inclusionAI/Ling-lite

Metric               SGLang    vLLM      SGLang Advantage
Avg. MMLU Accuracy   70.0%     69.9%     +0.1%
Total Latency        53.7s     94.8s     1.77x faster

Comparison with Official MMLU Scores

For inclusionAI/Ling-lite:

  • Official MMLU Score: 71.2%
  • The SGLang result was 1.2 percentage points below the official score.
  • The vLLM result was 1.3 percentage points below the official score.

Ling-lite Benchmark Results

Metric SGLang vLLM SGLang vs. vLLM Difference
Average Accuracy 70.0% 69.9% +0.1%
Total Latency 53.655s 94.752s -41.097s (1.77x faster)

Detailed Subject-by-Subject Comparison (Ling-lite)

Subject Questions SGLang Acc vLLM Acc SGLang vs. vLLM Diff
Abstract Algebra 100 48.0% 49.0% -1.0%
Anatomy 135 71.1% 71.1% 0.0%
Astronomy 152 77.0% 78.3% -1.3%
Business Ethics 100 75.0% 74.0% +1.0%
Clinical Knowledge 265 78.5% 77.7% +0.8%
College Biology 144 80.6% 78.5% +2.1%
College Chemistry 100 54.0% 54.0% 0.0%
College Computer Science 100 67.0% 64.0% +3.0%
College Mathematics 100 45.0% 46.0% -1.0%
College Medicine 173 74.0% 73.4% +0.6%
College Physics 102 49.0% 48.0% +1.0%
Computer Security 100 75.0% 76.0% -1.0%
Conceptual Physics 235 76.2% 75.7% +0.5%
Econometrics 114 57.9% 57.9% 0.0%
Electrical Engineering 145 69.7% 71.0% -1.3%
Elementary Mathematics 378 70.4% 70.4% 0.0%
Formal Logic 126 55.6% 56.3% -0.7%
Global Facts 100 43.0% 46.0% -3.0%
High School Biology 310 89.0% 89.7% -0.7%
High School Chemistry 203 61.6% 62.6% -1.0%
High School Computer Science 100 84.0% 84.0% 0.0%
High School European History 165 81.2% 83.0% -1.8%
High School Geography 198 89.4% 89.9% -0.5%
High School Government and Politics 193 93.3% 92.7% +0.6%
High School Macroeconomics 390 74.6% 74.9% -0.3%
High School Mathematics 270 53.3% 52.2% +1.1%
High School Microeconomics 238 84.9% 84.9% 0.0%
High School Physics 151 57.6% 57.6% 0.0%
High School Psychology 545 87.7% 88.3% -0.6%
High School Statistics 216 67.1% 67.1% 0.0%
High School US History 204 80.9% 79.9% +1.0%
High School World History 237 82.7% 82.7% 0.0%
Human Aging 223 69.1% 67.7% +1.4%
Human Sexuality 131 79.4% 78.6% +0.8%
International Law 121 80.2% 80.2% 0.0%
Jurisprudence 108 76.9% 76.9% 0.0%
Logical Fallacies 163 80.4% 79.8% +0.6%
Machine Learning 112 65.2% 65.2% 0.0%
Management 103 85.4% 85.4% 0.0%
Marketing 234 86.8% 86.8% 0.0%
Medical Genetics 100 78.0% 81.0% -3.0%
Miscellaneous 783 85.2% 84.5% +0.7%
Moral Disputes 346 74.3% 74.6% -0.3%
Moral Scenarios 895 36.5% 36.2% +0.3%
Nutrition 306 76.1% 75.2% +0.9%
Philosophy 311 75.6% 76.2% -0.6%
Prehistory 324 79.6% 79.0% +0.6%
Professional Accounting 282 55.0% 53.5% +1.5%
Professional Law 1534 49.9% 50.8% -0.9%
Professional Medicine 272 75.0% 73.9% +1.1%
Professional Psychology 612 75.5% 74.5% +1.0%
Public Relations 110 70.9% 69.1% +1.8%
Security Studies 245 78.0% 78.4% -0.4%
Sociology 201 87.6% 88.1% -0.5%
US Foreign Policy 100 82.0% 83.0% -1.0%
Virology 166 53.6% 54.2% -0.6%
World Religions 171 84.2% 84.8% -0.6%
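The overall accuracy reported above is the question-weighted mean of these per-subject scores, not a simple average of the percentages. As an illustrative sketch (not the benchmark's actual code), using the first three rows of the table:

```python
# Question-weighted average accuracy over MMLU subjects.
# Rows are (subject, question_count, accuracy); only a small subset of the
# table above is shown for illustration.
rows = [
    ("Abstract Algebra", 100, 0.480),
    ("Anatomy", 135, 0.711),
    ("Astronomy", 152, 0.770),
]

def weighted_accuracy(rows):
    """Average accuracy weighted by the number of questions per subject."""
    total_questions = sum(n for _, n, _ in rows)
    total_correct = sum(n * acc for _, n, acc in rows)
    return total_correct / total_questions

print(round(weighted_accuracy(rows), 3))
```

Large subjects such as Professional Law (1534 questions) therefore pull the average far more than 100-question subjects, which is why small per-subject swings mostly cancel out in the headline number.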

Technical Configurations

SGLang Benchmark Commands

# For Ling-lite
!python -m sglang.launch_server \
  --model-path inclusionAI/Ling-lite \
  --trust-remote-code \
  --port 30000
!python3 bench_sglang.py

vLLM Benchmark Commands

# For Ling-lite
!python3 -m vllm.entrypoints.api_server \
  --tokenizer-mode auto \
  --model inclusionAI/Ling-lite \
  --trust-remote-code \
  --disable-log-requests \
  --port 21000
!python3 bench_other.py --backend vllm

Motivation

Closes #8621

Modifications

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @ppraneth, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've reviewed this pull request, which primarily focuses on extending our model compatibility by adding support for Bailing Mixture-of-Experts (MoE) models. This integration includes the inclusionAI/Ling-lite and Ling-plus models, enhancing our framework's ability to handle these advanced architectures for improved performance and flexibility.

Highlights

  • New Model Support: I've added comprehensive support for Bailing Mixture-of-Experts (MoE) models, specifically integrating inclusionAI/Ling-lite and Ling-plus, which significantly expands the range of advanced models our system can serve.
  • Core MoE Architecture Implementation: A new Python module (python/sglang/srt/models/bailing_moe.py) has been introduced. This module defines the complete Bailing MoE model architecture, including its attention, MLP, and expert layers, adapted from the vLLM project to ensure compatibility and performance.
  • Specialized Weight Loading: I've implemented custom weight loading logic within the new Bailing MoE model. This ensures correct handling of the unique parameter structures inherent to MoE models, such as stacked and expert-specific weights, facilitating seamless model integration.
  • Documentation and Testing: The docs/supported_models/generative_models.md has been updated to reflect the newly supported Ling MoE models. Additionally, a dedicated test case for inclusionAI/Ling-lite has been added to test/srt/models/test_generation_models.py to validate its proper functionality and integration within our generation framework.
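The "stacked" weight handling mentioned above usually amounts to remapping per-projection checkpoint names onto fused model parameters during loading. The snippet below is a simplified, hypothetical sketch of that remapping; names like `gate_proj`, `up_proj`, and the fused `gate_up_proj` are illustrative and not necessarily the exact identifiers used in `bailing_moe.py`.

```python
# Hypothetical sketch of checkpoint-name remapping for stacked MoE weights.
# Separate gate/up projections in the checkpoint are folded into one fused
# parameter; the shard id tells the loader which half of the fused tensor
# a given checkpoint weight should fill.
STACKED_PARAMS = [
    # (fused param fragment, checkpoint fragment, shard id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]

def remap(checkpoint_name):
    """Return (model_param_name, shard_id) for a checkpoint weight name."""
    for fused, orig, shard_id in STACKED_PARAMS:
        if orig in checkpoint_name:
            return checkpoint_name.replace(orig, fused), shard_id
    # Non-stacked weights (embeddings, norms, ...) pass through unchanged.
    return checkpoint_name, None

print(remap("model.layers.0.mlp.experts.3.gate_proj.weight"))
```

A loop like this lets the loader iterate once over the checkpoint and copy each tensor into the right slice of the fused parameter, which is the pattern the stacked/expert-specific weight logic described here follows.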

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Bailing MoE model, including the model implementation, a test case, and documentation updates. The implementation appears correct and follows the project's patterns. I've provided a few suggestions to improve the documentation's clarity and the code's maintainability. Overall, great work!

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@strgrb strgrb self-requested a review August 4, 2025 03:29
@strgrb
Collaborator

strgrb commented Aug 4, 2025

@ppraneth Great Job! I'll take a look.

@strgrb strgrb requested a review from zhyncs August 4, 2025 03:34
@strgrb
Collaborator

strgrb commented Aug 4, 2025

@ant-yy Please help take a look.

@jeejeelee

@ppraneth Can you provide your benchmark detail information? Such as GPU information, bench scripts, etc.
I'm a bit curious about the latency differences between vllm vs sglang and want to reproduce it.

@ppraneth
Contributor Author

ppraneth commented Aug 4, 2025

@ppraneth Can you provide your benchmark detail information? Such as GPU information, bench scripts, etc. I'm a bit curious about the latency differences between vllm vs sglang and want to reproduce it.

I ran it on H100
Benchmark code:
SGLang:

import threading
import subprocess

def launch_server():
    process = subprocess.Popen(
        ["python", "-m", "sglang.launch_server", "--model-path", "inclusionAI/Ling-lite","--trust-remote-code","--port", "30000"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for line in process.stdout:
        print("[SERVER]", line.strip())

# Start server thread
server_thread = threading.Thread(target=launch_server)
server_thread.start()


!python3 bench_sglang.py 

Vllm:

import threading
import subprocess

def launch_server():
    process = subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.api_server","--tokenizer-mode","auto","--model", "inclusionAI/Ling-lite","--trust-remote-code","--disable-log-requests","--port", "21000"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True
    )
    for line in process.stdout:
        print("[SERVER]", line.strip())

# Start server thread
server_thread = threading.Thread(target=launch_server)
server_thread.start()

!python3 bench_other.py --backend vllm
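One caveat with these notebook snippets: the benchmark starts immediately after the server thread launches, so the first requests can race server startup. A small readiness poll avoids that; the `/health` path and port below are assumptions to adapt to the actual server configuration.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url, timeout=300.0, interval=2.0):
    """Poll `url` until it answers HTTP 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short sleep
        time.sleep(interval)
    return False

# Example (hypothetical endpoint): block until the server launched above is ready.
# ready = wait_for_server("http://127.0.0.1:30000/health")
```

Calling this between launching the server thread and running the bench script makes the latency numbers start from a warm, fully initialized server rather than including startup failures or retries.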

@jeejeelee

Thank you very much for your reply. Can you provide bench_sglang.py and bench_other.py?

@ppraneth
Contributor Author

ppraneth commented Aug 4, 2025

Thank you very much for your reply. Can you provide bench_sglang.py and bench_other.py?

https://github.com/sgl-project/sglang/tree/main/benchmark/mmlu

Collaborator
@strgrb strgrb left a comment


LGTM!

@jeejeelee

I tested locally on an A800, following the instructions above. The results are as follows:

  • sglang
Total latency: 80.394 s
Average accuracy: 0.715
  • vllm
Total latency: 75.417 s
Average accuracy: 0.718

For my environment, please refer to the attachment
test_lantency.txt

Once again, thank you for your helpful response

@zhyncs zhyncs merged commit d26ca84 into sgl-project:main Aug 6, 2025
57 of 63 checks passed

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025

Development

Successfully merging this pull request may close these issues.

[Model] Add Ling-lite and Ling-plus

5 participants