[Refactor] GLM-ASR Modeling #31779

Merged
Isotr0py merged 26 commits into vllm-project:main from JaredforReal:refactor/glmasr
Jan 7, 2026

Conversation

@JaredforReal
Contributor

@JaredforReal JaredforReal commented Jan 6, 2026

🚀 Key Improvements

1. Native vLLM Audio Encoder Implementation (glmasr.py)

  • Completely rewrote GlmAsrEncoder as a vLLM-native implementation with full optimization support:
    • QKVParallelLinear: Fused Q/K/V projections for efficient attention computation
    • ColumnParallelLinear/RowParallelLinear: Tensor parallelism support for distributed inference
    • Quantization support: Compatible with vLLM's quantization framework
    • SDPA attention: Leverages PyTorch's scaled_dot_product_attention, which dispatches to Flash Attention kernels when available
  • Implemented GlmAsrRotaryEmbedding with pre-computed cos/sin cache for faster RoPE computation
  • Built GlmAsrAttention, GlmAsrMLP, and GlmAsrEncoderLayer with optimized layer norms and residual connections
  • Proper grouped query attention (GQA) handling with k/v repetition
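The GQA handling and SDPA usage described above can be sketched roughly as follows. This is an illustrative toy, not the actual GLM-ASR code: the function name, shapes, and head counts are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Repeat each K/V head n_rep times so grouped-query attention can
    # share one K/V head across n_rep query heads.
    # x: (batch, num_kv_heads, seq_len, head_dim)
    if n_rep == 1:
        return x
    batch, num_kv_heads, seq_len, head_dim = x.shape
    x = x[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return x.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

# Toy shapes: 8 query heads grouped over 2 K/V heads.
batch, seq_len, head_dim = 1, 16, 32
num_heads, num_kv_heads = 8, 2
q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

k = repeat_kv(k, num_heads // num_kv_heads)
v = repeat_kv(v, num_heads // num_kv_heads)

# PyTorch picks an optimized backend (flash / memory-efficient / math)
# depending on the inputs and hardware.
out = F.scaled_dot_product_attention(q, k, v)
assert out.shape == (batch, num_heads, seq_len, head_dim)
```

Repeating K/V up front like this trades a little memory for compatibility with attention kernels that expect equal query and key/value head counts.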

2. Cleaner Architecture: Direct BaseMultiModalProcessor Inheritance (glmasr.py)

  • Refactored GlmAsrMultiModalProcessor to inherit directly from BaseMultiModalProcessor
  • Refactored GlmAsrProcessingInfo to inherit directly from BaseProcessingInfo
  • Refactored GlmAsrMultiModalDataParser to inherit directly from MultiModalDataParser
  • Removed dependency on AudioFlamingo3 for cleaner, more maintainable code
  • Streamlined processing pipeline with better performance and reduced complexity
  • Maintained full compatibility with existing API and functionality

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Copilot AI review requested due to automatic review settings January 6, 2026 06:26
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a great refactoring that replaces the GlmAsrEncoder with a native vLLM implementation, improving performance and maintainability by removing dependencies on AudioFlamingo3. The new implementation is well-structured and leverages vLLM's optimizations. I found one critical issue related to handling input sequences longer than the configured maximum, which could lead to a runtime error. I've provided a suggestion to fix it. Overall, this is a solid contribution.

Contributor

Copilot AI left a comment


Pull request overview

This pull request refactors the GLM-ASR model implementation to use a native vLLM encoder with comprehensive performance optimizations including tensor parallelism, quantization support, and Flash Attention. The refactoring also simplifies the architecture by removing dependencies on AudioFlamingo3 base classes and instead directly inheriting from vLLM's base multimodal processor classes.

Key Changes

  • Native vLLM encoder: Complete rewrite of GlmAsrEncoder with optimized components (GlmAsrRotaryEmbedding, GlmAsrAttention, GlmAsrMLP, GlmAsrEncoderLayer) using QKVParallelLinear for fused projections, tensor parallelism support, and Flash Attention
  • Direct base class inheritance: Refactored GlmAsrProcessingInfo, GlmAsrMultiModalProcessor, and GlmAsrMultiModalDataParser to inherit from BaseProcessingInfo, BaseMultiModalProcessor, and MultiModalDataParser respectively, removing AudioFlamingo3 dependencies
  • Enhanced utility functions: Added rotary embedding helpers (_rotate_half, _apply_rotary_pos_emb, _repeat_kv) and improved audio length calculation logic in glmasr_utils.py
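Rotary-embedding helpers of the kind named here (`_rotate_half`, `_apply_rotary_pos_emb`) typically look something like the sketch below, with the cos/sin tables pre-computed once rather than rebuilt per forward pass. The shapes and cache construction are illustrative assumptions, not the exact glmasr_utils.py code:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # (x1, x2) -> (-x2, x1) on the two halves of the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # cos/sin come from a pre-computed cache indexed by position, so the
    # per-step cost is an elementwise multiply-add rather than recomputing
    # frequencies on every forward pass.
    q_rot = (q * cos) + (rotate_half(q) * sin)
    k_rot = (k * cos) + (rotate_half(k) * sin)
    return q_rot, k_rot

# Build a small cos/sin cache for head_dim=8 and the first 32 positions.
head_dim, max_pos = 8, 32
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = torch.outer(torch.arange(max_pos).float(), inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)  # pair i with i + head_dim // 2
cos_cache, sin_cache = emb.cos(), emb.sin()

seq_len = 4
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos_cache[:seq_len], sin_cache[:seq_len])
```

A quick sanity check on such helpers: RoPE is a per-pair 2-D rotation, so it preserves vector norms, and at position 0 (cos=1, sin=0) it is the identity.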

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

File | Description
vllm/model_executor/models/glmasr_utils.py | Added RoPE and GQA utility functions; improved audio output length calculation with better documentation and refactored logic
vllm/model_executor/models/glmasr.py | Complete native encoder implementation with optimized attention/MLP layers; refactored processor classes to remove the AudioFlamingo3 dependency; inlined and improved audio processing logic in _call_hf_processor
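Audio output length calculations of the kind mentioned for glmasr_utils.py usually chain the standard 1-D convolution length formula across the encoder's downsampling convolutions. A generic sketch follows; the kernel/stride/padding values are illustrative Whisper-style defaults, not taken from this PR:

```python
def conv1d_out_len(length: int, kernel: int, stride: int, padding: int = 0) -> int:
    # Standard 1-D convolution output length: floor((L + 2p - k) / s) + 1
    return (length + 2 * padding - kernel) // stride + 1

# Example: two conv layers (kernel=3, strides 1 then 2, padding=1)
# halve the number of mel frames, as in Whisper-style audio encoders.
frames = 3000
after_conv1 = conv1d_out_len(frames, kernel=3, stride=1, padding=1)       # 3000
after_conv2 = conv1d_out_len(after_conv1, kernel=3, stride=2, padding=1)  # 1500
```

Keeping this arithmetic in one documented helper avoids the off-by-one bugs that creep in when each call site re-derives the downsampled length.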


@JaredforReal
Contributor Author

@DarkLight1337 PTAL, thanks!

@DarkLight1337
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a significant refactoring of the GLM-ASR modeling implementation. It successfully replaces the HuggingFace GlmAsrEncoder with a vLLM-native version, which should improve performance and maintainability. The code is well-structured and leverages vLLM's optimized components effectively. I've found one minor correctness issue in a helper function where the implementation doesn't match its documentation, which I've flagged for correction to prevent potential future bugs.

Member

@DarkLight1337 DarkLight1337 left a comment


LGTM now. @Isotr0py can you check as well?

Member

@Isotr0py Isotr0py left a comment


LGTM now

@Isotr0py Isotr0py enabled auto-merge (squash) January 7, 2026 04:37
@JaredforReal
Contributor Author

@DarkLight1337 @Isotr0py Thanks

@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 7, 2026
@JaredforReal
Contributor Author

JaredforReal commented Jan 7, 2026

@DarkLight1337 @Isotr0py Sorry guys, I need one more change to pass the device to transformers.whisper.feature_extractor() for better performance.
PTAL!

auto-merge was automatically disabled January 7, 2026 06:25

Head branch was pushed to by a user without write access

@JaredforReal JaredforReal marked this pull request as draft January 7, 2026 09:08
@JaredforReal
Contributor Author

@DarkLight1337 @Isotr0py Back to the version you guys approved. Let's land it for now, Thanks!

@Isotr0py Isotr0py marked this pull request as ready for review January 7, 2026 10:06
@Isotr0py Isotr0py enabled auto-merge (squash) January 7, 2026 10:07
@Isotr0py Isotr0py merged commit 9741387 into vllm-project:main Jan 7, 2026
53 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

4 participants