[Refactor] GLM-ASR Modeling#31779
Conversation
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
There was a problem hiding this comment.
Code Review
This is a great refactoring that replaces the GlmAsrEncoder with a native vLLM implementation, improving performance and maintainability by removing dependencies on AudioFlamingo3. The new implementation is well-structured and leverages vLLM's optimizations. I found one critical issue related to handling input sequences longer than the configured maximum, which could lead to a runtime error. I've provided a suggestion to fix it. Overall, this is a solid contribution.
Signed-off-by: JaredforReal <w13431838023@gmail.com>
There was a problem hiding this comment.
Pull request overview
This pull request refactors the GLM-ASR model implementation to use a native vLLM encoder with comprehensive performance optimizations including tensor parallelism, quantization support, and Flash Attention. The refactoring also simplifies the architecture by removing dependencies on AudioFlamingo3 base classes and instead directly inheriting from vLLM's base multimodal processor classes.
Key Changes
- Native vLLM encoder: Complete rewrite of
GlmAsrEncoderwith optimized components (GlmAsrRotaryEmbedding,GlmAsrAttention,GlmAsrMLP,GlmAsrEncoderLayer) using QKVParallelLinear for fused projections, tensor parallelism support, and Flash Attention - Direct base class inheritance: Refactored
GlmAsrProcessingInfo,GlmAsrMultiModalProcessor, andGlmAsrMultiModalDataParserto inherit fromBaseProcessingInfo,BaseMultiModalProcessor, andMultiModalDataParserrespectively, removing AudioFlamingo3 dependencies - Enhanced utility functions: Added rotary embedding helpers (
_rotate_half,_apply_rotary_pos_emb,_repeat_kv) and improved audio length calculation logic inglmasr_utils.py
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
vllm/model_executor/models/glmasr_utils.py |
Added RoPE and GQA utility functions; improved audio output length calculation with better documentation and refactored logic |
vllm/model_executor/models/glmasr.py |
Complete native encoder implementation with optimized attention/MLP layers; refactored processor classes to remove AudioFlamingo3 dependency; inlined and improved audio processing logic in _call_hf_processor |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…hf_processor Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
|
@DarkLight1337 PTAL, thanks! |
Signed-off-by: JaredforReal <w13431838023@gmail.com>
…mb_cos Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request is a significant refactoring of the GLM-ASR modeling implementation. It successfully replaces the HuggingFace GlmAsrEncoder with a vLLM-native version, which should improve performance and maintainability. The code is well-structured and leverages vLLM's optimized components effectively. I've found one minor correctness issue in a helper function where the implementation doesn't match its documentation, which I've flagged for correction to prevent potential future bugs.
Signed-off-by: JaredforReal <w13431838023@gmail.com>
DarkLight1337
left a comment
There was a problem hiding this comment.
LGTM now. @Isotr0py can you check as well?
|
@DarkLight1337 @Isotr0py Thanks |
|
@DarkLight1337 @Isotr0py Sorry guys, I need one more change to pass the device to transformers.whisper.feature_extractor(), for better performance |
Head branch was pushed to by a user without write access
56a9ed9 to
430b19c
Compare
|
@DarkLight1337 @Isotr0py Back to the version you guys approved. Let's land it for now, Thanks! |
Signed-off-by: JaredforReal <w13431838023@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: JaredforReal <w13431838023@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: JaredforReal <w13431838023@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
🚀 Key Improvements
1. Native vLLM Audio Encoder Implementation (
glmasr.py)GlmAsrEncoderas a vLLM-native implementation with full optimization support:scaled_dot_product_attentionfor optimized attentionGlmAsrRotaryEmbeddingwith pre-computed cos/sin cache for faster RoPE computationGlmAsrAttention,GlmAsrMLP, andGlmAsrEncoderLayerwith optimized layer norms and residual connections2. Cleaner Architecture: Direct BaseMultiModalProcessor Inheritance (
glmasr.py)GlmAsrMultiModalProcessorto inherit directly fromBaseMultiModalProcessorGlmAsrProcessingInfoto inherit directly fromBaseProcessingInfoGlmAsrMultiModalDataParseto inherit directly fromBaseMultiModalDataParseAudioFlamingo3for cleaner, more maintainable codePurpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.