Enable MiMo-Audio-7B end-to-end inference on Intel XPU #2983

Liangyx2 wants to merge 11 commits into vllm-project:main
Conversation
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com>
hsliuustc0106 left a comment:
NON-BLOCKING:
Test Coverage — XPU is experimental and CI does not run on XPU hardware. Since this PR adds device-type guards and XPU-specific configuration, please verify manually on XPU that:
- The model loads successfully with the new mimo_audio.yaml config
- Inference produces valid audio output for at least one query type (e.g., tts_sft)
- No runtime errors from CUDA-specific APIs on XPU
Consider adding a note in the PR description confirming which XPU configuration was tested.
```diff
 num_reqs = len(request_ids)
-is_capturing = torch.cuda.is_current_stream_capturing()
+if torch.cuda.is_available() and input_ids.device.type == "cuda":
+    is_capturing = torch.cuda.is_current_stream_capturing()
```
Don't other platforms support torch.xxx.is_current_stream_capturing()?
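The guarded call from the diff above can be factored into a small device-agnostic helper. This is an illustrative sketch only (the helper name and call site are not from the PR); it encodes the PR's stated behavior of returning `False` on non-CUDA devices:

```python
import torch


def is_current_stream_capturing(device: torch.device) -> bool:
    """Return True only when a CUDA graph capture is in progress.

    Hypothetical helper: on non-CUDA devices (e.g. XPU) this sketch does
    not consult any graph-capture API and conservatively returns False,
    mirroring the guard added in the PR.
    """
    if torch.cuda.is_available() and device.type == "cuda":
        return torch.cuda.is_current_stream_capturing()
    return False


# Usage mirroring the patched call site:
input_ids = torch.zeros(4, dtype=torch.long)  # a CPU tensor here
is_capturing = is_current_stream_capturing(input_ids.device)
```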
```diff
@@ -0,0 +1,103 @@
+# XPU stage config for running MiMo-Audio with 2-stage architecture
```
There was a problem hiding this comment.
We don't introduce new stage configs here. Please refer to #2383 and add a correct deploy config.
cc @qibaoyuan

Thanks! Could you help us test this incoming PR on XPU and report any issues you encounter?
PR Description
Motivation
MiMo-Audio (XiaomiMiMo) is a multi-modal audio model supporting TTS, voice cloning, audio transcription, and spoken dialogue. Currently it only runs on CUDA. This PR enables MiMo-Audio inference on XPU (Intel GPU) by adding platform-specific stage configs and fixing several CUDA-only code paths that prevented the model from loading and running on non-CUDA devices.
Technical Details
- **XPU stage config (`mimo_audio.yaml`)**: Added a 2-stage pipeline config (Stage 0: `fused_thinker_talker` for LLM + audio code generation; Stage 1: `code2wav` for waveform synthesis) with XPU-specific knobs (`enforce_eager`, `disable_hybrid_kv_cache_manager`, `skip_mm_profiling`, memory utilization tuning).
- **Guard CUDA-only APIs**: Wrapped all `torch.cuda.is_current_stream_capturing()` calls in `mimo_audio.py`, `mimo_audio_code2wav.py`, and `mimo_audio_llm.py` with `torch.cuda.is_available() and device.type == "cuda"` checks, returning `False` on non-CUDA devices. This prevents runtime errors on XPU.
- **Fix device-hardcoded defaults**: Removed the `torch.device(f"cuda:{torch.cuda.current_device()}")` defaults in `mimo_audio_llm.py`'s `generate_audio_tokens` / `generate_audio_tokens_one_step` methods, replacing them with `local_embeds.device` to be device-agnostic.
- **Fix multimodal processor**: Added a `_hf_processor_applies_updates() -> False` override in `MiMoAudioLLMMultiModalProcessor` so that vLLM correctly applies prompt updates (audio placeholder expansion) instead of assuming the HF processor already did it.
- **Robustness improvements in `end2end.py`**:
  - Limit reference audio length (`MAX_REF_AUDIO_SAMPLES`) to prevent model confusion (repetition / voice identity loss) with long clips.
  - Clamp `code2wav` input tokens to `MAX_CODE2WAV_TOKENS=8192` in the stage input processor to prevent OOM.
  - Default `--text` to `None` so each query type uses its own sensible default.

Performance Impact

- XPU runs in eager mode (`enforce_eager=true`) with conservative memory settings (`gpu_memory_utilization`: 0.4 / 0.35).

Workload Mapping

- `XiaomiMiMo/MiMo-Audio` → `mimo_audio.yaml` (2-stage: `fused_thinker_talker` + `code2wav`)
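The device-default fix described under Technical Details follows a simple pattern: allocate new tensors on the device of an input tensor rather than on a hardcoded CUDA device. A minimal sketch (the function name and shapes are illustrative, not the PR's actual code):

```python
import torch


def generate_audio_tokens_sketch(local_embeds: torch.Tensor) -> torch.Tensor:
    # Before the fix (CUDA-only):
    #   device = torch.device(f"cuda:{torch.cuda.current_device()}")
    # After the fix: inherit the device from the incoming embeddings,
    # so the same path works on CUDA, XPU, or CPU.
    device = local_embeds.device
    return torch.zeros(local_embeds.shape[0], dtype=torch.long, device=device)


tokens = generate_audio_tokens_sketch(torch.randn(2, 8))
```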
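The multimodal-processor fix is a one-method override. A self-contained sketch with a stand-in base class (in vLLM the real base lives in `vllm.multimodal.processing` and the method's signature may differ):

```python
class BaseMultiModalProcessor:
    """Stand-in for vLLM's multimodal processor base class."""

    def _hf_processor_applies_updates(self) -> bool:
        # Default assumption: the HF processor already expanded
        # multimodal placeholders in the prompt.
        return True


class MiMoAudioLLMMultiModalProcessor(BaseMultiModalProcessor):
    def _hf_processor_applies_updates(self) -> bool:
        # The MiMo-Audio HF processor does not expand audio placeholders,
        # so vLLM must apply the prompt updates itself.
        return False
```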
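The `end2end.py` input clamps reduce to plain truncation. A sketch of the token clamp (only the name and value `MAX_CODE2WAV_TOKENS=8192` come from the PR description; the function is hypothetical):

```python
MAX_CODE2WAV_TOKENS = 8192  # value stated in the PR description


def clamp_code2wav_tokens(tokens: list[int]) -> list[int]:
    """Truncate code2wav input tokens to bound memory use (OOM guard)."""
    return tokens[:MAX_CODE2WAV_TOKENS]


clamped = clamp_code2wav_tokens(list(range(10_000)))
```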