
Conversation

@weedge (Collaborator) commented Apr 12, 2025

feat:

  • add qwen2.5-omni task demo on modal
# run all
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s all

# run streaming cases
## run thinker-only token streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c asr_stream -d L4

## run thinker-only chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c thinker_chunk_stream -d L4

## run thinker talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_stream -d L4

## run text -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_segment_stream -d L40s

## run vision (video with audio) -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_chunk_stream -d L40s



# NOTE: to generate speech, you need to use SPEECH_SYS_PROMPT as the system prompt

# asr (audio understanding)
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task universal_audio_understanding

# audio to text and speech
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task voice_chatting

# vision (video without audio) to text
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task video_information_extracting
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task screen_recording_interaction

# vision (video with audio) to text and speech
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_math
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_music

# vision (video with audio) to text and speech with multi-round chat (needs more GPU memory)
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen2_5omni.py --task multi_round_omni_chatting

# batch requests
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen2_5omni.py --task batch_requests

# stream
# text -> text stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task thinker_chunk_stream 
# image -> text stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task image_stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task image_chunk_stream
# audio -> text stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task asr_stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task asr_chunk_stream
# video -> text stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task screen_recording_interaction_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task video_information_extracting_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task video_information_extracting_chunk_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task screen_recording_interaction_chunk_stream

# text -> text + chunk speech stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_stream

# text -> chunk text+speech stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_segment_stream

# vision(video with audio) -> text + chunk speech stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_math_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_music_stream

# vision(video with audio) -> chunk text+speech stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_math_chunk_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_music_chunk_stream

# text/vision/audio -> chunk text+speech stream | use sliding-window code2wav with the achatbot package
ACHATBOT_VERSION=0.0.9.post10 IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task achatbot_generate
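
For reference, a minimal hedged sketch of what the SPEECH_SYS_PROMPT note above means in plain transformers usage; the class names, generate arguments, and the exact prompt text follow the Hugging Face Qwen2.5-Omni model card and should be checked against the transformers commit pinned by this PR:

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

MODEL_PATH = "Qwen/Qwen2.5-Omni-7B"
# Assumed to match SPEECH_SYS_PROMPT in qwen2_5omni.py; this is the prompt from the official examples.
SPEECH_SYS_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating text and speech."
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SPEECH_SYS_PROMPT}]},
    {"role": "user", "content": [{"type": "text", "text": "Give me a short introduction to large language models."}]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)

# return_audio=True routes thinker hidden states through talker + code2wav to produce a waveform.
text_ids, audio = model.generate(**inputs, return_audio=True, speaker="Chelsie")
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("output.wav", audio.reshape(-1).detach().cpu().float().numpy(), samplerate=24000)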
  • add qwen2.5-omni vllm example (thinker_only, thinker2talker2wav, code2wav) on modal
IMAGE_GPU=L40s modal run src/llm/vllm/qwen2_5omni.py --task thinker_only

IMAGE_GPU=L40s modal run src/llm/vllm/qwen2_5omni.py --task thinker2talker2wav
#IMAGE_GPU=L40s:2 modal run src/llm/vllm/qwen2_5omni.py --task thinker2talker2wav --thinker-gpu-memory-utilization 0.9 --talker-gpu-memory-utilization 0.7

# slow with no torch compile
IMAGE_GPU=T4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav
IMAGE_GPU=L4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav

# fast with torch compile
IMAGE_GPU=L4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav --other-cmd-args "--enable-torch-compile"
IMAGE_GPU=L40s modal run src/llm/vllm/qwen2_5omni.py --task code2wav --other-cmd-args "--enable-torch-compile"
IMAGE_GPU=L4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav --other-cmd-args "--enable-torch-compile --odeint-method euler"
IMAGE_GPU=L4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav --other-cmd-args "--enable-torch-compile --multi-waveforms"
  • add qwen2_code2wav streaming from vllm and adapt it for achatbot (cfm dit + bigvgan); maybe add zmq as a connector (see the sketch below)
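
A rough sketch of what a zmq connector between the talker and code2wav stages could look like; this is not existing achatbot code, just a pyzmq PUSH/PULL outline with an assumed chunk size and message schema:

# Hypothetical sketch only: stream talker codec-token chunks to a separate code2wav worker over ZeroMQ.
import zmq

def fake_talker_codec_tokens():
    # Stand-in for the real talker vq-code token stream.
    yield from range(120)

def fake_code2wav(tokens):
    # Stand-in for the cfm dit + bigvgan stage; returns placeholder waveform bytes.
    return bytes(len(tokens))

def talker_producer(endpoint: str = "tcp://127.0.0.1:5555", chunk_size: int = 50):
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.bind(endpoint)
    buf = []
    for token in fake_talker_codec_tokens():
        buf.append(token)
        if len(buf) >= chunk_size:
            sock.send_pyobj({"codec_tokens": buf, "eos": False})
            buf = []
    sock.send_pyobj({"codec_tokens": buf, "eos": True})

def code2wav_consumer(endpoint: str = "tcp://127.0.0.1:5555"):
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.connect(endpoint)
    while True:
        msg = sock.recv_pyobj()
        yield fake_code2wav(msg["codec_tokens"])
        if msg["eos"]:
            break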

  • add qwen2_5omni_asr and unit test

LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
    THINKER_LLM_GEN_TEMPERATURE=0.9 \
    LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
    python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe_stream

LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
    THINKER_LLM_GEN_TEMPERATURE=0.9 \
    LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
    python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe

LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
    THINKER_LLM_GEN_TEMPERATURE=0.9 \
    LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
    python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe_with_bytes

Deploy the Modal fastapi-webrtc serve to run personal bots one by one:

  • vision bot
    run the webrtc vision bot serve (qwen2.5omni vision llm) with webrtc
# webrtc_vision_bot serve on qwen2.5omni vision llm 
IMAGE_NAME=qwen2.5omni IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L4 modal serve -e achatbot src/fastapi_webrtc_vision_bot_serve.py

curl the API to run a chat-room bot with webrtc (daily/livekit/agora); this example uses livekit_room

curl --location 'https://weedge-achatbot--fastapi-webrtc-vision-qwen2-5omni-b-4cb328-dev.modal.run/bot_join/chat-room/LivekitDescribeVisionBot' \
--header 'Content-Type: application/json' \
--data '{
  "chat_bot_name": "LivekitDescribeVisionBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitDescribeVisionBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "asr": "sense_voice",
    "llm": "llm_transformers_manual_qwen2_5omni_vision",
    "tts": "edge"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": { "stop_secs": 0.7 }
    },
    "asr": {
      "tag": "sense_voice_asr",
      "args": {
        "language": "zn",
        "model_name_or_path": "/root/.achatbot/models/FunAudioLLM/SenseVoiceSmall"
      }
    },
    "llm": {
      "tag": "llm_transformers_manual_qwen2_5omni_vision",
      "args": {
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151644, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
      }
    },
    "tts": {
      "tag": "tts_edge",
      "args": {
        "voice_name": "zh-CN-YunjianNeural",
        "language": "zh",
        "gender": "Male"
      }
    }
  },
  "config_list": []
}
'
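
The same bot_join request can also be sent from Python; a small sketch using requests, assuming the JSON payload above is saved to a local file:

import json
import requests

URL = (
    "https://weedge-achatbot--fastapi-webrtc-vision-qwen2-5omni-b-4cb328-dev"
    ".modal.run/bot_join/chat-room/LivekitDescribeVisionBot"
)

# Save the JSON body from the curl above to this file, then post it unchanged.
with open("livekit_describe_vision_bot.json") as f:
    payload = json.load(f)

resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.status_code, resp.text)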
  • vision voice bot

run webrtc_qwen2_5omni_vision_voice_bot serve with webrtc

# webrtc_audio_bot serve on default pip image
# need to create .env.example and add it to Modal Secrets for the webrtc key
IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L40s modal serve -e achatbot src/fastapi_webrtc_qwen2_5omni_vision_voice_bot_serve.py

curl the API to run a chat-room bot with webrtc (livekit_room)

# thinker generates chunk tokens and hidden states -> talker generates vq code tokens -> code2wav generates chunk wav | without use_sliding_window_code2wav
curl --location 'https://weedge-achatbot--fastapi-webrtc-qwen2-5omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitQwen2_5OmniVisionVoiceBot' \
--header 'Content-Type: application/json' \
--data '{
  "chat_bot_name": "LivekitQwen2_5OmniVisionVoiceBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitQwen2_5OmniVisionVoiceBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "omni_llm": "llm_transformers_manual_qwen2_5omni_vision_voice"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": { "stop_secs": 0.7 }
    },
    "omni_llm": {
      "tag": "llm_transformers_manual_qwen2_5omni_vision_voice",
      "args": {
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151644, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 2048,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_skip_thinker_token_ids": [],
        "talker_eos_token_ids": [8292, 8294],
        "code2wav_args": {
          "model_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B",
          "enable_torch_compile": false,
          "enable_torch_compile_first_chunk": false,
          "odeint_method": "euler",
          "odeint_method_relaxed": false,
          "batched_chunk": 3,
          "frequency": "50hz",
          "device": "cuda",
          "num_steps": 10,
          "guidance_scale": 0.5,
          "sway_coefficient": -1.0,
          "code2wav_dynamic_batch": false
        },
        "speaker": "Chelsie",
        "is_use_sliding_window_code2wav": false,
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
      }
    }
  },
  "config_list": []
}
'
# thinker generates chunk tokens and hidden states -> talker generates vq code tokens -> code2wav generates chunk wav | with use_sliding_window_code2wav | no torch.compile
curl --location 'https://weedge-achatbot--fastapi-webrtc-qwen2-5omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitQwen2_5OmniVisionVoiceBot' \
--header 'Content-Type: application/json' \
--data '{
  "chat_bot_name": "LivekitQwen2_5OmniVisionVoiceBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitQwen2_5OmniVisionVoiceBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "omni_llm": "llm_transformers_manual_qwen2_5omni_vision_voice"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": { "stop_secs": 0.7 }
    },
    "omni_llm": {
      "tag": "llm_transformers_manual_qwen2_5omni_vision_voice",
      "args": {
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151644, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 2048,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_skip_thinker_token_ids": [],
        "talker_eos_token_ids": [8292, 8294],
        "code2wav_args": {
          "model_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B",
          "enable_torch_compile": false,
          "enable_torch_compile_first_chunk": false,
          "odeint_method": "euler",
          "odeint_method_relaxed": false,
          "batched_chunk": 3,
          "frequency": "50hz",
          "device": "cuda",
          "num_steps": 10,
          "guidance_scale": 0.5,
          "sway_coefficient": -1.0,
          "code2wav_dynamic_batch": false
        },
        "speaker": "Chelsie",
        "is_use_sliding_window_code2wav": true,
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
      }
    }
  },
  "config_list": []
}
'
  • qwen2.5omni voice bot (speech->text + speech)
curl --location 'https://weedge-achatbot--fastapi-webrtc-qwen2-5omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitQwen2_5OmniVoiceBot' \
--header 'Content-Type: application/json' \
--data '{
  "chat_bot_name": "LivekitQwen2_5OmniVoiceBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitQwen2_5OmniVoiceBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "llm_transformers_manual_qwen2_5omni_audio_voice"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": { "stop_secs": 0.7 }
    },
    "voice_llm": {
      "tag": "llm_transformers_manual_qwen2_5omni_audio_voice",
      "args": {
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151644, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 2048,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_skip_thinker_token_ids": [],
        "talker_eos_token_ids": [8292, 8294],
        "code2wav_args": {
          "model_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B",
          "enable_torch_compile": false,
          "enable_torch_compile_first_chunk": false,
          "odeint_method": "euler",
          "odeint_method_relaxed": false,
          "batched_chunk": 3,
          "frequency": "50hz",
          "device": "cuda",
          "num_steps": 10,
          "guidance_scale": 0.5,
          "sway_coefficient": -1.0,
          "code2wav_dynamic_batch": false
        },
        "speaker": "Chelsie",
        "is_use_sliding_window_code2wav": true,
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
      }
    }
  },
  "config_list": []
}
'
  • asr + qwen2.5omni voice bot (text -> text + speech)
curl --location 'https://weedge-achatbot--fastapi-webrtc-qwen2-5omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitAsrQwen2_5OmniVoiceBot' \
--header 'Content-Type: application/json' \
--data '{
    "chat_bot_name": "LivekitAsrQwen2_5OmniVoiceBot",
    "room_name": "chat-room",
    "room_url": "",
    "token": "",
    "room_manager": {
        "tag": "livekit_room",
        "args": {
            "bot_name": "LivekitAsrQwen2_5OmniVoiceBot",
            "is_common_session": false
        }
    },
    "services": {
        "pipeline": "achatbot",
        "vad": "silero",
        "asr": "sense_voice",
        "voice_llm": "llm_transformers_manual_qwen2_5omni_text_voice"
    },
    "config": {
        "vad": {
            "tag": "silero_vad_analyzer",
            "args": {
                "stop_secs": 0.7
            }
        },
        "asr": {
            "args": {
                "language": "zn",
                "model_name_or_path": "/root/.achatbot/models/FunAudioLLM/SenseVoiceSmall"
            },
            "tag": "sense_voice_asr"
        },
        "voice_llm": {
            "tag": "llm_transformers_manual_qwen2_5omni_text_voice",
            "args": {
                "lm_device": "cuda",
                "lm_torch_dtype": "bfloat16",
                "lm_attn_impl": "flash_attention_2",
                "warmup_steps": 1,
                "chat_history_size": 0,
                "thinker_eos_token_ids": [
                    151644,
                    151645
                ],
                "thinker_args": {
                    "lm_gen_temperature": 0.95,
                    "lm_gen_top_k": 20,
                    "lm_gen_top_p": 0.9,
                    "lm_gen_min_new_tokens": 1,
                    "lm_gen_max_new_tokens": 1024,
                    "lm_gen_max_tokens_per_step": 10,
                    "lm_gen_repetition_penalty": 1.1
                },
                "talker_args": {
                    "lm_gen_temperature": 0.95,
                    "lm_gen_top_k": 20,
                    "lm_gen_top_p": 0.9,
                    "lm_gen_min_new_tokens": 1,
                    "lm_gen_max_new_tokens": 2048,
                    "lm_gen_repetition_penalty": 1.1
                },
                "talker_skip_thinker_token_ids": [],
                "talker_eos_token_ids": [
                    8292,
                    8294
                ],
                "code2wav_args": {
                    "model_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B",
                    "enable_torch_compile": false,
                    "enable_torch_compile_first_chunk": false,
                    "odeint_method": "euler",
                    "odeint_method_relaxed": false,
                    "batched_chunk": 3,
                    "frequency": "50hz",
                    "device": "cuda",
                    "num_steps": 10,
                    "guidance_scale": 0.5,
                    "sway_coefficient": -1.0,
                    "code2wav_dynamic_batch": false
                },
                "speaker": "Chelsie",
                "is_use_sliding_window_code2wav": true,
                "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
            }
        }
    },
    "config_list": []
}'


TMRoPE (Time-aligned Multimodal RoPE):

(figure)

talker -> codec tokens -> cfm dit -> mel -> bigvgan -> streaming waveforms (for the generation source code, see vllm):

(figure)
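
A stub-based sketch of the chunked code2wav flow in this diagram (codec tokens -> mel via the CFM DiT -> waveform via BigVGAN); the chunk/context sizes and helper names are assumptions, not the vllm implementation:

# Hypothetical sketch of chunked code2wav streaming: not the vllm implementation.
# The real pipeline runs the CFM DiT (codec tokens -> mel) and BigVGAN (mel -> wav) per chunk,
# keeping a left context of previous tokens so chunk boundaries stay continuous.
from typing import Iterator, List

CHUNK_TOKENS = 50   # assumed codec tokens per chunk (~1s at 50 Hz)
LEFT_CONTEXT = 25   # assumed context tokens prepended to each chunk

def dit_cfm_to_mel(codec_tokens: List[int]) -> List[float]:
    return [0.0] * (len(codec_tokens) * 2)   # stand-in for the flow-matching DiT

def bigvgan_to_wav(mel: List[float]) -> bytes:
    return bytes(len(mel))                   # stand-in for the BigVGAN vocoder

def code2wav_stream(codec_tokens: Iterator[int]) -> Iterator[bytes]:
    buf: List[int] = []
    context: List[int] = []
    for tok in codec_tokens:
        buf.append(tok)
        if len(buf) == CHUNK_TOKENS:
            mel = dit_cfm_to_mel(context + buf)
            # drop the mel frames that belong to the context before vocoding
            yield bigvgan_to_wav(mel[len(context) * 2:])
            context = buf[-LEFT_CONTEXT:]
            buf = []
    if buf:
        mel = dit_cfm_to_mel(context + buf)
        yield bigvgan_to_wav(mel[len(context) * 2:])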


reference

  • code2wav
  • inference


Tip

  • In the inference serving architecture, the service can be split into text/audio/image/video tokenizer, thinker, talker, and code2wav modules (each module's weights can be loaded directly from the full checkpoint, or a script can split the weights, re-save them, and load them separately). This allows heterogeneous use of different compute and memory resources to improve inference latency and throughput, and makes each module easier to test and maintain independently. (For LLM-style modules such as the thinker and talker, existing parallel inference and inference-optimization strategies can be reused.)
  • The current omni (vision+voice) model freezes existing modules and "welds" them together with multimodal token-alignment training; the thinker and talker are connected via projector hidden states, with the thinker acting as the LLM, so architecturally there is no fundamental change.
  • The qwen2.5omni DiT CFM supported by transformers only supports RungeKutta4ODESolver (rk4).
  • Support in the vllm and sglang inference frameworks is still incomplete; inference optimizations (chunked prefills) will be added for the thinker and talker LMs later.
  • Neither the official vllm fork nor the qwen2.5omni supported by transformers can output text and audio as two simultaneous streams, because the talker needs the hidden states (and embeddings) of the thinker's fully finished output as its prompt input; only a standalone thinker text stream and talker audio stream are supported (code2wav (dit flow + bigvgan) is similar to F5-TTS). For TTS-style tasks, an instruction dataset is needed for fine-tuning (but the execution chain gets longer, similar to minicpmo).
    • minicpmo supports simultaneous text and audio streams: during llm generate it emits at most 3 tokens per step and hands them to the downstream modules to generate the corresponding audio (at most 3 tokens each time); to avoid recomputing the KV cache, prefill + kvcache is required, see: https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/modeling_minicpmo.py#L1232 . In the same way, qwen2.5omni can generate audio from segments of the tokens produced by the thinker llm, combined with a sliding context window; only segmentation logic needs to be added on top of the existing flow (like real-time stream processing in big data, the same idea in essence: the sequence data becomes matrix-shaped data (fundamentally still sequence data), so big-data pipelines can also process matrix data with hardware-parallel matrix computation, e.g. TensorRT). A minimal sketch of this segment-streaming loop follows this list.
    • Evolution: 『 inference no streaming 』 => 『 thinker-only streaming 』 => 『 segment thinker-only streaming 』 => 『 thinker -> (talker + code2wav chunk) streaming 』 => 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 => 『 segment thinker -> Concurrency Batch segment (talker + code2wav chunk) streaming 』 => Concurrency (online) / Batch (offline) 『 segment thinker -> Concurrency Batch segment (talker + code2wav chunk) streaming 』
    • Evolving to 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 reduces TTFT (chunk) latency, but costs some extra GPU memory (cache) to hold context information: trading space for time.
    • The speech segments generated by qwen2.5omni 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 can sound discontinuous. If speech quality matters most and per-segment streaming text output is not required, use the 『 thinker -> (talker + code2wav chunk) streaming 』 mode; in the 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 mode, setting thinker_max_new_tokens = thinker_max_tokens_per_step means speech segments are streamed out only after the whole thinker generation finishes.
    • For cases where 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 produces discontinuous speech segments, tune the thinker_max_tokens_per_step parameter and add sentence-break token_ids to thinker_eos_token_ids so that, as with TTS, each generated speech chunk covers a complete piece of text as much as possible.
    • After the generated text is segmented, the downstream speech generation can run concurrently and in batches to improve throughput; this essentially reuses high-throughput optimization methods from data processing, and batching consumes more storage (RAM or GPU memory).
        # talker's structure of prompt tokens, embeddings and thinker_reply_part:
        #
        #   tokens: [input_tokens] + [codec_pad_token] + [codec_bos_token]
        #   embeddings: [input_embeds] + [text_bos_token] + [thinker_reply_part[0]]
        #   thinker_reply_part: [thinker_reply_part[1:]] + [text_eos_token] + [text_pad_token]
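
A minimal, hedged sketch of the segment-streaming loop described above (the thinker emits at most thinker_max_tokens_per_step tokens per step, segments are cut at sentence-break token ids, and each segment is handed to talker + code2wav); all model calls are stubs, not the achatbot or transformers API:

# Hypothetical sketch of 『 segment thinker -> segment (talker + code2wav chunk) streaming 』.
# All model calls are stubs; the real flow keeps thinker/talker KV caches across steps.
from typing import Iterator, List, Tuple

THINKER_MAX_TOKENS_PER_STEP = 10
SENTENCE_BREAK_IDS = {151644, 151645}  # thinker_eos_token_ids plus punctuation ids (assumed)

def thinker_step(state: dict) -> Tuple[List[int], List[float], bool]:
    # Stub: pretend the thinker emits up to THINKER_MAX_TOKENS_PER_STEP token ids per step.
    state["step"] = state.get("step", 0) + 1
    done = state["step"] >= 3
    tokens = list(range(THINKER_MAX_TOKENS_PER_STEP))
    if state["step"] == 2:
        tokens[-1] = 151644  # pretend a sentence break lands here
    if done:
        tokens[-1] = 151645  # eos
    return tokens, [0.0] * len(tokens), done

def talker_generate(seg_tokens: List[int], seg_hidden: List[float]) -> List[int]:
    # Stub for the talker: thinker text segment (tokens + hidden states) -> vq codec tokens.
    return [0] * (len(seg_tokens) * 2)

def code2wav_chunks(codec_tokens: List[int]) -> Iterator[bytes]:
    # Stub for code2wav (cfm dit + bigvgan): codec tokens -> waveform chunks.
    yield bytes(len(codec_tokens))

def omni_segment_stream(state: dict) -> Iterator[bytes]:
    seg_tokens: List[int] = []
    seg_hidden: List[float] = []
    done = False
    while not done:
        tokens, hidden, done = thinker_step(state)
        seg_tokens += tokens
        seg_hidden += hidden
        # flush a segment at a sentence break or when the thinker finishes
        if done or (tokens and tokens[-1] in SENTENCE_BREAK_IDS):
            for wav_chunk in code2wav_chunks(talker_generate(seg_tokens, seg_hidden)):
                yield wav_chunk
            seg_tokens, seg_hidden = [], []

for chunk in omni_segment_stream({}):
    print(len(chunk), "bytes of audio")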

Note

Using vllm inference:

  • thinker LM: model weights take 16.73 GiB; non_torch_memory takes 0.09 GiB; PyTorch activation peak memory takes 5.48 GiB; the rest of the memory is reserved for the KV cache, so the total memory reserved for the model is 22.3 GiB. Therefore thinker-gpu-memory-utilization * total_gpu_memory must be > 22.3 GiB.
  • talker LM: model weights take 2.55 GiB; non_torch_memory takes 0.08 GiB; PyTorch activation peak memory takes 4.36 GiB; the rest of the memory is reserved for the KV cache, so the total memory reserved for the model is 6.9 GiB. Therefore talker-gpu-memory-utilization * total_gpu_memory must be > 6.9 GiB.
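
A quick sanity check of these thresholds (the per-GPU memory figures below are rough assumptions, not measured values):

# Minimum --*-gpu-memory-utilization so that utilization * total_gpu_memory exceeds the
# reserved sizes quoted above; the usable GiB per GPU are approximate assumptions.
RESERVED_GIB = {"thinker": 22.3, "talker": 6.9}
GPU_GIB = {"L4": 22.5, "L40S": 44.5, "A100-80GB": 79.2}

for gpu, total in GPU_GIB.items():
    print(gpu, {name: round(need / total, 2) for name, need in RESERVED_GIB.items()})
# e.g. on one L40S the thinker needs utilization > ~0.50 and the talker > ~0.16,
# while a single L4 (~22.5 GiB usable) is borderline for the thinker alone.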

@gemini-code-assist (bot) left a comment
Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a new feature to run Qwen2.5-Omni, a multimodal model, on Modal. It includes the necessary code to set up the environment, load the model, and run inference on various tasks such as audio understanding, voice chatting, and video information extraction. The pull request also adds a script to automate the process of downloading models and assets, as well as running the different test cases.

Highlights

  • Qwen2.5-Omni Integration: Adds support for running Qwen2.5-Omni, a multimodal model, on the Modal platform.
  • Task Demos: Includes demos for various tasks such as universal audio understanding, voice chatting, video information extraction, screen recording interaction, and omni-chatting for math and music.
  • Automated Script: Introduces a shell script to automate downloading models/assets and running test cases with different configurations (GPU, task, etc.).

Changelog

Click here to see the changelog
  • .gitignore
    • Added *.mp4 to the ignored files list on line 165.
    • Ensured *.xml is included in the git repository on line 210.
  • deploy/modal/src/llm/transformers/qwen2_5omni.py
    • Introduces a new Modal app for Qwen2.5-Omni.
    • Sets up the environment with necessary dependencies (transformers, torch, flash-attn, etc.).
    • Defines functions for various multimodal tasks, including audio understanding, voice chatting, and video information extraction.
    • Implements an inference function to process and generate responses based on different input types (audio, images, videos).
    • Adds a main function to run the different tasks based on user input.
  • deploy/modal/src/llm/transformers/run_omni_cases.sh
    • Introduces a shell script to automate the process of downloading models and assets.
    • Provides command-line arguments to configure the GPU, task, model type, and transformers commit.
    • Includes functions to run different test cases with specified configurations.
    • Adds a usage function to display help information.


A model of sight and sound,
Qwen's Omni, profound.
Modal's cloud takes flight,
Processing day and night,
New AI wonders abound.


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the Qwen2.5-Omni model to the modal deployment, including necessary dependencies, configuration, and example tasks. The code appears well-structured and includes several example use cases. However, there are a few areas that could be improved for clarity and maintainability.

Summary of Findings

  • Missing Error Handling: The subprocess.run calls in qwen2_5omni.py lack error handling. If these commands fail, the script will continue, potentially leading to incorrect results or unexpected behavior. Consider adding error checking to ensure the commands execute successfully.
  • Hardcoded Paths: The paths HF_MODEL_DIR and ASSETS_DIR are hardcoded in qwen2_5omni.py. It would be better to make these configurable via environment variables to allow for more flexible deployment.
  • Inconsistent Use of use_audio_in_video: The use_audio_in_video parameter is used inconsistently across different function calls in qwen2_5omni.py. Ensure that this parameter is used correctly and consistently to avoid unexpected behavior.

Merge Readiness

The pull request introduces a significant new feature and includes example tasks, which is commendable. However, the missing error handling and hardcoded paths should be addressed before merging. I am unable to directly approve this pull request, and recommend that other reviewers also examine this code before merging. At a minimum, the high severity issues should be addressed before merging.

@weedge (Collaborator, Author) commented Apr 12, 2025

Qwen2.5Omni: 10732.225408 M parameters

Qwen2_5OmniForConditionalGeneration(
  (thinker): Qwen2_5OmniThinkerForConditionalGeneration(
    (audio_tower): Qwen2_5OmniAudioEncoder(
      (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (positional_embedding): SinusoidsPositionEmbedding()
      (audio_bos_eos_token): Embedding(2, 3584)
      (layers): ModuleList(
        (0-31): 32 x Qwen2_5OmniAudioEncoderLayer(
          (self_attn): Qwen2_5OmniAudioFlashAttention2(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
          (final_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        )
      )
      (ln_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (avg_pooler): AvgPool1d(kernel_size=(2,), stride=(2,), padding=(0,))
      (proj): Linear(in_features=1280, out_features=3584, bias=True)
    )
    (visual): Qwen2_5OmniVisionEncoder(
      (patch_embed): Qwen2_5_VisionPatchEmbed(
        (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
      )
      (rotary_pos_emb): Qwen2_5_VisionRotaryEmbedding()
      (blocks): ModuleList(
        (0-31): 32 x Qwen2_5OmniVisionBlock(
          (norm1): Qwen2RMSNorm((1280,), eps=1e-06)
          (norm2): Qwen2RMSNorm((1280,), eps=1e-06)
          (attn): Qwen2_5OmniVisionFlashAttention2(
            (q): Linear(in_features=1280, out_features=1280, bias=True)
            (k): Linear(in_features=1280, out_features=1280, bias=True)
            (v): Linear(in_features=1280, out_features=1280, bias=True)
            (proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (mlp): Qwen2_5OmniMLP(
            (gate_proj): Linear(in_features=1280, out_features=3420, bias=True)
            (up_proj): Linear(in_features=1280, out_features=3420, bias=True)
            (down_proj): Linear(in_features=3420, out_features=1280, bias=True)
            (act_fn): SiLU()
          )
        )
      )
      (merger): Qwen2_5OmniPatchMerger(
        (ln_q): Qwen2RMSNorm((1280,), eps=1e-06)
        (mlp): Sequential(
          (0): Linear(in_features=5120, out_features=5120, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=5120, out_features=3584, bias=True)
        )
      )
    )
    (model): Qwen2_5OmniThinkerTextModel(
      (embed_tokens): Embedding(152064, 3584)
      (layers): ModuleList(
        (0-27): 28 x Qwen2_5OmniDecoderLayer(
          (self_attn): Qwen2_5OmniFlashAttention2(
            (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
            (k_proj): Linear(in_features=3584, out_features=512, bias=True)
            (v_proj): Linear(in_features=3584, out_features=512, bias=True)
            (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
            (rotary_emb): Qwen2_5OmniRotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((3584,), eps=1e-06)
      (rotary_emb): Qwen2_5OmniRotaryEmbedding()
    )
    (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
  )
  (talker): Qwen2_5OmniTalkerForConditionalGeneration(
    (thinker_to_talker_proj): Linear(in_features=3584, out_features=896, bias=True)
    (model): Qwen2_5OmniTalkerModel(
      (embed_tokens): Embedding(8448, 3584)
      (layers): ModuleList(
        (0-23): 24 x Qwen2_5OmniDecoderLayer(
          (self_attn): Qwen2_5OmniFlashAttention2(
            (q_proj): Linear(in_features=896, out_features=1536, bias=True)
            (k_proj): Linear(in_features=896, out_features=512, bias=True)
            (v_proj): Linear(in_features=896, out_features=512, bias=True)
            (o_proj): Linear(in_features=1536, out_features=896, bias=False)
            (rotary_emb): Qwen2_5OmniRotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=896, out_features=18944, bias=False)
            (up_proj): Linear(in_features=896, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=896, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((896,), eps=1e-06)
      (rotary_emb): Qwen2_5OmniRotaryEmbedding()
    )
    (codec_head): Linear(in_features=896, out_features=8448, bias=False)
  )
  (token2wav): Qwen2_5OmniToken2WavModel(
    (code2wav_dit_model): Qwen2_5OmniToken2WavDiTModel(
      (time_embed): DiTTimestepEmbedding(
        (time_embed): SinusPositionEmbedding()
        (time_mlp): ModuleList(
          (0): Linear(in_features=256, out_features=1024, bias=True)
          (1): SiLU()
          (2): Linear(in_features=1024, out_features=1024, bias=True)
        )
      )
      (text_embed): DiTCodecEmbedding(
        (codec_embed): Embedding(8194, 512)
      )
      (input_embed): DiTInputEmbedding(
        (proj): Linear(in_features=912, out_features=1024, bias=True)
        (spk_encoder): ECAPA_TimeDelayNet(
          (blocks): ModuleList(
            (0): TimeDelayNetBlock(
              (conv): Conv1d(80, 256, kernel_size=(5,), stride=(1,), padding=same, padding_mode=reflect)
              (activation): ReLU()
            )
            (1): SqueezeExcitationRes2NetBlock(
              (tdnn1): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (res2net_block): Res2NetBlock(
                (blocks): ModuleList(
                  (0): TimeDelayNetBlock(
                    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=same, dilation=(2,), padding_mode=reflect)
                    (activation): ReLU()
                  )
                )
              )
              (tdnn2): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (se_block): SqueezeExcitationBlock(
                (conv1): Conv1d(256, 64, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (relu): ReLU(inplace=True)
                (conv2): Conv1d(64, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (sigmoid): Sigmoid()
              )
            )
            (2): SqueezeExcitationRes2NetBlock(
              (tdnn1): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (res2net_block): Res2NetBlock(
                (blocks): ModuleList(
                  (0): TimeDelayNetBlock(
                    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=same, dilation=(3,), padding_mode=reflect)
                    (activation): ReLU()
                  )
                )
              )
              (tdnn2): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (se_block): SqueezeExcitationBlock(
                (conv1): Conv1d(256, 64, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (relu): ReLU(inplace=True)
                (conv2): Conv1d(64, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (sigmoid): Sigmoid()
              )
            )
            (3): SqueezeExcitationRes2NetBlock(
              (tdnn1): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (res2net_block): Res2NetBlock(
                (blocks): ModuleList(
                  (0): TimeDelayNetBlock(
                    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=same, dilation=(4,), padding_mode=reflect)
                    (activation): ReLU()
                  )
                )
              )
              (tdnn2): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (se_block): SqueezeExcitationBlock(
                (conv1): Conv1d(256, 64, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (relu): ReLU(inplace=True)
                (conv2): Conv1d(64, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (sigmoid): Sigmoid()
              )
            )
          )
          (mfa): TimeDelayNetBlock(
            (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
            (activation): ReLU()
          )
          (asp): AttentiveStatisticsPooling(
            (tdnn): TimeDelayNetBlock(
              (conv): Conv1d(2304, 64, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
              (activation): ReLU()
            )
            (tanh): Tanh()
            (conv): Conv1d(64, 768, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
          )
          (fc): Conv1d(1536, 128, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
        )
      )
      (rotary_embed): Qwen2_5OmniDiTRotaryEmbedding()
      (transformer_blocks): ModuleList(
        (0-21): 22 x DiTDecoderLayer(
          (attn_norm): Qwen2_5_OmniAdaLayerNormZero(
            (silu): SiLU()
            (linear): Linear(in_features=1024, out_features=6144, bias=True)
            (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=False)
          )
          (attn): DiTAttention(
            (to_q): Linear(in_features=1024, out_features=1024, bias=True)
            (to_k): Linear(in_features=1024, out_features=1024, bias=True)
            (to_v): Linear(in_features=1024, out_features=1024, bias=True)
            (to_out): ModuleList(
              (0): Linear(in_features=1024, out_features=1024, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
          (ff_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=False)
          (ff): DiTMLP(
            (ff): ModuleList(
              (0): Linear(in_features=1024, out_features=2048, bias=True)
              (1): GELU(approximate='tanh')
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=2048, out_features=1024, bias=True)
            )
          )
        )
      )
      (norm_out): Qwen2_5_OmniAdaLayerNormZero_Final(
        (silu): SiLU()
        (linear): Linear(in_features=1024, out_features=2048, bias=True)
        (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=False)
      )
      (proj_out): Linear(in_features=1024, out_features=80, bias=True)
    )
    (code2wav_bigvgan_model): Qwen2_5OmniToken2WavBigVGANModel(
      (conv_pre): Conv1d(80, 1536, kernel_size=(7,), stride=(1,), padding=(3,))
      (ups): ModuleList(
        (0): ModuleList(
          (0): ConvTranspose1d(1536, 768, kernel_size=(11,), stride=(5,), padding=(3,))
        )
        (1): ModuleList(
          (0): ConvTranspose1d(768, 384, kernel_size=(7,), stride=(3,), padding=(2,))
        )
        (2): ModuleList(
          (0): ConvTranspose1d(384, 192, kernel_size=(4,), stride=(2,), padding=(1,))
        )
        (3): ModuleList(
          (0): ConvTranspose1d(192, 96, kernel_size=(4,), stride=(2,), padding=(1,))
        )
        (4): ModuleList(
          (0): ConvTranspose1d(96, 48, kernel_size=(4,), stride=(2,), padding=(1,))
        )
        (5): ModuleList(
          (0): ConvTranspose1d(48, 24, kernel_size=(4,), stride=(2,), padding=(1,))
        )
      )
      (resblocks): ModuleList(
        (0): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(768, 768, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(768, 768, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(768, 768, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(768, 768, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (1): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(768, 768, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(768, 768, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(768, 768, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(768, 768, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (2): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(768, 768, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(768, 768, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(768, 768, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(768, 768, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (3): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(384, 384, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(384, 384, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(384, 384, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(384, 384, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (4): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(384, 384, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(384, 384, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(384, 384, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(384, 384, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (5): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(384, 384, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(384, 384, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(384, 384, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(384, 384, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (6): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(192, 192, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(192, 192, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(192, 192, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(192, 192, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (7): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(192, 192, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(192, 192, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(192, 192, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(192, 192, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (8): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(192, 192, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(192, 192, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(192, 192, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(192, 192, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (9): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(96, 96, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(96, 96, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(96, 96, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(96, 96, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (10): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(96, 96, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(96, 96, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(96, 96, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(96, 96, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (11): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(96, 96, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(96, 96, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(96, 96, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(96, 96, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (12): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(48, 48, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(48, 48, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(48, 48, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(48, 48, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (13): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(48, 48, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(48, 48, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(48, 48, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(48, 48, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (14): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(48, 48, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(48, 48, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(48, 48, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(48, 48, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (15): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(24, 24, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(24, 24, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(24, 24, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(24, 24, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (16): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(24, 24, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(24, 24, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(24, 24, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(24, 24, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (17): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(24, 24, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(24, 24, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(24, 24, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(24, 24, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
      )
      (activation_post): TorchActivation1d(
        (act): SnakeBeta()
        (upsample): UpSample1d()
        (downsample): DownSample1d()
      )
      (conv_post): Conv1d(24, 1, kernel_size=(7,), stride=(1,), padding=(3,), bias=False)
    )
  )
)
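
For reference, a dump like the one above can be reproduced along these lines (the class name is taken from the structure above; the loading details are assumptions):

from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="cpu"
)
print(f"Qwen2.5Omni: {sum(p.numel() for p in model.parameters()) / 1e6} M parameters")
print(model)  # prints the module tree shown above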

@weedge force-pushed the feat/vision_voice branch from 88d844e to ac829c8 on April 12, 2025 07:29
@weedge weedge added AR Flow DiT Omni Omni Modality vocoder modal MLLM multimodal large language models labels Apr 17, 2025
@weedge weedge self-assigned this Apr 18, 2025
weedge added 25 commits April 20, 2025 23:03
Signed-off-by: weedge <[email protected]>
- thinekr_genrate_chunk hidden_states_len for modality embedding

stream cases:
- screen_recording_interaction_stream
- screen_recording_interaction_chunk_stream
- video_information_extracting_stream
- video_information_extracting_chunk_stream
- omni_chatting_for_math_stream
- omni_chatting_for_music_stream
- omni_chatting_for_math_chunk_stream
- omni_chatting_for_music_chunk_stream

Signed-off-by: weedge <[email protected]>
Signed-off-by: weedge <[email protected]>
…vekitQwen2_5OmniVisionVoiceBot and config deploy on modal

Signed-off-by: weedge <[email protected]>
…indow code2wav with achatbot lib

Signed-off-by: weedge <[email protected]>
Signed-off-by: weedge <[email protected]>
Signed-off-by: weedge <[email protected]>