
Conversation

@weedge (Collaborator) commented Apr 12, 2025

feat:

  • add qwen2.5-omni task demo on modal
# run all
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s all

# run streaming cases
## run thinker-only token streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c asr_stream -d L4

## run thinker-only chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c thinker_chunk_stream -d L4

## run thinker talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_stream -d L4

## run text -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_segment_stream -d L40s

## run vision (video with audio) -> text+speech | thinker-chunk talker-token code2wav-chunk streaming
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c screen_recording_interaction_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c video_information_extracting_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_math_chunk_stream -d L40s
curl -s https://raw.githubusercontent.com/ai-bot-pro/achatbot/refs/heads/feat/vision_voice/deploy/modal/src/llm/transformers/run_omni_cases.sh | bash -s -- -s run -m qwen2_5omni -c omni_chatting_for_music_chunk_stream -d L40s



# NOTE: to generate speech, you need to use SPEECH_SYS_PROMPT as the system prompt

# asr (audio understanding)
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task universal_audio_understanding

# audio to text and speech
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task voice_chatting

# vision (video without audio) to text
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task video_information_extracting
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task screen_recording_interaction

# vision (video with audio) to text and speech
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_math
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_music

# vision (video with audio) to text and speech with multi-round chat (needs more GPU memory)
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen2_5omni.py --task multi_round_omni_chatting

# batch requests
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen2_5omni.py --task batch_requests

# stream
# text -> text stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task thinker_chunk_stream 
# image -> text stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task image_stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task image_chunk_stream
# audio -> text stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task asr_stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task asr_chunk_stream
# video -> text stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task screen_recording_interaction_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task video_information_extracting_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task video_information_extracting_chunk_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task screen_recording_interaction_chunk_stream

# text -> text + chunk speech stream
IMAGE_GPU=L4 modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_stream

# text -> chunk text+speech stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_segment_stream

# vision(video with audio) -> text + chunk speech stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_math_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_music_stream

# vision(video with audio) -> chunk text+speech stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_math_chunk_stream
IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task omni_chatting_for_music_chunk_stream

# text/vision/audio -> chunk text+speech stream | use sliding-window code2wav with the achatbot package
ACHATBOT_VERSION=0.0.9.post10 IMAGE_GPU=L40s modal run src/llm/transformers/qwen2_5omni.py --task achatbot_generate
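
For reference, a minimal hedged sketch of what the SPEECH_SYS_PROMPT note above means in plain transformers usage; the class names, generate arguments, and the exact prompt text follow the Hugging Face Qwen2.5-Omni model card and should be checked against the transformers commit pinned by this PR:

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

MODEL_PATH = "Qwen/Qwen2.5-Omni-7B"
# Assumed to match SPEECH_SYS_PROMPT in qwen2_5omni.py; this is the prompt from the official examples.
SPEECH_SYS_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating text and speech."
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SPEECH_SYS_PROMPT}]},
    {"role": "user", "content": [{"type": "text", "text": "Give me a short introduction to large language models."}]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)

# return_audio=True routes thinker hidden states through talker + code2wav to produce a waveform.
text_ids, audio = model.generate(**inputs, return_audio=True, speaker="Chelsie")
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("output.wav", audio.reshape(-1).detach().cpu().float().numpy(), samplerate=24000)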
  • add qwen2.5-omni vllm example (thinker_only, thinker2talker2wav, code2wav) on modal
IMAGE_GPU=L40s modal run src/llm/vllm/qwen2_5omni.py --task thinker_only

IMAGE_GPU=L40s modal run src/llm/vllm/qwen2_5omni.py --task thinker2talker2wav
#IMAGE_GPU=L40s:2 modal run src/llm/vllm/qwen2_5omni.py --task thinker2talker2wav --thinker-gpu-memory-utilization 0.9 --talker-gpu-memory-utilization 0.7

# slow with no torch compile
IMAGE_GPU=T4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav
IMAGE_GPU=L4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav

# fast with torch compile
IMAGE_GPU=L4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav --other-cmd-args "--enable-torch-compile"
IMAGE_GPU=L40s modal run src/llm/vllm/qwen2_5omni.py --task code2wav --other-cmd-args "--enable-torch-compile"
IMAGE_GPU=L4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav --other-cmd-args "--enable-torch-compile --odeint-method euler"
IMAGE_GPU=L4 modal run src/llm/vllm/qwen2_5omni.py --task code2wav --other-cmd-args "--enable-torch-compile --multi-waveforms"
  • add qwen2_code2wav streaming from vllm and adapt it for achatbot (cfm dit + bigvgan); maybe add zmq as a connector (see the sketch below)
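
A rough sketch of what a zmq connector between the talker and code2wav stages could look like; this is not existing achatbot code, just a pyzmq PUSH/PULL outline with an assumed chunk size and message schema:

# Hypothetical sketch only: stream talker codec-token chunks to a separate code2wav worker over ZeroMQ.
import zmq

def fake_talker_codec_tokens():
    # Stand-in for the real talker vq-code token stream.
    yield from range(120)

def fake_code2wav(tokens):
    # Stand-in for the cfm dit + bigvgan stage; returns placeholder waveform bytes.
    return bytes(len(tokens))

def talker_producer(endpoint: str = "tcp://127.0.0.1:5555", chunk_size: int = 50):
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.bind(endpoint)
    buf = []
    for token in fake_talker_codec_tokens():
        buf.append(token)
        if len(buf) >= chunk_size:
            sock.send_pyobj({"codec_tokens": buf, "eos": False})
            buf = []
    sock.send_pyobj({"codec_tokens": buf, "eos": True})

def code2wav_consumer(endpoint: str = "tcp://127.0.0.1:5555"):
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.connect(endpoint)
    while True:
        msg = sock.recv_pyobj()
        yield fake_code2wav(msg["codec_tokens"])
        if msg["eos"]:
            break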

  • add qwen2_5omni_asr and unit test

LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
    THINKER_LLM_GEN_TEMPERATURE=0.9 \
    LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
    python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe_stream

LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
    THINKER_LLM_GEN_TEMPERATURE=0.9 \
    LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
    python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe

LLM_MODEL_NAME_OR_PATH=./models/Qwen/Qwen2.5-Omni-7B \
    THINKER_LLM_GEN_TEMPERATURE=0.9 \
    LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
    python -m unittest test.modules.speech.asr.test_qwen2_5omni_asr.TestQwen2_5OmniASR.test_transcribe_with_bytes

Deploy the Modal fastapi-webrtc serve to run personal bots one by one:

  • vision bot
    run the webrtc vision bot serve (qwen2.5omni vision llm) with webrtc
# webrtc_vision_bot serve on qwen2.5omni vision llm 
IMAGE_NAME=qwen2.5omni IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L4 modal serve -e achatbot src/fastapi_webrtc_vision_bot_serve.py

curl the API to run a chat-room bot with webrtc (daily/livekit/agora); this example uses livekit_room

curl --location 'https://weedge-achatbot--fastapi-webrtc-vision-qwen2-5omni-b-4cb328-dev.modal.run/bot_join/chat-room/LivekitDescribeVisionBot' \
--header 'Content-Type: application/json' \
--data '{
  "chat_bot_name": "LivekitDescribeVisionBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitDescribeVisionBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "asr": "sense_voice",
    "llm": "llm_transformers_manual_qwen2_5omni_vision",
    "tts": "edge"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": { "stop_secs": 0.7 }
    },
    "asr": {
      "tag": "sense_voice_asr",
      "args": {
        "language": "zn",
        "model_name_or_path": "/root/.achatbot/models/FunAudioLLM/SenseVoiceSmall"
      }
    },
    "llm": {
      "tag": "llm_transformers_manual_qwen2_5omni_vision",
      "args": {
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151644, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
      }
    },
    "tts": {
      "tag": "tts_edge",
      "args": {
        "voice_name": "zh-CN-YunjianNeural",
        "language": "zh",
        "gender": "Male"
      }
    }
  },
  "config_list": []
}
'
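
The same bot_join request can also be sent from Python; a small sketch using requests, assuming the JSON payload above is saved to a local file:

import json
import requests

URL = (
    "https://weedge-achatbot--fastapi-webrtc-vision-qwen2-5omni-b-4cb328-dev"
    ".modal.run/bot_join/chat-room/LivekitDescribeVisionBot"
)

# Save the JSON body from the curl above to this file, then post it unchanged.
with open("livekit_describe_vision_bot.json") as f:
    payload = json.load(f)

resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.status_code, resp.text)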
  • vision voice bot

run webrtc_qwen2_5omni_vision_voice_bot serve with webrtc

# webrtc_audio_bot serve on default pip image
# need to create .env.example and add it to Modal Secrets for the webrtc key
IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L40s modal serve -e achatbot src/fastapi_webrtc_qwen2_5omni_vision_voice_bot_serve.py

curl the API to run a chat-room bot with webrtc (livekit_room)

# thinker generates chunk tokens and hidden states -> talker generates vq code tokens -> code2wav generates chunk wav | without use_sliding_window_code2wav
curl --location 'https://weedge-achatbot--fastapi-webrtc-qwen2-5omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitQwen2_5OmniVisionVoiceBot' \
--header 'Content-Type: application/json' \
--data '{
  "chat_bot_name": "LivekitQwen2_5OmniVisionVoiceBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitQwen2_5OmniVisionVoiceBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "omni_llm": "llm_transformers_manual_qwen2_5omni_vision_voice"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": { "stop_secs": 0.7 }
    },
    "omni_llm": {
      "tag": "llm_transformers_manual_qwen2_5omni_vision_voice",
      "args": {
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151644, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 2048,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_skip_thinker_token_ids": [],
        "talker_eos_token_ids": [8292, 8294],
        "code2wav_args": {
          "model_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B",
          "enable_torch_compile": false,
          "enable_torch_compile_first_chunk": false,
          "odeint_method": "euler",
          "odeint_method_relaxed": false,
          "batched_chunk": 3,
          "frequency": "50hz",
          "device": "cuda",
          "num_steps": 10,
          "guidance_scale": 0.5,
          "sway_coefficient": -1.0,
          "code2wav_dynamic_batch": false
        },
        "speaker": "Chelsie",
        "is_use_sliding_window_code2wav": false,
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
      }
    }
  },
  "config_list": []
}
'
# thinker generates chunk tokens and hidden states -> talker generates vq code tokens -> code2wav generates chunk wav | with use_sliding_window_code2wav | no torch.compile
curl --location 'https://weedge-achatbot--fastapi-webrtc-qwen2-5omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitQwen2_5OmniVisionVoiceBot' \
--header 'Content-Type: application/json' \
--data '{
  "chat_bot_name": "LivekitQwen2_5OmniVisionVoiceBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitQwen2_5OmniVisionVoiceBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "omni_llm": "llm_transformers_manual_qwen2_5omni_vision_voice"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": { "stop_secs": 0.7 }
    },
    "omni_llm": {
      "tag": "llm_transformers_manual_qwen2_5omni_vision_voice",
      "args": {
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151644, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 2048,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_skip_thinker_token_ids": [],
        "talker_eos_token_ids": [8292, 8294],
        "code2wav_args": {
          "model_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B",
          "enable_torch_compile": false,
          "enable_torch_compile_first_chunk": false,
          "odeint_method": "euler",
          "odeint_method_relaxed": false,
          "batched_chunk": 3,
          "frequency": "50hz",
          "device": "cuda",
          "num_steps": 10,
          "guidance_scale": 0.5,
          "sway_coefficient": -1.0,
          "code2wav_dynamic_batch": false
        },
        "speaker": "Chelsie",
        "is_use_sliding_window_code2wav": true,
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
      }
    }
  },
  "config_list": []
}
'
  • qwen2.5omni voice bot (speech->text + speech)
curl --location 'https://weedge-achatbot--fastapi-webrtc-qwen2-5omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitQwen2_5OmniVoiceBot' \
--header 'Content-Type: application/json' \
--data '{
  "chat_bot_name": "LivekitQwen2_5OmniVoiceBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitQwen2_5OmniVoiceBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "llm_transformers_manual_qwen2_5omni_audio_voice"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": { "stop_secs": 0.7 }
    },
    "voice_llm": {
      "tag": "llm_transformers_manual_qwen2_5omni_audio_voice",
      "args": {
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151644, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 2048,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_skip_thinker_token_ids": [],
        "talker_eos_token_ids": [8292, 8294],
        "code2wav_args": {
          "model_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B",
          "enable_torch_compile": false,
          "enable_torch_compile_first_chunk": false,
          "odeint_method": "euler",
          "odeint_method_relaxed": false,
          "batched_chunk": 3,
          "frequency": "50hz",
          "device": "cuda",
          "num_steps": 10,
          "guidance_scale": 0.5,
          "sway_coefficient": -1.0,
          "code2wav_dynamic_batch": false
        },
        "speaker": "Chelsie",
        "is_use_sliding_window_code2wav": true,
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
      }
    }
  },
  "config_list": []
}
'
  • asr + qwen2.5omni voice bot (text -> text + speech)
curl --location 'https://weedge-achatbot--fastapi-webrtc-qwen2-5omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitAsrQwen2_5OmniVoiceBot' \
--header 'Content-Type: application/json' \
--data '{
    "chat_bot_name": "LivekitAsrQwen2_5OmniVoiceBot",
    "room_name": "chat-room",
    "room_url": "",
    "token": "",
    "room_manager": {
        "tag": "livekit_room",
        "args": {
            "bot_name": "LivekitAsrQwen2_5OmniVoiceBot",
            "is_common_session": false
        }
    },
    "services": {
        "pipeline": "achatbot",
        "vad": "silero",
        "asr": "sense_voice",
        "voice_llm": "llm_transformers_manual_qwen2_5omni_text_voice"
    },
    "config": {
        "vad": {
            "tag": "silero_vad_analyzer",
            "args": {
                "stop_secs": 0.7
            }
        },
        "asr": {
            "args": {
                "language": "zn",
                "model_name_or_path": "/root/.achatbot/models/FunAudioLLM/SenseVoiceSmall"
            },
            "tag": "sense_voice_asr"
        },
        "voice_llm": {
            "tag": "llm_transformers_manual_qwen2_5omni_text_voice",
            "args": {
                "lm_device": "cuda",
                "lm_torch_dtype": "bfloat16",
                "lm_attn_impl": "flash_attention_2",
                "warmup_steps": 1,
                "chat_history_size": 0,
                "thinker_eos_token_ids": [
                    151644,
                    151645
                ],
                "thinker_args": {
                    "lm_gen_temperature": 0.95,
                    "lm_gen_top_k": 20,
                    "lm_gen_top_p": 0.9,
                    "lm_gen_min_new_tokens": 1,
                    "lm_gen_max_new_tokens": 1024,
                    "lm_gen_max_tokens_per_step": 10,
                    "lm_gen_repetition_penalty": 1.1
                },
                "talker_args": {
                    "lm_gen_temperature": 0.95,
                    "lm_gen_top_k": 20,
                    "lm_gen_top_p": 0.9,
                    "lm_gen_min_new_tokens": 1,
                    "lm_gen_max_new_tokens": 2048,
                    "lm_gen_repetition_penalty": 1.1
                },
                "talker_skip_thinker_token_ids": [],
                "talker_eos_token_ids": [
                    8292,
                    8294
                ],
                "code2wav_args": {
                    "model_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B",
                    "enable_torch_compile": false,
                    "enable_torch_compile_first_chunk": false,
                    "odeint_method": "euler",
                    "odeint_method_relaxed": false,
                    "batched_chunk": 3,
                    "frequency": "50hz",
                    "device": "cuda",
                    "num_steps": 10,
                    "guidance_scale": 0.5,
                    "sway_coefficient": -1.0,
                    "code2wav_dynamic_batch": false
                },
                "speaker": "Chelsie",
                "is_use_sliding_window_code2wav": true,
                "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen2.5-Omni-7B"
            }
        }
    },
    "config_list": []
}'


TMRoPE (Time-aligned Multimodal RoPE):

(figure)

talker -> codec tokens -> cfm dit -> mel -> bigvgan -> streaming waveforms (for the generation source code, see vllm):

(figure)
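
A stub-based sketch of the chunked code2wav flow in this diagram (codec tokens -> mel via the CFM DiT -> waveform via BigVGAN); the chunk/context sizes and helper names are assumptions, not the vllm implementation:

# Hypothetical sketch of chunked code2wav streaming: not the vllm implementation.
# The real pipeline runs the CFM DiT (codec tokens -> mel) and BigVGAN (mel -> wav) per chunk,
# keeping a left context of previous tokens so chunk boundaries stay continuous.
from typing import Iterator, List

CHUNK_TOKENS = 50   # assumed codec tokens per chunk (~1s at 50 Hz)
LEFT_CONTEXT = 25   # assumed context tokens prepended to each chunk

def dit_cfm_to_mel(codec_tokens: List[int]) -> List[float]:
    return [0.0] * (len(codec_tokens) * 2)   # stand-in for the flow-matching DiT

def bigvgan_to_wav(mel: List[float]) -> bytes:
    return bytes(len(mel))                   # stand-in for the BigVGAN vocoder

def code2wav_stream(codec_tokens: Iterator[int]) -> Iterator[bytes]:
    buf: List[int] = []
    context: List[int] = []
    for tok in codec_tokens:
        buf.append(tok)
        if len(buf) == CHUNK_TOKENS:
            mel = dit_cfm_to_mel(context + buf)
            # drop the mel frames that belong to the context before vocoding
            yield bigvgan_to_wav(mel[len(context) * 2:])
            context = buf[-LEFT_CONTEXT:]
            buf = []
    if buf:
        mel = dit_cfm_to_mel(context + buf)
        yield bigvgan_to_wav(mel[len(context) * 2:])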


reference

  • code2wav
  • inference


Tip

  • In the inference serving architecture, the service can be split into text/audio/image/video tokenizer, thinker, talker, and code2wav modules (each module's weights can be loaded directly from the full checkpoint, or a script can split the weights, re-save them, and load them separately). This allows heterogeneous use of different compute and memory resources to improve inference latency and throughput, and makes each module easier to test and maintain independently. (For LLM-style modules such as the thinker and talker, existing parallel inference and inference-optimization strategies can be reused.)
  • The current omni (vision+voice) model freezes existing modules and "welds" them together with multimodal token-alignment training; the thinker and talker are connected via projector hidden states, with the thinker acting as the LLM, so architecturally there is no fundamental change.
  • The qwen2.5omni DiT CFM supported by transformers only supports RungeKutta4ODESolver (rk4).
  • Support in the vllm and sglang inference frameworks is still incomplete; inference optimizations (chunked prefills) will be added for the thinker and talker LMs later.
  • Neither the official vllm fork nor the qwen2.5omni supported by transformers can output text and audio as two simultaneous streams, because the talker needs the hidden states (and embeddings) of the thinker's fully finished output as its prompt input; only a standalone thinker text stream and talker audio stream are supported (code2wav (dit flow + bigvgan) is similar to F5-TTS). For TTS-style tasks, an instruction dataset is needed for fine-tuning (but the execution chain gets longer, similar to minicpmo).
    • minicpmo supports simultaneous text and audio streams: during llm generate it emits at most 3 tokens per step and hands them to the downstream modules to generate the corresponding audio (at most 3 tokens each time); to avoid recomputing the KV cache, prefill + kvcache is required, see: https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/modeling_minicpmo.py#L1232 . In the same way, qwen2.5omni can generate audio from segments of the tokens produced by the thinker llm, combined with a sliding context window; only segmentation logic needs to be added on top of the existing flow (like real-time stream processing in big data, the same idea in essence: the sequence data becomes matrix-shaped data (fundamentally still sequence data), so big-data pipelines can also process matrix data with hardware-parallel matrix computation, e.g. TensorRT). A minimal sketch of this segment-streaming loop follows this list.
    • Evolution: 『 inference no streaming 』 => 『 thinker-only streaming 』 => 『 segment thinker-only streaming 』 => 『 thinker -> (talker + code2wav chunk) streaming 』 => 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 => 『 segment thinker -> Concurrency Batch segment (talker + code2wav chunk) streaming 』 => Concurrency (online) / Batch (offline) 『 segment thinker -> Concurrency Batch segment (talker + code2wav chunk) streaming 』
    • Evolving to 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 reduces TTFT (chunk) latency, but costs some extra GPU memory (cache) to hold context information: trading space for time.
    • The speech segments generated by qwen2.5omni 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 can sound discontinuous. If speech quality matters most and per-segment streaming text output is not required, use the 『 thinker -> (talker + code2wav chunk) streaming 』 mode; in the 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 mode, setting thinker_max_new_tokens = thinker_max_tokens_per_step means speech segments are streamed out only after the whole thinker generation finishes.
    • For cases where 『 segment thinker -> segment (talker + code2wav chunk) streaming 』 produces discontinuous speech segments, tune the thinker_max_tokens_per_step parameter and add sentence-break token_ids to thinker_eos_token_ids so that, as with TTS, each generated speech chunk covers a complete piece of text as much as possible.
    • After the generated text is segmented, the downstream speech generation can run concurrently and in batches to improve throughput; this essentially reuses high-throughput optimization methods from data processing, and batching consumes more storage (RAM or GPU memory).
        # talker's structure of prompt tokens, embeddings and thinker_reply_part:
        #
        #   tokens: [input_tokens] + [codec_pad_token] + [codec_bos_token]
        #   embeddings: [input_embeds] + [text_bos_token] + [thinker_reply_part[0]]
        #   thinker_reply_part: [thinker_reply_part[1:]] + [text_eos_token] + [text_pad_token]
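
A minimal, hedged sketch of the segment-streaming loop described above (the thinker emits at most thinker_max_tokens_per_step tokens per step, segments are cut at sentence-break token ids, and each segment is handed to talker + code2wav); all model calls are stubs, not the achatbot or transformers API:

# Hypothetical sketch of 『 segment thinker -> segment (talker + code2wav chunk) streaming 』.
# All model calls are stubs; the real flow keeps thinker/talker KV caches across steps.
from typing import Iterator, List, Tuple

THINKER_MAX_TOKENS_PER_STEP = 10
SENTENCE_BREAK_IDS = {151644, 151645}  # thinker_eos_token_ids plus punctuation ids (assumed)

def thinker_step(state: dict) -> Tuple[List[int], List[float], bool]:
    # Stub: pretend the thinker emits up to THINKER_MAX_TOKENS_PER_STEP token ids per step.
    state["step"] = state.get("step", 0) + 1
    done = state["step"] >= 3
    tokens = list(range(THINKER_MAX_TOKENS_PER_STEP))
    if state["step"] == 2:
        tokens[-1] = 151644  # pretend a sentence break lands here
    if done:
        tokens[-1] = 151645  # eos
    return tokens, [0.0] * len(tokens), done

def talker_generate(seg_tokens: List[int], seg_hidden: List[float]) -> List[int]:
    # Stub for the talker: thinker text segment (tokens + hidden states) -> vq codec tokens.
    return [0] * (len(seg_tokens) * 2)

def code2wav_chunks(codec_tokens: List[int]) -> Iterator[bytes]:
    # Stub for code2wav (cfm dit + bigvgan): codec tokens -> waveform chunks.
    yield bytes(len(codec_tokens))

def omni_segment_stream(state: dict) -> Iterator[bytes]:
    seg_tokens: List[int] = []
    seg_hidden: List[float] = []
    done = False
    while not done:
        tokens, hidden, done = thinker_step(state)
        seg_tokens += tokens
        seg_hidden += hidden
        # flush a segment at a sentence break or when the thinker finishes
        if done or (tokens and tokens[-1] in SENTENCE_BREAK_IDS):
            for wav_chunk in code2wav_chunks(talker_generate(seg_tokens, seg_hidden)):
                yield wav_chunk
            seg_tokens, seg_hidden = [], []

for chunk in omni_segment_stream({}):
    print(len(chunk), "bytes of audio")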

Note

Using vllm inference:

  • thinker LM: model weights take 16.73 GiB; non_torch_memory takes 0.09 GiB; PyTorch activation peak memory takes 5.48 GiB; the rest of the memory is reserved for the KV cache, so the total memory reserved for the model is 22.3 GiB. Therefore thinker-gpu-memory-utilization * total_gpu_memory must be > 22.3 GiB.
  • talker LM: model weights take 2.55 GiB; non_torch_memory takes 0.08 GiB; PyTorch activation peak memory takes 4.36 GiB; the rest of the memory is reserved for the KV cache, so the total memory reserved for the model is 6.9 GiB. Therefore talker-gpu-memory-utilization * total_gpu_memory must be > 6.9 GiB.
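
A quick sanity check of these thresholds (the per-GPU memory figures below are rough assumptions, not measured values):

# Minimum --*-gpu-memory-utilization so that utilization * total_gpu_memory exceeds the
# reserved sizes quoted above; the usable GiB per GPU are approximate assumptions.
RESERVED_GIB = {"thinker": 22.3, "talker": 6.9}
GPU_GIB = {"L4": 22.5, "L40S": 44.5, "A100-80GB": 79.2}

for gpu, total in GPU_GIB.items():
    print(gpu, {name: round(need / total, 2) for name, need in RESERVED_GIB.items()})
# e.g. on one L40S the thinker needs utilization > ~0.50 and the talker > ~0.16,
# while a single L4 (~22.5 GiB usable) is borderline for the thinker alone.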

@gemini-code-assist (bot) left a comment
Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a new feature to run Qwen2.5-Omni, a multimodal model, on Modal. It includes the necessary code to set up the environment, load the model, and run inference on various tasks such as audio understanding, voice chatting, and video information extraction. The pull request also adds a script to automate the process of downloading models and assets, as well as running the different test cases.

Highlights

  • Qwen2.5-Omni Integration: Adds support for running Qwen2.5-Omni, a multimodal model, on the Modal platform.
  • Task Demos: Includes demos for various tasks such as universal audio understanding, voice chatting, video information extraction, screen recording interaction, and omni-chatting for math and music.
  • Automated Script: Introduces a shell script to automate downloading models/assets and running test cases with different configurations (GPU, task, etc.).

Changelog

Click here to see the changelog
  • .gitignore
    • Added *.mp4 to the ignored files list on line 165.
    • Ensured *.xml is included in the git repository on line 210.
  • deploy/modal/src/llm/transformers/qwen2_5omni.py
    • Introduces a new Modal app for Qwen2.5-Omni.
    • Sets up the environment with necessary dependencies (transformers, torch, flash-attn, etc.).
    • Defines functions for various multimodal tasks, including audio understanding, voice chatting, and video information extraction.
    • Implements an inference function to process and generate responses based on different input types (audio, images, videos).
    • Adds a main function to run the different tasks based on user input.
  • deploy/modal/src/llm/transformers/run_omni_cases.sh
    • Introduces a shell script to automate the process of downloading models and assets.
    • Provides command-line arguments to configure the GPU, task, model type, and transformers commit.
    • Includes functions to run different test cases with specified configurations.
    • Adds a usage function to display help information.


A model of sight and sound,
Qwen's Omni, profound.
Modal's cloud takes flight,
Processing day and night,
New AI wonders abound.


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the Qwen2.5-Omni model to the modal deployment, including necessary dependencies, configuration, and example tasks. The code appears well-structured and includes several example use cases. However, there are a few areas that could be improved for clarity and maintainability.

Summary of Findings

  • Missing Error Handling: The subprocess.run calls in qwen2_5omni.py lack error handling. If these commands fail, the script will continue, potentially leading to incorrect results or unexpected behavior. Consider adding error checking to ensure the commands execute successfully.
  • Hardcoded Paths: The paths HF_MODEL_DIR and ASSETS_DIR are hardcoded in qwen2_5omni.py. It would be better to make these configurable via environment variables to allow for more flexible deployment.
  • Inconsistent Use of use_audio_in_video: The use_audio_in_video parameter is used inconsistently across different function calls in qwen2_5omni.py. Ensure that this parameter is used correctly and consistently to avoid unexpected behavior.

Merge Readiness

The pull request introduces a significant new feature and includes example tasks, which is commendable. However, the missing error handling and hardcoded paths should be addressed before merging. I am unable to directly approve this pull request, and recommend that other reviewers also examine this code before merging. At a minimum, the high severity issues should be addressed before merging.

@weedge (Collaborator, Author) commented Apr 12, 2025

Qwen2.5Omni: 10732.225408 M parameters

Qwen2_5OmniForConditionalGeneration(
  (thinker): Qwen2_5OmniThinkerForConditionalGeneration(
    (audio_tower): Qwen2_5OmniAudioEncoder(
      (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (positional_embedding): SinusoidsPositionEmbedding()
      (audio_bos_eos_token): Embedding(2, 3584)
      (layers): ModuleList(
        (0-31): 32 x Qwen2_5OmniAudioEncoderLayer(
          (self_attn): Qwen2_5OmniAudioFlashAttention2(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
          (final_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        )
      )
      (ln_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (avg_pooler): AvgPool1d(kernel_size=(2,), stride=(2,), padding=(0,))
      (proj): Linear(in_features=1280, out_features=3584, bias=True)
    )
    (visual): Qwen2_5OmniVisionEncoder(
      (patch_embed): Qwen2_5_VisionPatchEmbed(
        (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
      )
      (rotary_pos_emb): Qwen2_5_VisionRotaryEmbedding()
      (blocks): ModuleList(
        (0-31): 32 x Qwen2_5OmniVisionBlock(
          (norm1): Qwen2RMSNorm((1280,), eps=1e-06)
          (norm2): Qwen2RMSNorm((1280,), eps=1e-06)
          (attn): Qwen2_5OmniVisionFlashAttention2(
            (q): Linear(in_features=1280, out_features=1280, bias=True)
            (k): Linear(in_features=1280, out_features=1280, bias=True)
            (v): Linear(in_features=1280, out_features=1280, bias=True)
            (proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (mlp): Qwen2_5OmniMLP(
            (gate_proj): Linear(in_features=1280, out_features=3420, bias=True)
            (up_proj): Linear(in_features=1280, out_features=3420, bias=True)
            (down_proj): Linear(in_features=3420, out_features=1280, bias=True)
            (act_fn): SiLU()
          )
        )
      )
      (merger): Qwen2_5OmniPatchMerger(
        (ln_q): Qwen2RMSNorm((1280,), eps=1e-06)
        (mlp): Sequential(
          (0): Linear(in_features=5120, out_features=5120, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=5120, out_features=3584, bias=True)
        )
      )
    )
    (model): Qwen2_5OmniThinkerTextModel(
      (embed_tokens): Embedding(152064, 3584)
      (layers): ModuleList(
        (0-27): 28 x Qwen2_5OmniDecoderLayer(
          (self_attn): Qwen2_5OmniFlashAttention2(
            (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
            (k_proj): Linear(in_features=3584, out_features=512, bias=True)
            (v_proj): Linear(in_features=3584, out_features=512, bias=True)
            (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
            (rotary_emb): Qwen2_5OmniRotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((3584,), eps=1e-06)
      (rotary_emb): Qwen2_5OmniRotaryEmbedding()
    )
    (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
  )
  (talker): Qwen2_5OmniTalkerForConditionalGeneration(
    (thinker_to_talker_proj): Linear(in_features=3584, out_features=896, bias=True)
    (model): Qwen2_5OmniTalkerModel(
      (embed_tokens): Embedding(8448, 3584)
      (layers): ModuleList(
        (0-23): 24 x Qwen2_5OmniDecoderLayer(
          (self_attn): Qwen2_5OmniFlashAttention2(
            (q_proj): Linear(in_features=896, out_features=1536, bias=True)
            (k_proj): Linear(in_features=896, out_features=512, bias=True)
            (v_proj): Linear(in_features=896, out_features=512, bias=True)
            (o_proj): Linear(in_features=1536, out_features=896, bias=False)
            (rotary_emb): Qwen2_5OmniRotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=896, out_features=18944, bias=False)
            (up_proj): Linear(in_features=896, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=896, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((896,), eps=1e-06)
      (rotary_emb): Qwen2_5OmniRotaryEmbedding()
    )
    (codec_head): Linear(in_features=896, out_features=8448, bias=False)
  )
  (token2wav): Qwen2_5OmniToken2WavModel(
    (code2wav_dit_model): Qwen2_5OmniToken2WavDiTModel(
      (time_embed): DiTTimestepEmbedding(
        (time_embed): SinusPositionEmbedding()
        (time_mlp): ModuleList(
          (0): Linear(in_features=256, out_features=1024, bias=True)
          (1): SiLU()
          (2): Linear(in_features=1024, out_features=1024, bias=True)
        )
      )
      (text_embed): DiTCodecEmbedding(
        (codec_embed): Embedding(8194, 512)
      )
      (input_embed): DiTInputEmbedding(
        (proj): Linear(in_features=912, out_features=1024, bias=True)
        (spk_encoder): ECAPA_TimeDelayNet(
          (blocks): ModuleList(
            (0): TimeDelayNetBlock(
              (conv): Conv1d(80, 256, kernel_size=(5,), stride=(1,), padding=same, padding_mode=reflect)
              (activation): ReLU()
            )
            (1): SqueezeExcitationRes2NetBlock(
              (tdnn1): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (res2net_block): Res2NetBlock(
                (blocks): ModuleList(
                  (0): TimeDelayNetBlock(
                    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=same, dilation=(2,), padding_mode=reflect)
                    (activation): ReLU()
                  )
                )
              )
              (tdnn2): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (se_block): SqueezeExcitationBlock(
                (conv1): Conv1d(256, 64, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (relu): ReLU(inplace=True)
                (conv2): Conv1d(64, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (sigmoid): Sigmoid()
              )
            )
            (2): SqueezeExcitationRes2NetBlock(
              (tdnn1): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (res2net_block): Res2NetBlock(
                (blocks): ModuleList(
                  (0): TimeDelayNetBlock(
                    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=same, dilation=(3,), padding_mode=reflect)
                    (activation): ReLU()
                  )
                )
              )
              (tdnn2): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (se_block): SqueezeExcitationBlock(
                (conv1): Conv1d(256, 64, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (relu): ReLU(inplace=True)
                (conv2): Conv1d(64, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (sigmoid): Sigmoid()
              )
            )
            (3): SqueezeExcitationRes2NetBlock(
              (tdnn1): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (res2net_block): Res2NetBlock(
                (blocks): ModuleList(
                  (0): TimeDelayNetBlock(
                    (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=same, dilation=(4,), padding_mode=reflect)
                    (activation): ReLU()
                  )
                )
              )
              (tdnn2): TimeDelayNetBlock(
                (conv): Conv1d(256, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (activation): ReLU()
              )
              (se_block): SqueezeExcitationBlock(
                (conv1): Conv1d(256, 64, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (relu): ReLU(inplace=True)
                (conv2): Conv1d(64, 256, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
                (sigmoid): Sigmoid()
              )
            )
          )
          (mfa): TimeDelayNetBlock(
            (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
            (activation): ReLU()
          )
          (asp): AttentiveStatisticsPooling(
            (tdnn): TimeDelayNetBlock(
              (conv): Conv1d(2304, 64, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
              (activation): ReLU()
            )
            (tanh): Tanh()
            (conv): Conv1d(64, 768, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
          )
          (fc): Conv1d(1536, 128, kernel_size=(1,), stride=(1,), padding=same, padding_mode=reflect)
        )
      )
      (rotary_embed): Qwen2_5OmniDiTRotaryEmbedding()
      (transformer_blocks): ModuleList(
        (0-21): 22 x DiTDecoderLayer(
          (attn_norm): Qwen2_5_OmniAdaLayerNormZero(
            (silu): SiLU()
            (linear): Linear(in_features=1024, out_features=6144, bias=True)
            (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=False)
          )
          (attn): DiTAttention(
            (to_q): Linear(in_features=1024, out_features=1024, bias=True)
            (to_k): Linear(in_features=1024, out_features=1024, bias=True)
            (to_v): Linear(in_features=1024, out_features=1024, bias=True)
            (to_out): ModuleList(
              (0): Linear(in_features=1024, out_features=1024, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
          (ff_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=False)
          (ff): DiTMLP(
            (ff): ModuleList(
              (0): Linear(in_features=1024, out_features=2048, bias=True)
              (1): GELU(approximate='tanh')
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=2048, out_features=1024, bias=True)
            )
          )
        )
      )
      (norm_out): Qwen2_5_OmniAdaLayerNormZero_Final(
        (silu): SiLU()
        (linear): Linear(in_features=1024, out_features=2048, bias=True)
        (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=False)
      )
      (proj_out): Linear(in_features=1024, out_features=80, bias=True)
    )
    (code2wav_bigvgan_model): Qwen2_5OmniToken2WavBigVGANModel(
      (conv_pre): Conv1d(80, 1536, kernel_size=(7,), stride=(1,), padding=(3,))
      (ups): ModuleList(
        (0): ModuleList(
          (0): ConvTranspose1d(1536, 768, kernel_size=(11,), stride=(5,), padding=(3,))
        )
        (1): ModuleList(
          (0): ConvTranspose1d(768, 384, kernel_size=(7,), stride=(3,), padding=(2,))
        )
        (2): ModuleList(
          (0): ConvTranspose1d(384, 192, kernel_size=(4,), stride=(2,), padding=(1,))
        )
        (3): ModuleList(
          (0): ConvTranspose1d(192, 96, kernel_size=(4,), stride=(2,), padding=(1,))
        )
        (4): ModuleList(
          (0): ConvTranspose1d(96, 48, kernel_size=(4,), stride=(2,), padding=(1,))
        )
        (5): ModuleList(
          (0): ConvTranspose1d(48, 24, kernel_size=(4,), stride=(2,), padding=(1,))
        )
      )
      (resblocks): ModuleList(
        (0): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(768, 768, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(768, 768, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(768, 768, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(768, 768, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (1): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(768, 768, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(768, 768, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(768, 768, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(768, 768, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (2): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(768, 768, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(768, 768, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(768, 768, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(768, 768, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (3): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(384, 384, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(384, 384, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(384, 384, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(384, 384, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (4): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(384, 384, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(384, 384, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(384, 384, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(384, 384, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (5): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(384, 384, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(384, 384, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(384, 384, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(384, 384, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (6): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(192, 192, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(192, 192, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(192, 192, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(192, 192, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (7): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(192, 192, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(192, 192, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(192, 192, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(192, 192, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (8): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(192, 192, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(192, 192, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(192, 192, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(192, 192, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (9): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(96, 96, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(96, 96, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(96, 96, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(96, 96, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (10): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(96, 96, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(96, 96, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(96, 96, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(96, 96, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (11): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(96, 96, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(96, 96, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(96, 96, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(96, 96, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (12): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(48, 48, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(48, 48, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(48, 48, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(48, 48, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (13): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(48, 48, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(48, 48, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(48, 48, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(48, 48, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (14): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(48, 48, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(48, 48, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(48, 48, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(48, 48, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (15): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(24, 24, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): Conv1d(24, 24, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,))
            (2): Conv1d(24, 24, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(24, 24, kernel_size=(3,), stride=(1,), padding=(1,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (16): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(24, 24, kernel_size=(7,), stride=(1,), padding=(3,))
            (1): Conv1d(24, 24, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,))
            (2): Conv1d(24, 24, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(24, 24, kernel_size=(7,), stride=(1,), padding=(3,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
        (17): AMPBlock(
          (convs1): ModuleList(
            (0): Conv1d(24, 24, kernel_size=(11,), stride=(1,), padding=(5,))
            (1): Conv1d(24, 24, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,))
            (2): Conv1d(24, 24, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,))
          )
          (convs2): ModuleList(
            (0-2): 3 x Conv1d(24, 24, kernel_size=(11,), stride=(1,), padding=(5,))
          )
          (activations): ModuleList(
            (0-5): 6 x TorchActivation1d(
              (act): SnakeBeta()
              (upsample): UpSample1d()
              (downsample): DownSample1d()
            )
          )
        )
      )
      (activation_post): TorchActivation1d(
        (act): SnakeBeta()
        (upsample): UpSample1d()
        (downsample): DownSample1d()
      )
      (conv_post): Conv1d(24, 1, kernel_size=(7,), stride=(1,), padding=(3,), bias=False)
    )
  )
)
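
For reference, a dump like the one above can be reproduced along these lines (the class name is taken from the structure above; the loading details are assumptions):

from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="cpu"
)
print(f"Qwen2.5Omni: {sum(p.numel() for p in model.parameters()) / 1e6} M parameters")
print(model)  # prints the module tree shown above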

@weedge force-pushed the feat/vision_voice branch from 88d844e to ac829c8 on April 12, 2025 07:29
@weedge weedge added AR Flow DiT Omni Omni Modality vocoder modal MLLM multimodal large language models labels Apr 17, 2025
@weedge weedge self-assigned this Apr 18, 2025
weedge added 25 commits April 20, 2025 23:03
Signed-off-by: weedge <[email protected]>
- thinekr_genrate_chunk hidden_states_len for modality embedding

stream cases:
- screen_recording_interaction_stream
- screen_recording_interaction_chunk_stream
- video_information_extracting_stream
- video_information_extracting_chunk_stream
- omni_chatting_for_math_stream
- omni_chatting_for_music_stream
- omni_chatting_for_math_chunk_stream
- omni_chatting_for_music_chunk_stream

Signed-off-by: weedge <[email protected]>
Signed-off-by: weedge <[email protected]>
…vekitQwen2_5OmniVisionVoiceBot and config deploy on modal

Signed-off-by: weedge <[email protected]>
…indow code2wav with achatbot lib

Signed-off-by: weedge <[email protected]>
Signed-off-by: weedge <[email protected]>
Signed-off-by: weedge <[email protected]>