Hi, thanks a lot for sharing your model and code!
I'd like to ask whether SpeechGPT can understand mixed text + audio input. If it can, how should the user prompt be written?
Background for these questions: Large Audio LM benchmarks like AIR-Bench require mixed input processing; instructions are given as text and passed to the model together with an audio signal that is to be analysed or classified. Since SpeechGPT made it onto the Chat Leaderboard, I assume that such mixed input processing is possible in principle.
Using your interface cli_infer.py, I prompted your model as follows: based on the code of your preprocess method, I separated the text instruction and the audio file name with the string "is input:", like so:
TEXT INSTRUCTION is input: PATH_TO_AUDIOFILE.wav
This was not successful: all model answers were unrelated to both the instruction and the audio content.
Could you let me know, or point me to documentation on, whether there is another way to provide such mixed input? Thanks a lot!
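For clarity, the prompt format I tried can be sketched as below. This is only an illustration of the string I typed into cli_infer.py: build_mixed_prompt is a hypothetical helper, not part of the SpeechGPT code, and using "is input:" as a separator is my own reading of the preprocess method, which may well be wrong.

```python
def build_mixed_prompt(instruction: str, audio_path: str,
                       separator: str = "is input:") -> str:
    """Join a text instruction and an audio file path into one prompt line.

    Hypothetical helper: the separator is an assumption based on my reading
    of the preprocess method, not a documented SpeechGPT prompt format.
    """
    return f"{instruction.strip()} {separator} {audio_path.strip()}"


# Placeholder values, exactly as in the example line of this issue.
prompt = build_mixed_prompt("TEXT INSTRUCTION", "PATH_TO_AUDIOFILE.wav")
print(prompt)
```

With this input the model's answers were unrelated to both parts of the prompt, so I suspect the whole line is being interpreted differently from what I intended.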