Replies: 10 comments 14 replies
-
I haven't tried it myself, but perhaps you could run two separate Whisper passes over the same audio, one forced to French and one forced to English. Afterwards, run both transcripts through some grammar or language checker that discards the segments decoded in the wrong language, then merge the two text files. Alternatively, there must be a Python library out there that detects voice signatures. It sounds like a lot of work, though.
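For anyone who wants to try that two-pass idea, here is a minimal sketch, assuming the open-source openai-whisper package; the model size and the file name mixed_audio.mp3 are placeholders:

```python
import whisper

model = whisper.load_model("large")

# One decoding pass per expected language, with the language forced so
# Whisper never has to guess it.
result_fr = model.transcribe("mixed_audio.mp3", language="fr", task="transcribe")
result_en = model.transcribe("mixed_audio.mp3", language="en", task="transcribe")

# Each pass returns timestamped segments; the French pass produces garbage
# for the English parts and vice versa, so a language or grammar check is
# still needed to decide, per time window, which pass to keep when merging.
for seg in result_fr["segments"]:
    print(f"FR [{seg['start']:.1f}-{seg['end']:.1f}] {seg['text'].strip()}")
for seg in result_en["segments"]:
    print(f"EN [{seg['start']:.1f}-{seg['end']:.1f}] {seg['text'].strip()}")
```

The hard part is still the merging step, since the two passes will not produce identically aligned segments.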
-
Hello, I have the same problem. I only want it to transcribe the different languages in the same file, with no translation. Did you find out how to do it already, other than cutting the audio into small parts, transcribing them, and then combining them again? That way is too troublesome. Thank you.
-
I don't think it translates to French - it probably mistakenly transcribes the English as if it were French, so it just comes up with nonsensical sentences.
-
I also faced this issue, and it seems that Whisper really does transcribe and THEN translate the output.
-
Any news or ETA on this? I can confirm that the transcription endpoint (using OpenAI's speech-to-text API) often translates instead of merely transcribing. It happens maybe 20 to 50% of the time in my experience, so it really shouldn't be hard to reproduce, but I can provide sample audio if needed. It's annoying because it makes the behavior of the API seem random. I only want transcriptions, not translations (if I wanted translations I'd use the translation endpoint) 😕.
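One thing worth trying (just a sketch, assuming the current openai Python package; the file name and language code are placeholders): the transcription endpoint accepts an optional language parameter, which at least takes auto-detection out of the equation when you already know the input language.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "meeting.mp3" is a placeholder; language is an ISO-639-1 code that pins
# the input language instead of leaving it to auto-detection.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="fr",
    )

print(transcript.text)
```

It obviously doesn't solve the case where a single file genuinely mixes languages.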
-
You can cut your audio into one-minute slices and then transcribe each slice separately. I tried this method and got the results I wanted.
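A rough sketch of that slicing approach, assuming pydub and the open-source openai-whisper package; the chunk length, model size and file name are just examples:

```python
import whisper
from pydub import AudioSegment

model = whisper.load_model("medium")
audio = AudioSegment.from_file("mixed_audio.mp3")

chunk_ms = 60_000  # one-minute slices, as suggested above
texts = []
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk_path = f"chunk_{i}.mp3"
    audio[start:start + chunk_ms].export(chunk_path, format="mp3")
    # Language detection runs again on every slice, so each slice comes out
    # in whatever language is actually spoken in it.
    result = model.transcribe(chunk_path, task="transcribe")
    texts.append(result["text"].strip())

print("\n".join(texts))
```

The obvious drawback is that a slice boundary can fall in the middle of a sentence, so overlapping slices or silence-based splitting may work better than fixed one-minute cuts.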
-
I had the same challenge: a multilingual audio file with speakers using different languages. I tried the suggestion about the prompt, but I found it unreliable and not always working. Instead, what I did is segment the input audio file by speaker. In my case the assumption was that each speaker uses one single language, so by segmenting the input audio file by speaker, Whisper runs language detection for each of those segments; then you just concatenate the results and that's it. To obtain the speaker segmentation you can use several toolkits for speaker diarization; I decided to go with PyAnnote: https://github.com/pyannote/pyannote-audio So basically you first run speaker diarization and then transcribe each speaker segment with Whisper.
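A minimal sketch of that diarize-then-transcribe pipeline, assuming pyannote.audio 3.x, pydub and the open-source openai-whisper package; the Hugging Face token and the file names are placeholders:

```python
import whisper
from pydub import AudioSegment
from pyannote.audio import Pipeline

# Diarize first, then transcribe each speaker turn separately, so that
# Whisper's language detection runs per turn instead of once per file.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
)
asr = whisper.load_model("large")
audio = AudioSegment.from_file("meeting.wav")

diarization = diarizer("meeting.wav")
for i, (turn, _, speaker) in enumerate(diarization.itertracks(yield_label=True)):
    segment_path = f"segment_{i}.wav"
    # pydub slices in milliseconds, pyannote reports seconds.
    audio[int(turn.start * 1000):int(turn.end * 1000)].export(segment_path, format="wav")
    result = asr.transcribe(segment_path, task="transcribe")
    print(f"{speaker} [{turn.start:.1f}-{turn.end:.1f}] {result['text'].strip()}")
```

Very short turns (a second or two) can still be misdetected, so it can help to merge consecutive turns from the same speaker before transcribing.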
-
Hi, I am still facing this issue with a multilingual audio file; it's not working. Please share any workarounds. My code:

```python
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)
```
-
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)
```
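With the pipeline built like that, a call along these lines (just a sketch; the file name is a placeholder) at least keeps the task pinned to transcription:

```python
# chunk_length_s processes the file in 30-second windows, and forcing
# task="transcribe" prevents the translate task from being picked - but the
# output language still follows whatever language Whisper detects (or is told).
result = pipe(
    "audio.mp3",
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={"task": "transcribe"},
)

print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```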
-
Same problem here. I can't solve this using speaker diarization, since a single speaker can speak multiple languages. Any ideas on how to solve this?
-
Hello,
I have a small issue with the transcription:
I have an audio file where 2 people are speaking 2 different languages - French and English - and I would like to get the raw transcription of this audio in the spoken languages.
However, when I run the model, it automatically translates the whole audio into French (French being the detected language, as the audio starts in French).
Is there a way to remove this "translation" feature?
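For reference, this is roughly the kind of call I mean (a sketch with the open-source whisper package; the model size and file name are placeholders):

```python
import whisper

model = whisper.load_model("large")
result = model.transcribe(
    "interview.mp3",
    task="transcribe",  # "translate" would translate everything into English instead
    # language="fr",    # forcing a language skips auto-detection for the whole file
)
print(result["text"])
```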
Thx