
Whisper - list index out of range with word level timestamps #31683

Open

maxkvbn opened this issue Jun 28, 2024 · 1 comment
maxkvbn commented Jun 28, 2024

System Info

  • transformers version: 4.42.2
  • Platform: Windows-10-10.0.22621-SP0
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA GeForce RTX 4070 Laptop GPU

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Load the 'whisper-large-v3' AutoModelForSpeechSeq2Seq model and move it to the GPU.
  2. Set up the pipeline with return_timestamps="word", among other settings.
  3. Run the audio through the pipe, which raises the error below (a minimal sketch of the setup follows this list).
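A minimal sketch of my setup: the audio path and generate_kwargs are exactly what my notebook uses; the model id, dtype, and pipeline construction below are the usual ones and are assumed rather than copied verbatim from the notebook.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Hub id assumed to be the full "openai/whisper-large-v3" path.
model_id = "openai/whisper-large-v3"
device = "cuda"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

transcribing_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    device=device,
)

# This is the call that raises; the traceback below comes from this cell.
asr_out = transcribing_pipe(
    '../SampleData/Saba_interview_short.wav',
    return_timestamps="word",
    generate_kwargs={"language": "danish"},
)
```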
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 asr_out = transcribing_pipe(
      2     '../SampleData/Saba_interview_short.wav',
      3     return_timestamps="word",
      4     generate_kwargs={"language": "danish"}
      5     )
      7 asr_out

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py:284, in AutomaticSpeechRecognitionPipeline.__call__(self, inputs, **kwargs)
    221 def __call__(
    222     self,
    223     inputs: Union[np.ndarray, bytes, str],
    224     **kwargs,
    225 ):
    226     """
    227     Transcribe the audio sequence(s) given as inputs to text. See the [`AutomaticSpeechRecognitionPipeline`]
    228     documentation for more information.
   (...)
    282                 `"".join(chunk["text"] for chunk in output["chunks"])`.
    283     """
--> 284     return super().__call__(inputs, **kwargs)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\base.py:1246, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1244     return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1245 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1246     return next(
   1247         iter(
   1248             self.get_iterator(
   1249                 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1250             )
   1251         )
   1252     )
   1253 else:
   1254     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\pt_utils.py:125, in PipelineIterator.__next__(self)
    123 # We're out of items within a batch
    124 item = next(self.iterator)
--> 125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".
    127 if self.loader_batch_size is not None:
    128     # Try to infer the size of the batch

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py:587, in AutomaticSpeechRecognitionPipeline.postprocess(self, model_outputs, decoder_kwargs, return_timestamps, return_language)
    584             stride_right /= sampling_rate
    585             output["stride"] = chunk_len, stride_left, stride_right
--> 587     text, optional = self.tokenizer._decode_asr(
    588         model_outputs,
    589         return_timestamps=return_timestamps,
    590         return_language=return_language,
    591         time_precision=time_precision,
    592     )
    593 else:
    594     items = np.concatenate(final_items, axis=1)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\models\whisper\tokenization_whisper.py:832, in WhisperTokenizer._decode_asr(self, model_outputs, return_timestamps, return_language, time_precision)
    831 def _decode_asr(self, model_outputs, *, return_timestamps, return_language, time_precision):
--> 832     return _decode_asr(
    833         self,
    834         model_outputs,
    835         return_timestamps=return_timestamps,
    836         return_language=return_language,
    837         time_precision=time_precision,
    838     )

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\models\whisper\tokenization_whisper.py:1032, in _decode_asr(tokenizer, model_outputs, return_timestamps, return_language, time_precision)
   1030 current_tokens.append(token)
   1031 if return_timestamps == "word":
-> 1032     start_time = round(token_timestamps[i] + time_offset, 2)
   1033     if i + 1 < len(token_timestamps):
   1034         end_time = round(token_timestamps[i + 1] + time_offset, 2)

IndexError: list index out of range

I've uploaded the audio that I am trying to process here.

I have opened a discussion on Whisper's Hub page, where I have implemented a fix that seems to work: here. The idea behind it is sketched below.

Colab link: here
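From the traceback, _decode_asr indexes token_timestamps[i] with a token index i that can exceed len(token_timestamps), i.e. for this audio fewer word timestamps are produced than tokens. The fix I describe in the discussion amounts to bounds-checking that lookup; the helper below is only a self-contained illustration of the idea (pair_words_with_timestamps is a hypothetical name, not the actual patch):

```python
# Illustrative only: guard the timestamp lookup instead of assuming
# token_timestamps is at least as long as the token list.
def pair_words_with_timestamps(tokens, token_timestamps, time_offset=0.0):
    chunks = []
    for i, token in enumerate(tokens):
        if i >= len(token_timestamps):
            # Fewer timestamps than tokens: stop instead of raising IndexError.
            break
        start_time = round(token_timestamps[i] + time_offset, 2)
        if i + 1 < len(token_timestamps):
            end_time = round(token_timestamps[i + 1] + time_offset, 2)
        else:
            # No following timestamp for the last token: reuse the start.
            end_time = start_time
        chunks.append({"text": token, "timestamp": (start_time, end_time)})
    return chunks
```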

Expected behavior

  1. Load the 'whisper-large-v3' AutoModelForSpeechSeq2Seq model and move it to the GPU.
  2. Set up the pipeline with return_timestamps="word", among other settings.
  3. Run the audio through the pipe, which returns the transcription together with the word-level timestamp chunks (expected output shape sketched below).
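For reference, a successful run with return_timestamps="word" returns a dict of this shape (the words and times below are made-up placeholders, not output from my file):

```python
{
    "text": " Hej og velkommen ...",
    "chunks": [
        {"text": " Hej", "timestamp": (0.0, 0.42)},
        {"text": " og", "timestamp": (0.42, 0.58)},
        {"text": " velkommen", "timestamp": (0.58, 1.1)},
        # ... one entry per word
    ],
}
```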
@amyeroberts (Collaborator) commented:

cc @sanchit-gandhi @kamilakesbi
