
Bug when transcribing audio with Whisper #891

Closed
SGombert opened this issue Nov 15, 2022 · 15 comments

Labels
bug Something isn't working

Comments

@SGombert

Hi there,

I am currently working on a project where we first transcribe audio recordings with Whisper and then force align them with Wav2Vec.

When I pass the cuts generated by annotate_with_whisper into align_with_torchaudio, I get the following error:

AssertionError: We don't support forced alignment of cuts with overlapping supervisions (cut ID: 'A011221._001_TR6')

When I try to trim the supervisions, I run into another issue where suddenly supervisions with a negative duration appear.

So far, I could not find a fix myself. Do you have any suggestions?

@pzelasko
Collaborator

Can you trace at which point you got overlapping supervisions from Whisper and post more information? That generally shouldn't happen; it may be a bug in the annotate_with_whisper workflow.

@vorpaladin

I am having the same problem force-aligning Whisper output with wav2vec via annotate_with_whisper: a 14.4% failure rate due to this issue. I ran two disjoint sets of 90 speech audio files each (all files contain different speech), and 13 files from each set failed in this way. The failures vary in duration (10 to 60 minutes), noise, and accent (different UK regions), and I could not discern any obvious commonality between them. I am using Whisper settings device="cuda", language="en", and model_name="small.en".

@pzelasko
Collaborator

Can you show an example of a cut with overlapping supervisions? Maybe that would help trace it if it's a bug in how we parse Whisper or something else.

@mattzque

I might be seeing the same or similar issues (I think) but perhaps I'm doing something wrong.

Code:

from pathlib import Path

from lhotse import (
    Recording,
    RecordingSet,
    align_with_torchaudio,
    annotate_with_whisper,
)

recording = Recording.from_file(Path("./bug2.wav"))
recordings = RecordingSet.from_recordings([recording])

# Transcribe with Whisper, then force-align with wav2vec via torchaudio.
cuts = annotate_with_whisper(recordings, model_name="tiny", device="cuda")
cuts_aligned = align_with_torchaudio(cuts)

# Both workflows are lazy generators; pull the first cut to trigger them.
next(cuts_aligned)

Error:

ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(171.0, 171.0, SupervisionSegment(id='bug2-000058', recording_id='bug2', start=171.0, duration=0.0, channel=0, text='Yeah.', language='en', speaker=None, gender=None, custom=None, alignment=None))

Added logging of the cut, here is the full output: https://gist.github.com/mattzque/0e048e68ff5a0034a8c7d21511d2e2b0

When I change the model from tiny to medium, I get this error:

AssertionError: We don't support forced alignment of cuts with overlapping supervisions (cut ID: 'bug2')

Full output of the cut again: https://gist.github.com/mattzque/6762b16ea6645454c1cb250a86aecb5b

Using lhotse==1.11.0 and whisper==1.1.10

bug2.wav.zip

@desh2608
Collaborator

I think both of these bugs may be related to the timestamp postprocessing that is performed on the segments generated by Whisper (see here). We truncate a segment's end time to the start time of the next segment, so that no overlapping segments are generated. I think a small bug might be that the segments are not sorted by start time before the post-processing. Could you try adding supervisions = sorted(supervisions, key=lambda s: s.start) at the start of this function? I think this would fix the AssertionError about the overlapping supervisions.
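The suggested fix can be sketched outside of lhotse with a simplified stand-in for SupervisionSegment (the `Seg` class and `postprocess` function below are illustrative assumptions, not lhotse's actual code): sort by start time first, then truncate each segment's end to the start of the next segment, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    # Simplified stand-in for lhotse's SupervisionSegment.
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration


def postprocess(segments):
    if not segments:
        return []
    # Sort by start time first -- the step suspected to be missing.
    segments = sorted(segments, key=lambda s: s.start)
    out = []
    for cur, nxt in zip(segments, segments[1:]):
        # Truncate each segment's end to the next segment's start,
        # so no overlapping segments remain.
        new_end = min(cur.end, nxt.start)
        out.append(Seg(cur.start, new_end - cur.start))
    out.append(segments[-1])
    return out
```

Without the sort, unordered input would be truncated against the wrong neighbor and could even produce negative durations, which matches the symptoms reported above.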

However, I feel that there is a larger issue here. This kind of post-processing probably causes a lot of long segments to get truncated. Consider the following case:

    ┌──────────────────────────────────────────────────────────────────────┐
    │                         Hello this is John.                          │
    └──────────────────────────────────────────────────────────────────────┘
                                                   ┌────────┐
                                                   │   Hi   │
                                                   └────────┘

Whisper recognizes a long segment and a short overlapping segment, but if we apply the post-processing, the long segment's end will get truncated to the start time of the short segment, which will result in cutting a lot of the audio. I feel like we should disable the post-processing by default and only provide it as an additional option to be enabled. This ties into the issue with overlapping supervisions in the forced alignment workflow as follows.

In the align_with_torchaudio workflow, since we already trim the cuts to supervisions (with keep_overlapping=False) before performing alignments, perhaps we can remove the assertion about the non-overlapping supervisions? The alignment won't be very good on the overlapped regions, but at least it would be more flexible, and we won't truncate or remove any supervision segments.
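The assertion under discussion boils down to an ordered pairwise overlap check. A minimal sketch of that check (plain `(start, end)` tuples here are an assumption for illustration, not lhotse's API):

```python
def has_overlapping_supervisions(segments):
    """Return True if any (start, end) interval overlaps its successor.

    Minimal stand-in for the check behind lhotse's assertion; segments
    are plain (start, end) tuples, sorted by start time before checking.
    """
    ordered = sorted(segments)
    return any(
        nxt_start < cur_end
        for (_, cur_end), (nxt_start, _) in zip(ordered, ordered[1:])
    )
```

Removing the assertion, as proposed, would let such overlapping segments flow through to alignment instead of aborting the whole cut.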

@desh2608 desh2608 changed the title Bug when transcribing audio with Whipser Bug when transcribing audio with Whisper Dec 15, 2022
@desh2608 desh2608 added the bug Something isn't working label Dec 15, 2022
@mattzque

> Could you try adding supervisions = sorted(supervisions, key=lambda s: s.start) at the start of this function?

If I add this before the assertion in align_with_torchaudio:

[...]
for cut in cuts:
    supervisions = cut.supervisions
    supervisions = sorted(supervisions, key=lambda s: s.start)
    cut.supervisions = supervisions

    assert not cut.has_overlapping_supervisions, [...]

I get the same error "We don't support forced alignment of cuts with overlapping supervisions (cut ID: 'bug2')".

If I remove the assertion, I get this error:

ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(99.92, 90.72, SupervisionSegment(id='bug2-000031', recording_id='bug2', start=99.92, duration=-9.2, channel=0, text='And feeling the', language='en', speaker=None, gender=None, custom=None, alignment=None))

@desh2608
Collaborator

No, I meant that you may need to bypass the whole _postprocess_timestamps function in the whisper workflow.

@desh2608
Collaborator


Also, I meant to add the sorting at the start of the _postprocess_timestamps function.

@pzelasko
Collaborator

> Whisper recognizes a long segment and a short overlapping segment, but if we apply the post-processing, the long segment's end will get truncated to the start time of the short segment, which will result in cutting a lot of the audio. I feel like we should disable the post-processing by default and only provide it as an additional option to be enabled.

But I think that Whisper cannot recognize/annotate overlapping speech segments; it would simply show them as a single recognition/segment. I don't really understand where the overlaps are coming from. I'll try to find some time to debug this, but it might have to be after the holidays.

> In the align_with_torchaudio workflow, since we already trim the cuts to supervisions (with keep_overlapping=False) before performing alignments, perhaps we can remove the assertion about the non-overlapping supervisions?

If we must live with overlapping segments from Whisper then I agree, this solution looks reasonable.

@desh2608
Collaborator

I looked at the whisper output on the audio, and it seems the timestamps from Whisper itself are erroneous on that segment ('start': 99.92, 'end': 90.72), so there's not much we can do. I will make a PR to filter such segments, and to incorporate the other discussion.
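The filtering described here can be sketched as a simple guard on duration (the dict shape below is an assumption for illustration, mirroring Whisper's raw segment output; the helper name is hypothetical):

```python
def drop_invalid_segments(segments):
    # Whisper occasionally emits end < start (e.g. start=99.92,
    # end=90.72 in the case above); drop any segment whose duration
    # is not strictly positive before building supervisions.
    return [s for s in segments if s["end"] - s["start"] > 0]
```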

@mattzque

mattzque commented Dec 15, 2022

Thanks for looking into this, see also the example audio file I attached above.

@pzelasko
Collaborator

Thanks guys!

@desh2608
Collaborator

@mattzque the above PR should fix your issue.

pzelasko added a commit that referenced this issue Dec 16, 2022
This PR addresses #891.

* Remove segments with non-positive duration from the whisper output
* Segment post-processing to force non-overlapping is made optional
(disabled by default)
* Allow overlapping segments in forced alignment workflow
@mattzque

@desh2608 Thank you! This fixes the issues for me!

@desh2608
Collaborator

Resolved by #928.
