
Bug when transcribing audio with Whisper #891

Closed
SGombert opened this issue Nov 15, 2022 · 15 comments

Labels
bug Something isn't working

Comments

@SGombert

Hi there,

I am currently working on a project where we first transcribe audio recordings with Whisper and then force align them with Wav2Vec.

When I pass the cuts generated by annotate_with_whisper into align_with_torchaudio, I get the following error:

AssertionError: We don't support forced alignment of cuts with overlapping supervisions (cut ID: 'A011221._001_TR6')

When I try to trim the supervisions, I run into another issue where suddenly supervisions with a negative duration appear.

So far, I could not find a fix myself. Do you have any suggestions?

@pzelasko
Collaborator

Can you trace at which point you got overlapping supervisions from Whisper and post more information? That generally shouldn't happen; it may be a bug in the annotate_with_whisper workflow.

@vorpaladin

I am having the same problem force-aligning Whisper output with wav2vec via annotate_with_whisper: a 14.4% failure rate due to this issue. I ran two disjoint sets of 90 speech audio files each (all files contain different speech), and 13 files from each set failed in this way. The failures vary in duration (10 to 60 minutes), noise, and accent (different UK regions), and I could not discern any obvious commonality between them. I am using Whisper settings device="cuda", language="en", and model_name="small.en".

@pzelasko
Collaborator

Can you show an example of a cut with overlapping supervisions? Maybe that would help trace it if it's a bug in how we parse Whisper or something else.

@mattzque

I might be seeing the same or similar issues (I think) but perhaps I'm doing something wrong.

Code:

from pathlib import Path

from lhotse import (
    Recording,
    RecordingSet,
    align_with_torchaudio,
    annotate_with_whisper,
)

recording = Recording.from_file(Path("./bug2.wav"))
recordings = RecordingSet.from_recordings([recording])

# Transcribe with Whisper, then force-align with wav2vec via torchaudio.
cuts = annotate_with_whisper(recordings, model_name="tiny", device="cuda")
cuts_aligned = align_with_torchaudio(cuts)

# Both workflows are lazy generators; pull the first cut to trigger them.
next(cuts_aligned)

Error:

ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(171.0, 171.0, SupervisionSegment(id='bug2-000058', recording_id='bug2', start=171.0, duration=0.0, channel=0, text='Yeah.', language='en', speaker=None, gender=None, custom=None, alignment=None))

Added logging of the cut, here is the full output: https://gist.github.com/mattzque/0e048e68ff5a0034a8c7d21511d2e2b0

When I change the model from tiny to medium, I get this error:

AssertionError: We don't support forced alignment of cuts with overlapping supervisions (cut ID: 'bug2')

Full output of the cut again: https://gist.github.com/mattzque/6762b16ea6645454c1cb250a86aecb5b

Using lhotse==1.11.0 and whisper==1.1.10

bug2.wav.zip

@desh2608
Collaborator

I think both of these bugs may be related to the timestamp postprocessing that is performed on the segments generated by Whisper (see here). We truncate a segment's end time to the start time of the next segment, so that no overlapping segments are generated. I think a small bug might be that the segments are not sorted by start time before the post-processing. Could you try adding supervisions = sorted(supervisions, key=lambda s: s.start) at the start of this function? I think this would fix the AssertionError about the overlapping supervisions.
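The suggested fix can be sketched outside of lhotse with a simplified stand-in for SupervisionSegment (the `Seg` class and `postprocess` function below are illustrative assumptions, not lhotse's actual code): sort by start time first, then truncate each segment's end to the start of the next segment, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    # Simplified stand-in for lhotse's SupervisionSegment.
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration


def postprocess(segments):
    if not segments:
        return []
    # Sort by start time first -- the step suspected to be missing.
    segments = sorted(segments, key=lambda s: s.start)
    out = []
    for cur, nxt in zip(segments, segments[1:]):
        # Truncate each segment's end to the next segment's start,
        # so no overlapping segments remain.
        new_end = min(cur.end, nxt.start)
        out.append(Seg(cur.start, new_end - cur.start))
    out.append(segments[-1])
    return out
```

Without the sort, unordered input would be truncated against the wrong neighbor and could even produce negative durations, which matches the symptoms reported above.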

However, I feel that there is a larger issue here. This kind of post-processing probably causes a lot of long segments to get truncated. Consider the following case:

    ┌──────────────────────────────────────────────────────────────────────┐
    │                         Hello this is John.                          │
    └──────────────────────────────────────────────────────────────────────┘
                                                   ┌────────┐
                                                   │   Hi   │
                                                   └────────┘

Whisper recognizes a long segment and a short overlapping segment, but if we apply the post-processing, the long segment's end will get truncated to the start time of the short segment, which will result in cutting a lot of the audio. I feel like we should disable the post-processing by default and only provide it as an additional option to be enabled. This ties into the issue with overlapping supervisions in the forced alignment workflow as follows.

In the align_with_torchaudio workflow, since we already trim the cuts to supervisions (with keep_overlapping=False) before performing alignments, perhaps we can remove the assertion about the non-overlapping supervisions? The alignment won't be very good on the overlapped regions, but at least it would be more flexible, and we won't truncate or remove any supervision segments.
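The assertion under discussion boils down to an ordered pairwise overlap check. A minimal sketch of that check (plain `(start, end)` tuples here are an assumption for illustration, not lhotse's API):

```python
def has_overlapping_supervisions(segments):
    """Return True if any (start, end) interval overlaps its successor.

    Minimal stand-in for the check behind lhotse's assertion; segments
    are plain (start, end) tuples, sorted by start time before checking.
    """
    ordered = sorted(segments)
    return any(
        nxt_start < cur_end
        for (_, cur_end), (nxt_start, _) in zip(ordered, ordered[1:])
    )
```

Removing the assertion, as proposed, would let such overlapping segments flow through to alignment instead of aborting the whole cut.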

@desh2608 desh2608 changed the title Bug when transcribing audio with Whipser Bug when transcribing audio with Whisper Dec 15, 2022
@desh2608 desh2608 added the bug Something isn't working label Dec 15, 2022
@mattzque

> Could you try adding supervisions = sorted(supervisions, key=lambda s: s.start) at the start of this function?

If I add this before the assertion in align_with_torchaudio:

[...]
for cut in cuts:
    supervisions = cut.supervisions
    supervisions = sorted(supervisions, key=lambda s: s.start)
    cut.supervisions = supervisions

    assert not cut.has_overlapping_supervisions, [...]

I get the same error "We don't support forced alignment of cuts with overlapping supervisions (cut ID: 'bug2')".

If I remove the assertion, I get this error:

ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(99.92, 90.72, SupervisionSegment(id='bug2-000031', recording_id='bug2', start=99.92, duration=-9.2, channel=0, text='And feeling the', language='en', speaker=None, gender=None, custom=None, alignment=None))

@desh2608
Collaborator

No, I meant that you may need to bypass the whole _postprocess_timestamps function in the whisper workflow.

@desh2608
Collaborator


Also, I meant to add the sorting at the start of the _postprocess_timestamps function.

@pzelasko
Collaborator

> Whisper recognizes a long segment and a short overlapping segment, but if we apply the post-processing, the long segment's end will get truncated to the start time of the short segment, which will result in cutting a lot of the audio. I feel like we should disable the post-processing by default and only provide it as an additional option to be enabled.

But I think that Whisper cannot recognize/annotate overlapping speech segments; it would simply show them as a single recognition/segment. I don't really understand where the overlaps are coming from. I'll try to find some time to debug this, but it might have to be after the holidays.

> In the align_with_torchaudio workflow, since we already trim the cuts to supervisions (with keep_overlapping=False) before performing alignments, perhaps we can remove the assertion about the non-overlapping supervisions?

If we must live with overlapping segments from Whisper then I agree, this solution looks reasonable.

@desh2608
Collaborator

I looked at the whisper output on the audio, and it seems the timestamps from Whisper itself are erroneous on that segment ('start': 99.92, 'end': 90.72), so there's not much we can do. I will make a PR to filter such segments, and to incorporate the other discussion.
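The filtering described here can be sketched as a simple guard on duration (the dict shape below is an assumption for illustration, mirroring Whisper's raw segment output; the helper name is hypothetical):

```python
def drop_invalid_segments(segments):
    # Whisper occasionally emits end < start (e.g. start=99.92,
    # end=90.72 in the case above); drop any segment whose duration
    # is not strictly positive before building supervisions.
    return [s for s in segments if s["end"] - s["start"] > 0]
```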

@mattzque

mattzque commented Dec 15, 2022

Thanks for looking into this, see also the example audio file I attached above.

@pzelasko
Collaborator

Thanks guys!

@desh2608
Collaborator

@mattzque the above PR should fix your issue.

pzelasko added a commit that referenced this issue Dec 16, 2022
This PR addresses #891.

* Remove segments with non-positive duration from the whisper output
* Segment post-processing to force non-overlapping is made optional
(disabled by default)
* Allow overlapping segments in forced alignment workflow
@mattzque

@desh2608 Thank you! This fixes the issues for me!

@desh2608
Collaborator

Resolved by #928.
