# Add reverb with fast RIR generator #799

## Conversation
Some time comparisons:

```
In [14]: %timeit reverb_cut_rir(c, r)
The slowest run took 8.14 times longer than the fastest. This could mean that an intermediate result is being cached.
148 ms ± 88.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit reverb_cut(c)
179 ms ± 57.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %timeit reverb_cut(c)
138 ms ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit reverb_cut(c)
281 ms ± 121 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Based on the above, I'm keeping the torch operations with cached resamplers.
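The "cached resamplers" idea mentioned above can be sketched with a memoized factory. This is a minimal stdlib illustration, not the PR's actual code: `Resampler` and `get_resampler` are hypothetical stand-ins (in practice the cached object would be something like `torchaudio.transforms.Resample`, whose kernel construction is the expensive part):

```python
from functools import lru_cache

class Resampler:
    """Hypothetical stand-in for an expensive-to-construct resampler."""
    def __init__(self, src_sr: int, dst_sr: int):
        # In a real resampler, building the filter kernels here is costly.
        self.src_sr, self.dst_sr = src_sr, dst_sr

@lru_cache(maxsize=None)
def get_resampler(src_sr: int, dst_sr: int) -> Resampler:
    # Construct once per (src_sr, dst_sr) pair; later calls reuse the instance.
    return Resampler(src_sr, dst_sr)

r1 = get_resampler(8000, 16000)
r2 = get_resampler(8000, 16000)
assert r1 is r2  # cached: the second call pays no construction cost
```

Repeated augmentation of many cuts at the same pair of sampling rates then amortizes the setup cost across calls, which is consistent with the timings above.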
Wow, nice speed-up on the PyTorch side. Let's fix the style checks, and LGTM!
Desh, do you have any insight into how much simulated vs. real RIRs matter for tasks such as ASR or diarization? I wonder whether, given this implementation, it still makes sense to use RIR datasets. The audio sample you uploaded actually sounds nice.
I haven't run comparisons myself yet, but at least in the paper it seems to match the results using real RIRs on speech enhancement tasks. Also, older studies have found that, at least for ASR, simulated vs. real RIRs do not make much difference, so I would imagine that this method should be comparable with those. In any case, I am hoping to prepare an AMI recipe for icefall soon, and I can conduct such a comparison then.
Thanks!
## [workflow] Multi-talker meeting simulation

A new workflow is implemented that can be used to create multi-talker, meeting-like mixtures from single utterances. For context, see the discussion in #823.

### Methods

Currently, we have implemented 2 methods for simulation.

1. **Speaker-independent:** This is based on the simulation algorithm described in [the original EEND paper from Hitachi](https://arxiv.org/abs/1909.06247). The idea is to simulate each speaker's track independently of the others by concatenating utterances with pauses sampled from an exponential distribution. All tracks are then mixed together. The parameters here are the `loc` and `scale` of the exponential distribution used to sample the pauses.
2. **Conversational:** This is based on the algorithm described in [this paper from BUT](https://arxiv.org/abs/2204.00890). Instead of simulating each speaker track independently, we first sample N utterances and shuffle them. They are then mixed together by sampling pause or overlap durations from a randomly initialized gamma distribution, or from learned histograms. The parameters here are `same_spk_pause`, `diff_spk_pause`, `diff_spk_overlap`, and `prob_diff_spk_overlap`.

More simulation methods will be implemented in the future.

### Implementation

All simulation algorithms are implemented in their own classes, derived from an abstract class `BaseMeetingSimulator`. The classes should implement the methods `fit()`, `simulate()`, and `reverberate()`:

* `fit()` learns the parameters of the simulator from a provided `SupervisionSet`, or uses sensible defaults if none is provided.
* `simulate()` generates meetings. Each meeting is a `MixedCut` object, where each track represents a specific source (speaker). The track itself may be a `MonoCut` (if there is only 1 utterance for the speaker) or a `MixedCut` (if multiple utterances are concatenated with padding). The input to `simulate()` is a `CutSet` containing the single utterances to be used for simulation. The number of meetings is controlled either by the desired number of meetings to generate (`num_meetings`) or by the number of times each utterance will be used (`num_repeats`).
* `reverberate()` convolves the meeting with RIRs. Either external RIRs can be provided, or RIRs will be generated on-the-fly using a fast random approximation (see #799).

### Usage

1. Python

```python
from lhotse.workflows.meeting_simulation import ConversationalMeetingSimulator

# "sups" is a SupervisionSet for a meeting-style dataset such as AMI
# "cuts" is a CutSet containing single utterances, e.g., LibriSpeech
simulator = ConversationalMeetingSimulator()
simulator.fit(sups)
meetings = simulator.simulate(cuts, num_repeats=5)
meetings = simulator.reverberate(meetings)
```

2. CLI

```
lhotse workflows simulate-meetings cuts.jsonl.gz out_cuts.jsonl.gz -m conversational -f sups.jsonl.gz -r 5 --reverberate
```

For full CLI usage, see `lhotse workflows simulate-meetings --help`.

### Example

We generated mixtures using the LibriSpeech `dev-clean` subset (`num_repeats=2, num_speakers_per_meeting=(3,4)`), based on the distribution learned from the AMI dev set.

**Method = speaker-independent**

```
$ lhotse workflows simulate-meetings data/manifests/librispeech_cuts_trimmed_dev-clean.jsonl.gz mix_si.jsonl.gz -m independent -f ami-sdm_supervisions_dev.jsonl.gz -r 2 -s 3,4 --reverberate
Fitting the meeting simulator to the provided supervisions...
Learned parameters: loc=0.00, scale=7.77
Simulating meetings...
895it [00:02, 399.64it/s]
Reverberating the simulated meetings...
Saving the simulated meetings...
```

**Method = conversational**

```
$ lhotse workflows simulate-meetings data/manifests/librispeech_cuts_trimmed_dev-clean.jsonl.gz mix_conv.jsonl.gz -m conversational -f ami-sdm_supervisions_dev.jsonl.gz -r 2 -s 3,4 --reverberate
Fitting the meeting simulator to the provided supervisions...
Learned parameters: ConversationalMeetingSimulator (same_spk_pause=8.66, diff_spk_pause=1.55, diff_spk_overlap=1.62, prob_diff_spk_overlap=0.55)
Simulating meetings...
881it [00:02, 339.92it/s]
Reverberating the simulated meetings...
Saving the simulated meetings...
```

Statistics of generated meetings:

**Method = speaker-independent**

```
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 895      │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 13:13:55 │
├───────────────────────────┼──────────┤
│ mean                      │ 53.2     │
├───────────────────────────┼──────────┤
│ std                       │ 17.2     │
├───────────────────────────┼──────────┤
│ min                       │ 15.4     │
├───────────────────────────┼──────────┤
│ 25%                       │ 40.9     │
├───────────────────────────┼──────────┤
│ 50%                       │ 51.3     │
├───────────────────────────┼──────────┤
│ 75%                       │ 62.2     │
├───────────────────────────┼──────────┤
│ 99%                       │ 103.8    │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 111.6    │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 124.3    │
├───────────────────────────┼──────────┤
│ max                       │ 138.4    │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 895      │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 11024    │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═══════════════════════════╕
│ Total speech duration        │ 07:11:33 │ 54.36% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total speaking time duration │ 09:35:25 │ 72.48% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total silence duration       │ 06:02:22 │ 45.64% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Single-speaker duration      │ 05:08:31 │ 38.86% (71.49% of speech) │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Overlapped speech duration   │ 02:03:03 │ 15.50% (28.51% of speech) │
╘══════════════════════════════╧══════════╧═══════════════════════════╛
Speech duration statistics by number of speakers:
╒══════════════════════╤═══════════════════════╤════════════════════════════╤═══════════════╤══════════════════════╕
│ Number of speakers   │ Duration (hh:mm:ss)   │ Speaking time (hh:mm:ss)   │ % of speech   │ % of speaking time   │
╞══════════════════════╪═══════════════════════╪════════════════════════════╪═══════════════╪══════════════════════╡
│ 1                    │ 05:08:31              │ 05:08:31                   │ 71.49%        │ 53.61%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 2                    │ 01:43:20              │ 03:26:40                   │ 23.94%        │ 35.92%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 3                    │ 00:18:35              │ 00:55:43                   │ 4.30%         │ 9.68%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 4                    │ 00:01:08              │ 00:04:32                   │ 0.26%         │ 0.79%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ Total                │ 07:11:33              │ 09:35:25                   │ 100.00%       │ 100.00%              │
╘══════════════════════╧═══════════════════════╧════════════════════════════╧═══════════════╧══════════════════════╛
```

**Method = conversational**

```
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 881      │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 15:32:42 │
├───────────────────────────┼──────────┤
│ mean                      │ 63.5     │
├───────────────────────────┼──────────┤
│ std                       │ 36.7     │
├───────────────────────────┼──────────┤
│ min                       │ 21.9     │
├───────────────────────────┼──────────┤
│ 25%                       │ 43.5     │
├───────────────────────────┼──────────┤
│ 50%                       │ 53.6     │
├───────────────────────────┼──────────┤
│ 75%                       │ 69.9     │
├───────────────────────────┼──────────┤
│ 99%                       │ 221.1    │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 237.6    │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 327.3    │
├───────────────────────────┼──────────┤
│ max                       │ 368.0    │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 881      │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 11024    │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═══════════════════════════╕
│ Total speech duration        │ 08:24:60 │ 54.14% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total speaking time duration │ 09:35:25 │ 61.69% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total silence duration       │ 07:07:43 │ 45.86% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Single-speaker duration      │ 07:15:21 │ 46.68% (86.21% of speech) │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Overlapped speech duration   │ 01:09:39 │ 7.47% (13.79% of speech)  │
╘══════════════════════════════╧══════════╧═══════════════════════════╛
Speech duration statistics by number of speakers:
╒══════════════════════╤═══════════════════════╤════════════════════════════╤═══════════════╤══════════════════════╕
│ Number of speakers   │ Duration (hh:mm:ss)   │ Speaking time (hh:mm:ss)   │ % of speech   │ % of speaking time   │
╞══════════════════════╪═══════════════════════╪════════════════════════════╪═══════════════╪══════════════════════╡
│ 1                    │ 07:15:21              │ 07:15:21                   │ 86.21%        │ 75.66%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 2                    │ 01:08:52              │ 02:17:44                   │ 13.64%        │ 23.93%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 3                    │ 00:00:48              │ 00:02:22                   │ 0.16%         │ 0.41%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 4                    │ 00:00:00              │ 00:00:00                   │ 0.00%         │ 0.00%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ Total                │ 08:24:60              │ 09:35:25                   │ 100.00%       │ 100.00%              │
╘══════════════════════╧═══════════════════════╧════════════════════════════╧═══════════════╧══════════════════════╛
```
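To make the conversational method more concrete, here is a toy sketch of the offset-sampling idea using the parameter names from the description above. `sample_offsets` is a hypothetical helper, not the Lhotse API, and it samples from exponential distributions for brevity, whereas the actual simulator uses gamma distributions or learned histograms:

```python
import random

def sample_offsets(utt_durs, spks, same_spk_pause=1.0, diff_spk_pause=1.0,
                   diff_spk_overlap=1.0, prob_diff_spk_overlap=0.5):
    """Place shuffled utterances on a timeline by sampling pauses/overlaps.

    Toy sketch of the conversational method: a pause after the same speaker,
    and either a pause or an overlap (with prob_diff_spk_overlap) after a
    speaker change.
    """
    starts, t = [], 0.0          # t tracks the end of the previous utterance
    prev_spk, prev_dur = None, 0.0
    for dur, spk in zip(utt_durs, spks):
        if prev_spk is None:
            gap = 0.0            # the first utterance starts at time 0
        elif spk == prev_spk:
            gap = random.expovariate(1.0 / same_spk_pause)        # pause, same speaker
        elif random.random() < prob_diff_spk_overlap:
            # Overlap: start before the previous utterance ends, but not
            # before it starts.
            gap = -min(random.expovariate(1.0 / diff_spk_overlap), prev_dur)
        else:
            gap = random.expovariate(1.0 / diff_spk_pause)        # pause, speaker change
        t = max(0.0, t + gap)
        starts.append(t)
        t += dur
        prev_spk, prev_dur = spk, dur
    return starts
```

The returned start times could then be used to place each utterance as a track of a mixed cut; fitting would amount to estimating the four parameters from pause/overlap statistics of a real meeting corpus.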
This PR adds the functionality to synthesize a room impulse response on-the-fly for use in the `reverb_rir()` methods of `Cut` and `Recording` (see discussion in #787). The algorithm is based on this paper. Here is an audio sample: Google drive link.

Usage: basically, all the same methods still work, but they will use the random RIR generator if no RIR recording is explicitly provided.
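To illustrate the on-the-fly idea only (this is not the algorithm from the paper, just a crude stand-in), a random RIR can be approximated as exponentially decaying white noise and convolved with the dry signal. `toy_random_rir` and `apply_rir` are hypothetical names:

```python
import numpy as np

def toy_random_rir(sr: int = 16000, rt60: float = 0.5, dur: float = 0.8) -> np.ndarray:
    """Crude synthetic RIR: white noise with an exponential decay envelope.

    rt60 is the time for the response to decay by 60 dB; ln(10**3) ~= 6.908
    gives that decay rate for the amplitude envelope.
    """
    n = int(dur * sr)
    t = np.arange(n) / sr
    decay = np.exp(-6.908 * t / rt60)
    rir = np.random.randn(n) * decay
    rir[0] = 1.0                       # direct path
    return rir / np.max(np.abs(rir))   # normalize peak amplitude to 1

def apply_rir(audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    # Convolve and trim back to the original signal length.
    return np.convolve(audio, rir)[: len(audio)]
```

A real generator (like the one in this PR) shapes the decay per frequency band and models early reflections, but the synthesize-then-convolve structure is the same.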