
Add reverb with fast RIR generator #799

Merged (9 commits into lhotse-speech:master on Sep 1, 2022)
Conversation

desh2608 (Collaborator) commented Sep 1, 2022

This PR adds the functionality to synthesize a room impulse response on-the-fly for use with the reverb_rir() methods of Cut and Recording (see discussion in #787). The algorithm is based on this paper. Here is an audio sample: Google drive link.

Usage:
All the existing methods continue to work as before, but they will fall back to the random RIR generator if no RIR recording is explicitly provided.
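The underlying operation is reverberation by convolution with an RIR. Here is a minimal NumPy sketch of that idea using a toy exponentially decaying noise RIR; this is only an illustration of convolution-based reverb, not the fast diffuse-reverberation algorithm from the paper, and `synthetic_rir`/`reverberate` are hypothetical helper names, not Lhotse APIs.

```python
import numpy as np

def synthetic_rir(duration: float = 0.3, sr: int = 16000, rt60: float = 0.2) -> np.ndarray:
    """Toy RIR: white noise shaped by an exponential energy decay.
    (Illustration only; NOT the fast RIR generator used in this PR.)"""
    n = int(duration * sr)
    # Decay chosen so the envelope drops by 60 dB at t = rt60.
    decay = np.exp(-6.908 * np.arange(n) / (rt60 * sr))
    rng = np.random.default_rng(0)
    return rng.standard_normal(n) * decay

def reverberate(signal: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Apply reverb by convolving the dry signal with the RIR,
    truncating back to the original length and peak-normalizing."""
    wet = np.convolve(signal, rir)[: len(signal)]
    return wet / max(np.abs(wet).max(), 1e-12)

dry = np.random.default_rng(1).standard_normal(16000)  # 1 s of noise at 16 kHz
wet = reverberate(dry, synthetic_rir())
```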

@desh2608 desh2608 added the enhancement New feature or request label Sep 1, 2022
@desh2608 desh2608 changed the title Add reverb with fast RIR generator WIP: Add reverb with fast RIR generator Sep 1, 2022
@desh2608 desh2608 changed the title WIP: Add reverb with fast RIR generator Add reverb with fast RIR generator Sep 1, 2022
@desh2608 desh2608 requested a review from pzelasko September 1, 2022 16:47
desh2608 (Collaborator, Author) commented Sep 1, 2022

Some time comparisons:

1. Reverb with RIR specified

   ```
   In [14]: %timeit reverb_cut_rir(c,r)
   The slowest run took 8.14 times longer than the fastest. This could mean that an intermediate result is being cached.
   148 ms ± 88.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```

2. Reverb with generated RIR (torchaudio Resample)

   ```
   In [9]: %timeit reverb_cut(c)
   179 ms ± 57.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```

3. Reverb with generated RIR (cached Resample)

   ```
   In [15]: %timeit reverb_cut(c)
   138 ms ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```

4. Using NumPy instead of torch

   ```
   In [6]: %timeit reverb_cut(c)
   281 ms ± 121 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```

Based on the above, I'm keeping the torch operations with cached resamplers.
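The "cached Resample" variant avoids reconstructing the resampler on every call; constructing one is expensive because filter kernels are precomputed (as with `torchaudio.transforms.Resample`). The caching pattern can be sketched with `functools.lru_cache`; the `Resampler` class and `get_resampler` factory below are hypothetical stand-ins, not Lhotse's actual implementation.

```python
from functools import lru_cache

class Resampler:
    """Stand-in for an expensive-to-construct resampler
    (e.g. one that precomputes filter kernels at init time)."""
    def __init__(self, orig_freq: int, new_freq: int):
        self.orig_freq, self.new_freq = orig_freq, new_freq

@lru_cache(maxsize=None)
def get_resampler(orig_freq: int, new_freq: int) -> Resampler:
    # Constructed once per (orig_freq, new_freq) pair; reused thereafter.
    return Resampler(orig_freq, new_freq)

a = get_resampler(8000, 16000)
b = get_resampler(8000, 16000)  # cache hit: same object as `a`
```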

pzelasko (Collaborator) commented Sep 1, 2022

Wow, nice speed up on PyTorch side. Let’s fix the style checks and LGTM!

pzelasko (Collaborator) commented Sep 1, 2022

Desh, do you have any insight into how much simulated vs. real RIRs matter for tasks such as ASR or diarization? I wonder whether, given this implementation, it still makes sense to use RIR datasets. The audio sample you uploaded actually sounds nice.

desh2608 (Collaborator, Author) commented Sep 1, 2022

> Desh do you have any insight how much simulated vs real RIR matters for tasks such as ASR or diarization? I wonder if given this implementation it still makes sense to use RIR datasets. The audio sample you uploaded sounds nice actually.

I haven't run comparisons myself yet, but at least in the paper, the method seems to match the results obtained with real RIRs on speech enhancement tasks. Also, older studies have found that, at least for ASR, simulated vs. real RIRs do not make much difference, so I would imagine this method should be comparable to those.

In any case, I am hoping to prepare an AMI recipe for icefall soon, and I can conduct such a comparison then.

pzelasko (Collaborator) commented Sep 1, 2022

Thanks!

@pzelasko pzelasko merged commit 48fc528 into lhotse-speech:master Sep 1, 2022
@pzelasko pzelasko added this to the v1.7 milestone Sep 1, 2022
@desh2608 desh2608 deleted the fast_rir branch September 1, 2022 23:30
pzelasko added a commit that referenced this pull request Jan 4, 2023
## [workflow] Multi-talker meeting simulation

A new workflow is implemented that can be used to create multi-talker
meeting-like mixtures using single utterances. For context, see
discussion in #823.

### Methods

Currently, we have implemented 2 methods for simulation.

1. **Speaker-independent:** This is based on the simulation algorithm
described in [the original EEND paper from
Hitachi](https://arxiv.org/abs/1909.06247). The idea is to simulate each
speaker's track independently of others by concatenating utterances with
pauses sampled from an exponential distribution. All tracks are then
mixed together. The parameters here are the `loc` and `scale` of the
exponential distribution to sample the pauses.

2. **Conversational:** This is based on the algorithm described in [this
paper from BUT](https://arxiv.org/abs/2204.00890). Instead of simulating
each speaker track independently, we first sample N utterances and
shuffle them. They are then mixed together by sampling pause or overlap
duration from a randomly initialized gamma distribution, or from learned
histograms. The parameters here are `same_spk_pause`, `diff_spk_pause`,
`diff_spk_overlap`, and `prob_diff_spk_overlap`.

More simulation methods will be implemented in the future.
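The speaker-independent method above can be sketched in a few lines: each speaker's track is built by concatenating that speaker's utterances with silences drawn from an exponential distribution (`loc + Exp(scale)`), and the per-speaker tracks are then summed. This is a toy NumPy illustration of the scheme, not Lhotse's implementation; the array "utterances" and parameter values are made up.

```python
import numpy as np

def simulate_track(utterances, loc=0.0, scale=1.0, sr=16000, seed=0):
    """Toy speaker-independent simulation for ONE speaker: insert a
    pause sampled from loc + Exp(scale) seconds before each utterance."""
    rng = np.random.default_rng(seed)
    pieces = []
    for utt in utterances:
        pause = loc + rng.exponential(scale)
        pieces.append(np.zeros(int(pause * sr)))  # silence
        pieces.append(utt)
    return np.concatenate(pieces)

# Two speakers, each with two toy 0.5 s "utterances"; tracks are padded
# to a common length and summed, as in the independent mixing scheme.
utts = [np.ones(8000), np.ones(8000)]
t1 = simulate_track(utts, seed=1)
t2 = simulate_track(utts, seed=2)
n = max(len(t1), len(t2))
mix = np.pad(t1, (0, n - len(t1))) + np.pad(t2, (0, n - len(t2)))
```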

### Implementation

All simulation algorithms are implemented in their own classes that are
derived from an abstract class `BaseMeetingSimulator`. The classes
should implement the methods `fit()`, `simulate()`, and `reverberate()`:

* `fit()` learns the parameters of the simulator based on a provided
`SupervisionSet`, or sensible defaults if none is provided.
* `simulate()` generates meetings. Each meeting is a `MixedCut` object,
where each track represents a specific source (speaker). The track
itself may be a `MonoCut` (if there is only 1 utterance for the speaker)
or a `MixedCut` (if multiple utterances are concatenated with padding).
The input to `simulate()` is a `CutSet` containing single utterances
that will be used for simulation. The number of meetings is controlled
either by a desired number of meetings to be generated (`num_meetings`),
or the number of times each utterance will be used (`num_repeats`).
* `reverberate()` convolves the meeting with RIRs. Either some external
RIRs can be provided, or we will generate RIRs on-the-fly using a fast
random approximation (see
#799).
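The interface described above can be sketched as an abstract base class; this is a schematic reading of the description, not the real Lhotse code (the actual classes operate on `SupervisionSet`/`CutSet`/`MixedCut` objects, and `DummySimulator` is a made-up subclass just to show the call sequence).

```python
from abc import ABC, abstractmethod

class BaseMeetingSimulator(ABC):
    """Sketch of the simulator interface described above."""

    @abstractmethod
    def fit(self, supervisions=None):
        """Learn pause/overlap parameters from supervisions,
        or fall back to sensible defaults when none are given."""

    @abstractmethod
    def simulate(self, cuts, num_meetings=None, num_repeats=None):
        """Generate meetings by mixing the input utterances."""

    @abstractmethod
    def reverberate(self, meetings, rirs=None):
        """Convolve meetings with provided RIRs, or with RIRs
        generated on-the-fly (see #799)."""

class DummySimulator(BaseMeetingSimulator):
    # Minimal concrete subclass, just to exercise the interface.
    def fit(self, supervisions=None):
        self.fitted = True
        return self
    def simulate(self, cuts, num_meetings=None, num_repeats=None):
        return list(cuts) * (num_repeats or 1)
    def reverberate(self, meetings, rirs=None):
        return meetings

sim = DummySimulator().fit()
meetings = sim.reverberate(sim.simulate(["utt1", "utt2"], num_repeats=2))
```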

### Usage

1. Python

```python
from lhotse.workflows.meeting_simulation import ConversationalMeetingSimulator

# "sups" is a SupervisionSet for a meeting-style dataset such as AMI
# "cuts" is a CutSet containing single utterances, e.g., LibriSpeech

simulator = ConversationalMeetingSimulator()
simulator.fit(sups)
meetings = simulator.simulate(cuts, num_repeats=5)
meetings = simulator.reverberate(meetings)
```

2. CLI

```
lhotse workflows simulate-meetings cuts.jsonl.gz out_cuts.jsonl.gz -m conversational -f sups.jsonl.gz -r 5 --reverberate 
```

For full CLI usage, see `lhotse workflows simulate-meetings --help`.

### Example

We generated mixtures using the LibriSpeech `dev-clean` subset
(`num_repeat=2, num_speakers_per_meeting=(3,4)`), based on the
distribution learned from AMI dev set.

**Method = speaker-independent**

```
$ lhotse workflows simulate-meetings data/manifests/librispeech_cuts_trimmed_dev-clean.jsonl.gz mix_si.jsonl.gz -m independent -f ami-sdm_supervisions_dev.jsonl.gz -r 2 -s 3,4 --reverberate

Fitting the meeting simulator to the provided supervisions...
Learned parameters: loc=0.00, scale=7.77
Simulating meetings...
895it [00:02, 399.64it/s]
Reverberating the simulated meetings...
Saving the simulated meetings...
```

**Method = conversational**
```
$ lhotse workflows simulate-meetings data/manifests/librispeech_cuts_trimmed_dev-clean.jsonl.gz mix_conv.jsonl.gz -m conversational -f ami-sdm_supervisions_dev.jsonl.gz -r 2 -s 3,4 --reverberate

Fitting the meeting simulator to the provided supervisions...
Learned parameters: ConversationalMeetingSimulator(same_spk_pause=8.66, diff_spk_pause=1.55, diff_spk_overlap=1.62, prob_diff_spk_overlap=0.55)
Simulating meetings...
881it [00:02, 339.92it/s]
Reverberating the simulated meetings...
Saving the simulated meetings...
```

Statistics of generated meetings:

**Method = speaker-independent**
```
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 895      │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 13:13:55 │
├───────────────────────────┼──────────┤
│ mean                      │ 53.2     │
├───────────────────────────┼──────────┤
│ std                       │ 17.2     │
├───────────────────────────┼──────────┤
│ min                       │ 15.4     │
├───────────────────────────┼──────────┤
│ 25%                       │ 40.9     │
├───────────────────────────┼──────────┤
│ 50%                       │ 51.3     │
├───────────────────────────┼──────────┤
│ 75%                       │ 62.2     │
├───────────────────────────┼──────────┤
│ 99%                       │ 103.8    │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 111.6    │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 124.3    │
├───────────────────────────┼──────────┤
│ max                       │ 138.4    │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 895      │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 11024    │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═══════════════════════════╕
│ Total speech duration        │ 07:11:33 │ 54.36% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total speaking time duration │ 09:35:25 │ 72.48% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total silence duration       │ 06:02:22 │ 45.64% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Single-speaker duration      │ 05:08:31 │ 38.86% (71.49% of speech) │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Overlapped speech duration   │ 02:03:03 │ 15.50% (28.51% of speech) │
╘══════════════════════════════╧══════════╧═══════════════════════════╛
Speech duration statistics by number of speakers:
╒══════════════════════╤═══════════════════════╤════════════════════════════╤═══════════════╤══════════════════════╕
│ Number of speakers   │ Duration (hh:mm:ss)   │ Speaking time (hh:mm:ss)   │ % of speech   │ % of speaking time   │
╞══════════════════════╪═══════════════════════╪════════════════════════════╪═══════════════╪══════════════════════╡
│ 1                    │ 05:08:31              │ 05:08:31                   │ 71.49%        │ 53.61%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 2                    │ 01:43:20              │ 03:26:40                   │ 23.94%        │ 35.92%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 3                    │ 00:18:35              │ 00:55:43                   │ 4.30%         │ 9.68%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 4                    │ 00:01:08              │ 00:04:32                   │ 0.26%         │ 0.79%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ Total                │ 07:11:33              │ 09:35:25                   │ 100.00%       │ 100.00%              │
╘══════════════════════╧═══════════════════════╧════════════════════════════╧═══════════════╧══════════════════════╛

```

**Method = conversational**
```
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 881      │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 15:32:42 │
├───────────────────────────┼──────────┤
│ mean                      │ 63.5     │
├───────────────────────────┼──────────┤
│ std                       │ 36.7     │
├───────────────────────────┼──────────┤
│ min                       │ 21.9     │
├───────────────────────────┼──────────┤
│ 25%                       │ 43.5     │
├───────────────────────────┼──────────┤
│ 50%                       │ 53.6     │
├───────────────────────────┼──────────┤
│ 75%                       │ 69.9     │
├───────────────────────────┼──────────┤
│ 99%                       │ 221.1    │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 237.6    │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 327.3    │
├───────────────────────────┼──────────┤
│ max                       │ 368.0    │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 881      │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 11024    │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═══════════════════════════╕
│ Total speech duration        │ 08:24:60 │ 54.14% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total speaking time duration │ 09:35:25 │ 61.69% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total silence duration       │ 07:07:43 │ 45.86% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Single-speaker duration      │ 07:15:21 │ 46.68% (86.21% of speech) │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Overlapped speech duration   │ 01:09:39 │ 7.47% (13.79% of speech)  │
╘══════════════════════════════╧══════════╧═══════════════════════════╛
Speech duration statistics by number of speakers:
╒══════════════════════╤═══════════════════════╤════════════════════════════╤═══════════════╤══════════════════════╕
│ Number of speakers   │ Duration (hh:mm:ss)   │ Speaking time (hh:mm:ss)   │ % of speech   │ % of speaking time   │
╞══════════════════════╪═══════════════════════╪════════════════════════════╪═══════════════╪══════════════════════╡
│ 1                    │ 07:15:21              │ 07:15:21                   │ 86.21%        │ 75.66%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 2                    │ 01:08:52              │ 02:17:44                   │ 13.64%        │ 23.93%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 3                    │ 00:00:48              │ 00:02:22                   │ 0.16%         │ 0.41%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 4                    │ 00:00:00              │ 00:00:00                   │ 0.00%         │ 0.00%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ Total                │ 08:24:60              │ 09:35:25                   │ 100.00%       │ 100.00%              │
╘══════════════════════╧═══════════════════════╧════════════════════════════╧═══════════════╧══════════════════════╛
```