# Add reverb with fast RIR generator #799

## Conversation
Some time comparisons:

```
In [14]: %timeit reverb_cut_rir(c, r)
The slowest run took 8.14 times longer than the fastest. This could mean that an intermediate result is being cached.
148 ms ± 88.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit reverb_cut(c)
179 ms ± 57.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %timeit reverb_cut(c)
138 ms ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit reverb_cut(c)
281 ms ± 121 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Based on the above, I'm keeping the torch operations with cached resamplers.
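The "cached resamplers" idea mentioned above can be sketched with a memoized factory. This is a minimal stdlib illustration, not the PR's actual code: `Resampler` and `get_resampler` are hypothetical stand-ins (in practice the cached object would be something like `torchaudio.transforms.Resample`, whose kernel construction is the expensive part):

```python
from functools import lru_cache

class Resampler:
    """Hypothetical stand-in for an expensive-to-construct resampler."""
    def __init__(self, src_sr: int, dst_sr: int):
        # In a real resampler, building the filter kernels here is costly.
        self.src_sr, self.dst_sr = src_sr, dst_sr

@lru_cache(maxsize=None)
def get_resampler(src_sr: int, dst_sr: int) -> Resampler:
    # Construct once per (src_sr, dst_sr) pair; later calls reuse the instance.
    return Resampler(src_sr, dst_sr)

r1 = get_resampler(8000, 16000)
r2 = get_resampler(8000, 16000)
assert r1 is r2  # cached: the second call pays no construction cost
```

Repeated augmentation of many cuts at the same pair of sampling rates then amortizes the setup cost across calls, which is consistent with the timings above.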
Wow, nice speed-up on the PyTorch side. Let's fix the style checks, and LGTM!
Desh, do you have any insight into how much simulated vs. real RIRs matter for tasks such as ASR or diarization? I wonder whether, given this implementation, it still makes sense to use RIR datasets. The audio sample you uploaded actually sounds nice.
I haven't run comparisons myself yet, but at least in the paper it seems to match the results using real RIRs on speech enhancement tasks. Also, older studies have found that, at least for ASR, simulated vs. real RIRs do not make much difference, so I would imagine that this method should be comparable with those. In any case, I am hoping to prepare an AMI recipe for icefall soon, and I can conduct such a comparison then.
Thanks!
## [workflow] Multi-talker meeting simulation

A new workflow is implemented that can be used to create multi-talker, meeting-like mixtures from single utterances. For context, see the discussion in #823.

### Methods

Currently, we have implemented 2 methods for simulation.

1. **Speaker-independent:** This is based on the simulation algorithm described in [the original EEND paper from Hitachi](https://arxiv.org/abs/1909.06247). The idea is to simulate each speaker's track independently of the others by concatenating utterances with pauses sampled from an exponential distribution. All tracks are then mixed together. The parameters here are the `loc` and `scale` of the exponential distribution used to sample the pauses.
2. **Conversational:** This is based on the algorithm described in [this paper from BUT](https://arxiv.org/abs/2204.00890). Instead of simulating each speaker track independently, we first sample N utterances and shuffle them. They are then mixed together by sampling pause or overlap durations from a randomly initialized gamma distribution, or from learned histograms. The parameters here are `same_spk_pause`, `diff_spk_pause`, `diff_spk_overlap`, and `prob_diff_spk_overlap`.

More simulation methods will be implemented in the future.

### Implementation

All simulation algorithms are implemented in their own classes, derived from an abstract class `BaseMeetingSimulator`. The classes should implement the methods `fit()`, `simulate()`, and `reverberate()`:

* `fit()` learns the parameters of the simulator from a provided `SupervisionSet`, or uses sensible defaults if none is provided.
* `simulate()` generates meetings. Each meeting is a `MixedCut` object, where each track represents a specific source (speaker). The track itself may be a `MonoCut` (if there is only 1 utterance for the speaker) or a `MixedCut` (if multiple utterances are concatenated with padding). The input to `simulate()` is a `CutSet` containing the single utterances to be used for simulation. The number of meetings is controlled either by the desired number of meetings to generate (`num_meetings`) or by the number of times each utterance will be used (`num_repeats`).
* `reverberate()` convolves the meeting with RIRs. Either external RIRs can be provided, or RIRs will be generated on-the-fly using a fast random approximation (see #799).

### Usage

1. Python

```python
from lhotse.workflows.meeting_simulation import ConversationalMeetingSimulator

# "sups" is a SupervisionSet for a meeting-style dataset such as AMI
# "cuts" is a CutSet containing single utterances, e.g., LibriSpeech
simulator = ConversationalMeetingSimulator()
simulator.fit(sups)
meetings = simulator.simulate(cuts, num_repeats=5)
meetings = simulator.reverberate(meetings)
```

2. CLI

```
lhotse workflows simulate-meetings cuts.jsonl.gz out_cuts.jsonl.gz -m conversational -f sups.jsonl.gz -r 5 --reverberate
```

For full CLI usage, see `lhotse workflows simulate-meetings --help`.

### Example

We generated mixtures using the LibriSpeech `dev-clean` subset (`num_repeats=2, num_speakers_per_meeting=(3,4)`), based on the distribution learned from the AMI dev set.

**Method = speaker-independent**

```
$ lhotse workflows simulate-meetings data/manifests/librispeech_cuts_trimmed_dev-clean.jsonl.gz mix_si.jsonl.gz -m independent -f ami-sdm_supervisions_dev.jsonl.gz -r 2 -s 3,4 --reverberate
Fitting the meeting simulator to the provided supervisions...
Learned parameters: loc=0.00, scale=7.77
Simulating meetings...
895it [00:02, 399.64it/s]
Reverberating the simulated meetings...
Saving the simulated meetings...
```

**Method = conversational**

```
$ lhotse workflows simulate-meetings data/manifests/librispeech_cuts_trimmed_dev-clean.jsonl.gz mix_conv.jsonl.gz -m conversational -f ami-sdm_supervisions_dev.jsonl.gz -r 2 -s 3,4 --reverberate
Fitting the meeting simulator to the provided supervisions...
Learned parameters: ConversationalMeetingSimulator (same_spk_pause=8.66, diff_spk_pause=1.55, diff_spk_overlap=1.62, prob_diff_spk_overlap=0.55)
Simulating meetings...
881it [00:02, 339.92it/s]
Reverberating the simulated meetings...
Saving the simulated meetings...
```

Statistics of generated meetings:

**Method = speaker-independent**

```
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 895      │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 13:13:55 │
├───────────────────────────┼──────────┤
│ mean                      │ 53.2     │
├───────────────────────────┼──────────┤
│ std                       │ 17.2     │
├───────────────────────────┼──────────┤
│ min                       │ 15.4     │
├───────────────────────────┼──────────┤
│ 25%                       │ 40.9     │
├───────────────────────────┼──────────┤
│ 50%                       │ 51.3     │
├───────────────────────────┼──────────┤
│ 75%                       │ 62.2     │
├───────────────────────────┼──────────┤
│ 99%                       │ 103.8    │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 111.6    │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 124.3    │
├───────────────────────────┼──────────┤
│ max                       │ 138.4    │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 895      │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 11024    │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═══════════════════════════╕
│ Total speech duration        │ 07:11:33 │ 54.36% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total speaking time duration │ 09:35:25 │ 72.48% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total silence duration       │ 06:02:22 │ 45.64% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Single-speaker duration      │ 05:08:31 │ 38.86% (71.49% of speech) │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Overlapped speech duration   │ 02:03:03 │ 15.50% (28.51% of speech) │
╘══════════════════════════════╧══════════╧═══════════════════════════╛
Speech duration statistics by number of speakers:
╒══════════════════════╤═══════════════════════╤════════════════════════════╤═══════════════╤══════════════════════╕
│ Number of speakers   │ Duration (hh:mm:ss)   │ Speaking time (hh:mm:ss)   │ % of speech   │ % of speaking time   │
╞══════════════════════╪═══════════════════════╪════════════════════════════╪═══════════════╪══════════════════════╡
│ 1                    │ 05:08:31              │ 05:08:31                   │ 71.49%        │ 53.61%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 2                    │ 01:43:20              │ 03:26:40                   │ 23.94%        │ 35.92%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 3                    │ 00:18:35              │ 00:55:43                   │ 4.30%         │ 9.68%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 4                    │ 00:01:08              │ 00:04:32                   │ 0.26%         │ 0.79%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ Total                │ 07:11:33              │ 09:35:25                   │ 100.00%       │ 100.00%              │
╘══════════════════════╧═══════════════════════╧════════════════════════════╧═══════════════╧══════════════════════╛
```

**Method = conversational**

```
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 881      │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 15:32:42 │
├───────────────────────────┼──────────┤
│ mean                      │ 63.5     │
├───────────────────────────┼──────────┤
│ std                       │ 36.7     │
├───────────────────────────┼──────────┤
│ min                       │ 21.9     │
├───────────────────────────┼──────────┤
│ 25%                       │ 43.5     │
├───────────────────────────┼──────────┤
│ 50%                       │ 53.6     │
├───────────────────────────┼──────────┤
│ 75%                       │ 69.9     │
├───────────────────────────┼──────────┤
│ 99%                       │ 221.1    │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 237.6    │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 327.3    │
├───────────────────────────┼──────────┤
│ max                       │ 368.0    │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 881      │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 11024    │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═══════════════════════════╕
│ Total speech duration        │ 08:24:60 │ 54.14% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total speaking time duration │ 09:35:25 │ 61.69% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total silence duration       │ 07:07:43 │ 45.86% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Single-speaker duration      │ 07:15:21 │ 46.68% (86.21% of speech) │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Overlapped speech duration   │ 01:09:39 │ 7.47% (13.79% of speech)  │
╘══════════════════════════════╧══════════╧═══════════════════════════╛
Speech duration statistics by number of speakers:
╒══════════════════════╤═══════════════════════╤════════════════════════════╤═══════════════╤══════════════════════╕
│ Number of speakers   │ Duration (hh:mm:ss)   │ Speaking time (hh:mm:ss)   │ % of speech   │ % of speaking time   │
╞══════════════════════╪═══════════════════════╪════════════════════════════╪═══════════════╪══════════════════════╡
│ 1                    │ 07:15:21              │ 07:15:21                   │ 86.21%        │ 75.66%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 2                    │ 01:08:52              │ 02:17:44                   │ 13.64%        │ 23.93%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 3                    │ 00:00:48              │ 00:02:22                   │ 0.16%         │ 0.41%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 4                    │ 00:00:00              │ 00:00:00                   │ 0.00%         │ 0.00%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ Total                │ 08:24:60              │ 09:35:25                   │ 100.00%       │ 100.00%              │
╘══════════════════════╧═══════════════════════╧════════════════════════════╧═══════════════╧══════════════════════╛
```
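To make the conversational method more concrete, here is a toy sketch of the offset-sampling idea using the parameter names from the description above. `sample_offsets` is a hypothetical helper, not the Lhotse API, and it samples from exponential distributions for brevity, whereas the actual simulator uses gamma distributions or learned histograms:

```python
import random

def sample_offsets(utt_durs, spks, same_spk_pause=1.0, diff_spk_pause=1.0,
                   diff_spk_overlap=1.0, prob_diff_spk_overlap=0.5):
    """Place shuffled utterances on a timeline by sampling pauses/overlaps.

    Toy sketch of the conversational method: a pause after the same speaker,
    and either a pause or an overlap (with prob_diff_spk_overlap) after a
    speaker change.
    """
    starts, t = [], 0.0          # t tracks the end of the previous utterance
    prev_spk, prev_dur = None, 0.0
    for dur, spk in zip(utt_durs, spks):
        if prev_spk is None:
            gap = 0.0            # the first utterance starts at time 0
        elif spk == prev_spk:
            gap = random.expovariate(1.0 / same_spk_pause)        # pause, same speaker
        elif random.random() < prob_diff_spk_overlap:
            # Overlap: start before the previous utterance ends, but not
            # before it starts.
            gap = -min(random.expovariate(1.0 / diff_spk_overlap), prev_dur)
        else:
            gap = random.expovariate(1.0 / diff_spk_pause)        # pause, speaker change
        t = max(0.0, t + gap)
        starts.append(t)
        t += dur
        prev_spk, prev_dur = spk, dur
    return starts
```

The returned start times could then be used to place each utterance as a track of a mixed cut; fitting would amount to estimating the four parameters from pause/overlap statistics of a real meeting corpus.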
This PR adds the functionality to synthesize a room impulse response on-the-fly for use in the `reverb_rir()` methods of `Cut` and `Recording` (see discussion in #787). The algorithm is based on this paper. Here is an audio sample: Google drive link.

Usage: basically, all the same methods still work, but they will use the random RIR generator if no RIR recording is explicitly provided.
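To illustrate the on-the-fly idea only (this is not the algorithm from the paper, just a crude stand-in), a random RIR can be approximated as exponentially decaying white noise and convolved with the dry signal. `toy_random_rir` and `apply_rir` are hypothetical names:

```python
import numpy as np

def toy_random_rir(sr: int = 16000, rt60: float = 0.5, dur: float = 0.8) -> np.ndarray:
    """Crude synthetic RIR: white noise with an exponential decay envelope.

    rt60 is the time for the response to decay by 60 dB; ln(10**3) ~= 6.908
    gives that decay rate for the amplitude envelope.
    """
    n = int(dur * sr)
    t = np.arange(n) / sr
    decay = np.exp(-6.908 * t / rt60)
    rir = np.random.randn(n) * decay
    rir[0] = 1.0                       # direct path
    return rir / np.max(np.abs(rir))   # normalize peak amplitude to 1

def apply_rir(audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    # Convolve and trim back to the original signal length.
    return np.convolve(audio, rir)[: len(audio)]
```

A real generator (like the one in this PR) shapes the decay per frequency band and models early reflections, but the synthesize-then-convolve structure is the same.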