Fix Silence Sampling Algorithm for ASR Multi-speaker Data Simulator #5897

Merged: 75 commits, Feb 12, 2023

stevehuang52
Collaborator

@stevehuang52 stevehuang52 commented Jan 31, 2023

Signed-off-by: stevehuang52 [email protected]

What does this PR do ?

Replace the previous silence insertion method with a new one that closely approximates the specified mean_silence.

Collection: [ASR]

Adding Silence in ASR Data Simulator

Requirements:

  1. Sentence durations in each session follow a negative-binomial (NB) distribution.
  2. Silence ratios across all sessions follow a Beta distribution (range [0, 1]).
  3. Per-silence lengths should follow a long-tailed distribution, so a Gamma distribution is used.
  4. Silence Uniformity: each speech sentence (overlaps combined) should be followed by some minimum silence.
  5. Silence Variability: per-silence durations and sentence durations should be approximately independent (i.e., p-value close to 0).
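One way to check requirement 5 on simulated data is to measure the correlation between each sentence duration and the silence that follows it. The sketch below assumes the check is a plain Pearson correlation; the helper `pearson_corr` and the synthetic arrays are hypothetical and are not taken from the PR.

```python
import numpy as np


def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(0)

# Hypothetical data: sentence durations and the silence that follows each one.
sentence_durs = rng.negative_binomial(3, 0.05, size=5000).astype(float)
independent_sil = rng.gamma(shape=0.5, scale=2.0, size=5000)  # drawn separately
dependent_sil = 0.1 * sentence_durs  # silence proportional to speech length

print("independent:", pearson_corr(sentence_durs, independent_sil))
print("dependent:  ", pearson_corr(sentence_durs, dependent_sil))
```

A correlation near zero is what the requirement is after. Note that zero correlation is weaker than full statistical independence, so this is only a quick sanity check.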

Parameters:

  • NUM_SESSIONS: number of sessions
  • MAX_SESS_DUR: maximum session duration
  • SAMPLING_RATE: sampling rate for audio
  • [NB_COUNT, NB_PROB]: parameters for per sentence duration distribution
  • SILENCE_RATIO_MEAN: mean for target silence ratio in all sessions, in (0,1)
  • SILENCE_RATIO_VAR: spread of the target silence ratio across sessions (described here as a std, but used as the variance in the Beta parameters of the algorithm); set small values (e.g., 0.1) for a closer approximation to the mean, or larger values (e.g., 2.0) for more diversity in silence.
  • PER_SILENCE_VAR: spread of individual silence lengths (likewise used as the variance of the Gamma distribution); defaults to 20 to achieve p-value=0.1 and de-correlate speech and silence lengths
  • [PER_SILENCE_MIN, PER_SILENCE_MAX]: min and max of per-silence duration in seconds; max=-1 for no constraint
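The Beta distribution is specified here by a mean and a spread rather than by its usual (a, b) parameters. A minimal sketch of the standard moment-matching conversion follows, assuming the spread is a variance. The function name `beta_params` is hypothetical. The validity check reflects a general property of the Beta distribution: its variance must be below mean * (1 - mean), otherwise a and b come out non-positive.

```python
def beta_params(mean, var):
    """Return Beta (a, b) whose mean and variance match the given values."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("var must lie strictly between 0 and mean * (1 - mean)")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


a, b = beta_params(0.3, 0.01)
print(a, b)

# Recover the moments from (a, b) to confirm the round trip.
print(a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1.0)))
```

Expanding `mean * common` gives `mean ** 2 * (1 - mean) / var - mean`, which is the same expression the algorithm uses for `a`.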

Algorithm:

MAX_SESSION_LEN = MAX_SESS_DUR * SAMPLING_RATE
MIN_SILENCE_LEN = PER_SILENCE_MIN * SAMPLING_RATE
MAX_SILENCE_LEN = MAX_SESSION_LEN if PER_SILENCE_MAX < 0 else min(PER_SILENCE_MAX * SAMPLING_RATE, MAX_SESSION_LEN)

sessions = []
for i in range(NUM_SESSIONS):
    curr_session = []

    curr_sess_len = 0
    curr_speech_len = 0
    curr_silence_len = 0

    a = SILENCE_RATIO_MEAN ** 2 * (1 - SILENCE_RATIO_MEAN) / SILENCE_RATIO_VAR - SILENCE_RATIO_MEAN
    b = SILENCE_RATIO_MEAN * (1 - SILENCE_RATIO_MEAN) ** 2 / SILENCE_RATIO_VAR - (1 - SILENCE_RATIO_MEAN)
    sess_silence_mean = Beta(a, b).rvs()

    while curr_sess_len < MAX_SESSION_LEN:
        speech_len = NB(NB_COUNT, NB_PROB).rvs()
        sentence = build_sentence(speech_len, curr_sess_len, MAX_SESSION_LEN)
        
        curr_session += sentence
        curr_sess_len += len(sentence)
        curr_speech_len += len(sentence)

        if curr_sess_len >= MAX_SESSION_LEN:
            break
        
        # dynamically adjust silence mean to achieve the overall mean
        silence_mean = max(1, MIN_SILENCE_LEN, (sess_silence_mean * curr_sess_len - curr_silence_len) / (1 - sess_silence_mean))

        # sampling with large std to de-correlate with previous sentence length
        silence_len = Gamma(a=silence_mean ** 2 / PER_SILENCE_VAR, scale=PER_SILENCE_VAR / silence_mean).rvs()  

        # enforce valid length
        silence_len = min(max(MIN_SILENCE_LEN, silence_len), MAX_SILENCE_LEN, MAX_SESSION_LEN - curr_sess_len)
        
        silence = add_silence(silence_len)
        
        curr_session += silence
        curr_sess_len += silence_len
        curr_silence_len += silence_len

    sessions.append(curr_session)
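
The pseudocode above can be turned into a small runnable sketch that tracks only lengths (no audio is built). It uses NumPy samplers in place of the `Beta`, `NB`, and `Gamma` objects. The function name `simulate_session`, the default values, and the `unit_sec` conversion from one NB draw to a duration are all assumptions made for illustration, not the merged NeMo code. As in the pseudocode, the Gamma draw is made in samples.

```python
import numpy as np


def simulate_session(rng, max_sess_dur=120.0, sampling_rate=16000,
                     nb_count=3, nb_prob=0.05, unit_sec=0.1,
                     sil_ratio_mean=0.3, sil_ratio_var=0.01,
                     per_sil_var=20.0, per_sil_min=0.0, per_sil_max=-1.0):
    """Return (speech_len, silence_len) in samples for one simulated session."""
    max_len = int(max_sess_dur * sampling_rate)
    min_sil = int(per_sil_min * sampling_rate)
    # per_sil_max < 0 means "no constraint" beyond the session length.
    max_sil = max_len if per_sil_max < 0 else min(int(per_sil_max * sampling_rate), max_len)

    # Beta parameters by moment matching; needs var < mean * (1 - mean).
    common = sil_ratio_mean * (1 - sil_ratio_mean) / sil_ratio_var - 1
    target = rng.beta(sil_ratio_mean * common, (1 - sil_ratio_mean) * common)

    sess_len = speech_len = sil_len = 0
    while sess_len < max_len:
        # Assumption: one NB draw is a sentence duration in units of unit_sec.
        n = int(rng.negative_binomial(nb_count, nb_prob) * unit_sec * sampling_rate)
        n = min(max(n, 1), max_len - sess_len)
        sess_len += n
        speech_len += n
        if sess_len >= max_len:
            break

        # Silence needed now to bring the running ratio back to the target.
        mean = max(1.0, min_sil, (target * sess_len - sil_len) / (1 - target))
        draw = rng.gamma(shape=mean ** 2 / per_sil_var, scale=per_sil_var / mean)

        # Enforce valid length, including the room left in the session.
        s = int(min(max(min_sil, draw), max_sil, max_len - sess_len))
        sess_len += s
        sil_len += s
    return speech_len, sil_len


rng = np.random.default_rng(0)
speech, sil = simulate_session(rng)
print(f"silence ratio: {sil / (speech + sil):.3f}")
```

The expression for the silence mean comes from solving (curr_silence_len + x) / (curr_sess_len + x) = sess_silence_mean for x, which gives x = (sess_silence_mean * curr_sess_len - curr_silence_len) / (1 - sess_silence_mean).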

Notes

  • It is harder to reach the desired silence ratio with shorter sessions
    • E.g., sess_len=20s, mean_silence=0.35, 500 hours -> actual ratio 0.24
    • E.g., sess_len=120s, mean_silence=0.3, 100 hours -> actual ratio ~0.3
  • The distribution of silence lengths is approximately exponential, and reducing the silence ratio can reduce the average silence length
    • E.g., mean_silence=0.1, std=0.05 -> mean per-silence length 0.2s
    • E.g., mean_silence=0.2, std=0.5 -> mean per-silence length 2.4s
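
One possible reading of the "approximately exponential" note: the algorithm picks the Gamma shape as mean² / var, so whenever the variance is at least mean² the shape is at most 1 and the density decreases monotonically from zero. A shape of exactly 1 is the exponential distribution. The sketch below illustrates that moment matching; the helper name `gamma_shape_scale` and the numbers are hypothetical.

```python
import numpy as np


def gamma_shape_scale(mean, var):
    """Return Gamma (shape, scale) whose mean and variance match the inputs."""
    if mean <= 0 or var <= 0:
        raise ValueError("mean and var must be positive")
    return mean ** 2 / var, var / mean


# var == mean ** 2 gives shape 1.0, i.e. an exponential with the same mean.
shape, scale = gamma_shape_scale(2.0, 4.0)

rng = np.random.default_rng(0)
samples = rng.gamma(shape, scale, size=100_000)
print(shape, scale, samples.mean(), samples.var())
```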

Signed-off-by: stevehuang52 <[email protected]>
@tango4j tango4j changed the title from "Fix Silence Insertion for ASR Data Simulator" to "Fix Silence Sampling Algorithm for ASR Multi-speaker Data Simulator" on Jan 31, 2023
stevehuang52 and others added 22 commits February 2, 2023 12:51
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
stevehuang52 and others added 18 commits February 7, 2023 16:43
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
@tango4j
Collaborator

tango4j commented Feb 10, 2023

The notebook is tested and works with no problems. The new script was missing a license template, so I added one. I will approve as soon as it passes the tests.

Collaborator

@tango4j tango4j left a comment

Notebooks and data simulation code both tested. Very nice work by stevehuang, thanks.

@stevehuang52 stevehuang52 merged commit 63f6d44 into NVIDIA:main Feb 12, 2023
titu1994 pushed a commit to titu1994/NeMo that referenced this pull request Mar 24, 2023
…ulator (NVIDIA#5897)

* fix silence insertioon

Signed-off-by: stevehuang52 <[email protected]>

* update docs and tutorial

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* change to beta annd gamma distributions

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Added silence vs overlap selector with overlap algo

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Function name change and fixes

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update silence and overlap adding algorithm for better accuracy

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Recommended range for overlap mean

Signed-off-by: Taejin Park <[email protected]>

* Changing yaml file default values

Signed-off-by: Taejin Park <[email protected]>

* Fixed typos and errors in docstrings

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed minor bugs and removed unused functions

Signed-off-by: Taejin Park <[email protected]>

* Fixed minor bugs and removed unused imports

Signed-off-by: Taejin Park <[email protected]>

* Added docstrings for newly updated overlap algos

Signed-off-by: Taejin Park <[email protected]>

* Fixed non_silence_len_samples calculation, more accurate now

Signed-off-by: Taejin Park <[email protected]>

* adding missing docstring for non_silence_len

Signed-off-by: Taejin Park <[email protected]>

* removed ipdb lines

Signed-off-by: Taejin Park <[email protected]>

* refactor and update

Signed-off-by: stevehuang52 <[email protected]>

* updated logs for v1.1

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Argument check update for mean=0 var=0 case

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* update silence/overlap mean clipping

Signed-off-by: stevehuang52 <[email protected]>

* Adding mean clipping

Signed-off-by: Taejin Park <[email protected]>

* added 0 handling for ovl/sim_mean

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Tested on fisher and fixed the bug with string-speaker ID

Signed-off-by: Taejin Park <[email protected]>

* update code for visualization

Signed-off-by: stevehuang52 <[email protected]>

* refactor

Signed-off-by: stevehuang52 <[email protected]>

* fix load_rttm

Signed-off-by: stevehuang52 <[email protected]>

* Adding docstrings

Signed-off-by: Taejin Park <[email protected]>

* Adding usage in the analysis script

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix filename

Signed-off-by: stevehuang52 <[email protected]>

* Added argument check for sentence length params

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed unnecessary NB torch sampling

Signed-off-by: Taejin Park <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add build_synthetic_vad_manifest.py

Signed-off-by: stevehuang52 <[email protected]>

* add check for non rttm files

Signed-off-by: stevehuang52 <[email protected]>

* added docstrings

Signed-off-by: Taejin Park <[email protected]>

* typo is fixed

Signed-off-by: Taejin Park <[email protected]>

* License template was missing, added

Signed-off-by: Taejin Park <[email protected]>

* add missing copyright and move script

Signed-off-by: stevehuang52 <[email protected]>

* add missing comma

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>