Update fvad doc #6920

Merged: 6 commits, Jun 26, 2023
8 changes: 4 additions & 4 deletions examples/asr/conf/vad/frame_vad_infer_postprocess.yaml
```diff
@@ -21,10 +21,10 @@ vad:
 postprocessing:
   onset: 0.3 # onset threshold for detecting the beginning and end of a speech
   offset: 0.3 # offset threshold for detecting the end of a speech.
-  pad_onset: 0.5 # adding durations before each speech segment
-  pad_offset: 0.5 # adding durations after each speech segment
-  min_duration_on: 0.0 # threshold for short speech deletion
-  min_duration_off: 0.6 # threshold for short non-speech segment deletion
+  pad_onset: 0.2 # adding durations before each speech segment
+  pad_offset: 0.2 # adding durations after each speech segment
+  min_duration_on: 0.2 # threshold for short speech deletion
+  min_duration_off: 0.2 # threshold for short non-speech segment deletion
   filter_speech_first: True

 prepared_manifest_vad_input: null # if not specify, it will automatically generated be "manifest_vad_input.json"
```
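For context on what these knobs control: `onset`/`offset` are the probability thresholds that open and close a speech segment, `pad_onset`/`pad_offset` extend each detected segment, and `min_duration_on`/`min_duration_off` drop overly short speech and non-speech stretches. The sketch below is a simplified illustration of that binarize-then-filter idea, not the actual implementation in `vad_utils.py`; the function name, the `probs` input, and the `frame_dur` argument are invented for the example.

```python
def postprocess_frame_probs(
    probs,
    frame_dur=0.02,
    onset=0.3,
    offset=0.3,
    pad_onset=0.2,
    pad_offset=0.2,
    min_duration_on=0.2,
    min_duration_off=0.2,
):
    """Turn per-frame speech probabilities into (start, end) segments in seconds (illustrative only)."""
    segments, start, in_speech = [], 0.0, False
    for i, p in enumerate(probs):
        t = i * frame_dur
        if not in_speech and p >= onset:      # segment opens when prob crosses `onset`
            start, in_speech = t, True
        elif in_speech and p < offset:        # segment closes when prob falls below `offset`
            segments.append([start, t])
            in_speech = False
    if in_speech:
        segments.append([start, len(probs) * frame_dur])

    # pad each segment, then merge segments separated by gaps shorter than min_duration_off
    merged = []
    for s, e in segments:
        s, e = max(0.0, s - pad_onset), e + pad_offset
        if merged and s - merged[-1][1] < min_duration_off:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])

    # finally drop speech segments shorter than min_duration_on
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Example: two short bursts of speech separated by a brief gap end up merged into one segment.
probs = [0.1, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.1]
print(postprocess_frame_probs(probs, frame_dur=0.1))
```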
17 changes: 17 additions & 0 deletions examples/asr/speech_classification/README.md
````diff
@@ -86,3 +86,20 @@
 {"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
 {"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
 ```
+
+
+## Visualization
+
+To visualize the VAD outputs, you can use the `nemo.collections.asr.parts.utils.vad_utils.plot_sample_from_rttm` function, which takes an audio file and an RTTM file as input, and plots the audio waveform and the VAD labels. Since the VAD inference script will output a json manifest `manifest_vad_out.json` by default, you can create a Jupyter Notebook with the following script and fill in the paths using the output manifest:
+```python
+from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm
+
+plot_sample_from_rttm(
+    audio_file="/path/to/audio_file.wav",
+    rttm_file="/path/to/rttm_file.rttm",
+    offset=0.0,
+    duration=1000,
+    save_path="vad_pred.png"
+)
+```
+
````

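To plot every sample from the inference output rather than a single file, a small loop over the output manifest might look like the sketch below. This assumes, without guarantee, that each line of `manifest_vad_out.json` is a JSON dict carrying `audio_filepath` and `rttm_filepath` keys; adjust the key names to whatever your manifest actually contains.

```python
import json
from pathlib import Path

from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

# Assumption: each line of the output manifest holds "audio_filepath" and
# "rttm_filepath" fields; rename the keys here if your manifest differs.
with open("manifest_vad_out.json") as f:
    for line in f:
        entry = json.loads(line)
        plot_sample_from_rttm(
            audio_file=entry["audio_filepath"],
            rttm_file=entry["rttm_filepath"],
            offset=0.0,
            duration=1000,
            save_path=f"vad_pred_{Path(entry['audio_filepath']).stem}.png",
        )
```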
6 changes: 3 additions & 3 deletions nemo/collections/asr/parts/utils/vad_utils.py
```diff
@@ -1648,7 +1648,7 @@ def frame_vad_infer_load_manifest(cfg: DictConfig):
         manifest_orig.append(entry)

         # always prefer RTTM labels if exist
-        if "label" not in entry or "rttm_filepath" in entry or "rttm_file" in entry:
+        if "label" not in entry and ("rttm_filepath" in entry or "rttm_file" in entry):
             rttm_key = "rttm_filepath" if "rttm_filepath" in entry else "rttm_file"
             segments = load_speech_segments_from_rttm(entry[rttm_key])
             label_str = get_frame_labels(
@@ -1661,8 +1661,8 @@ def frame_vad_infer_load_manifest(cfg: DictConfig):
             key_labels_map[uniq_audio_name] = [float(x) for x in label_str.split()]
         elif entry.get("label", None) is not None:
             key_labels_map[uniq_audio_name] = [float(x) for x in entry["label"].split()]
-        else:
-            raise ValueError("Must have either `label` or `rttm_filepath` in manifest")
+        elif cfg.evaluate:
+            raise ValueError("Must have either `label` or `rttm_filepath` in manifest when evaluate=True")

     return manifest_orig, key_labels_map, key_rttm_map
```

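To spell out the behavior change: with the new condition, an entry's RTTM file is used when the entry carries no inline `label`, an inline `label` string is used when present, and a manifest entry with neither only triggers the error when `cfg.evaluate` is set. A toy sketch of that decision order, separate from the NeMo code, follows; `resolve_ground_truth` is a made-up helper for illustration.

```python
# Toy illustration of the corrected precedence (not the actual NeMo function):
# RTTM is used when the entry has no inline "label", an inline "label" is used
# otherwise, and missing ground truth only raises when evaluation is requested.
def resolve_ground_truth(entry: dict, evaluate: bool):
    has_rttm = "rttm_filepath" in entry or "rttm_file" in entry
    if "label" not in entry and has_rttm:
        rttm_key = "rttm_filepath" if "rttm_filepath" in entry else "rttm_file"
        return ("rttm", entry[rttm_key])
    if entry.get("label") is not None:
        return ("label", entry["label"])
    if evaluate:
        raise ValueError("Must have either `label` or `rttm_filepath` in manifest when evaluate=True")
    return ("none", None)  # plain inference: no ground truth is fine


# Example: an inference-only entry without labels no longer raises.
print(resolve_ground_truth({"audio_filepath": "a.wav", "duration": 10.0}, evaluate=False))
```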