Fix typos (#7361)
* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>
omahs authored and yaoyu-33 committed Sep 5, 2023
1 parent a47c1e3 commit 888ef06
Showing 9 changed files with 19 additions and 19 deletions.
4 changes: 2 additions & 2 deletions examples/asr/asr_adapters/README.md
@@ -8,7 +8,7 @@ For further discussion of what are adapters, how they are trained and how are th

Using the `train_asr_adapter.py` script, you can provide the path to a pre-trained model, a config to define and add an adapter module to this pre-trained model, some information to setup datasets for training / validation - and then easily add any number of adapter modules to this network.

**Note**: In order to train multiple adapters on a single model, provide the `model.nemo_model` (in the config) to be a previously adapted model ! Ensure that you use a new unique `model.adapter.adapter_name` in the config.
**Note**: In order to train multiple adapters on a single model, provide the `model.nemo_model` (in the config) to be a previously adapted model! Ensure that you use a new unique `model.adapter.adapter_name` in the config.

## Training execution flow diagram

@@ -57,7 +57,7 @@ graph TD
Ho --> I["trainer.test(model)"]
```

**Note**: If you with to evaluate the base model (with all adapters disabled), simply pass `model.adapter.adapter_name=null` to the config of this script to disable all adapters and evaluate just the base model.
**Note**: If you wish to evaluate the base model (with all adapters disabled), simply pass `model.adapter.adapter_name=null` to the config of this script to disable all adapters and evaluate just the base model.

# Scoring and Analysis of Results

2 changes: 1 addition & 1 deletion examples/asr/asr_chunked_inference/README.md
@@ -4,7 +4,7 @@ Contained within this directory are scripts to perform streaming or buffered inf

## Difference between streaming and buffered ASR

While we primarily showcase the defalts of these models in buffering mode, note that the major difference between streaming ASR and buffered ASR is the chunk size and the total context buffer size.
While we primarily showcase the defaults of these models in buffering mode, note that the major difference between streaming ASR and buffered ASR is the chunk size and the total context buffer size.

If you reduce your chunk size, the latency for your first prediction is reduced, and the model appears to predict the text with shorter delay. On the other hand, since the amount of information in the chunk is reduced, it causes higher WER.

4 changes: 2 additions & 2 deletions examples/slu/speech_intent_slot/README.md
@@ -8,7 +8,7 @@ This example shows how to train an end-to-end model for spoken language understa
We present the main results of our models in the following table.
| | | | **Intent (Scenario_Action)** | | **Entity** | | | **SLURP Metrics** | |
|--------------------------------------------------|----------------|--------------------------|------------------------------|---------------|------------|--------|--------------|-------------------|---------------------|
| **Model** | **Params (M)** | **Pretrained** | **Accuracy** | **Precision** | **Recall** | **F1** | **Precsion** | **Recall** | **F1** |
| **Model** | **Params (M)** | **Pretrained** | **Accuracy** | **Precision** | **Recall** | **F1** | **Precision** | **Recall** | **F1** |
| NeMo-Conformer-Transformer-Large | 127 | NeMo ASR-Set 3.0 | 90.14 | 86.46 | 82.29 | 84.33 | 84.31 | 80.33 | 82.27 |
| NeMo-Conformer-Transformer-Large | 127 | NeMo SSL-LL60kh | 89.04 | 73.19 | 71.8 | 72.49 | 77.9 | 76.65 | 77.22 |
| NeMo-Conformer-Transformer-Large | 127 | None | 72.56 | 43.19 | 43.5 | 43.34 | 53.59 | 53.92 | 53.76 |
@@ -125,4 +125,4 @@ The pretrained models and directions on how to use them are available [here](htt
[7] [Libri-Light: A Benchmark for ASR with Limited or No Supervision](https://arxiv.org/abs/1912.07875)

## Acknowledgments
The evaluation code is borrowed from the official [SLURP package](https://github.com/pswietojanski/slurp/tree/master/scripts/evaluation), and some data processing code is adapted from [SpeechBrain SLURP Recipe](https://github.com/speechbrain/speechbrain/tree/develop/recipes/SLURP).
The evaluation code is borrowed from the official [SLURP package](https://github.com/pswietojanski/slurp/tree/master/scripts/evaluation), and some data processing code is adapted from [SpeechBrain SLURP Recipe](https://github.com/speechbrain/speechbrain/tree/develop/recipes/SLURP).
2 changes: 1 addition & 1 deletion examples/speaker_tasks/README.md
@@ -2,7 +2,7 @@ Speaker tasks in general are broadly classified into two tasks:
- [Speaker Recognition](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html)
- [Speaker Diarization](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html)

**Speaker Recognition** is a research area which solves two major tasks: speaker identification (what is the identity of the speaker?) and speaker verification (is the speaker who they claim to be?). where as **Speaker Diarization** is a task segmenting audio recordings by speaker labels (Who Speaks When?).
**Speaker Recognition** is a research area which solves two major tasks: speaker identification (what is the identity of the speaker?) and speaker verification (is the speaker who they claim to be?). Whereas **Speaker Diarization** is a task segmenting audio recordings by speaker labels (Who Speaks When?).

In *recognition* folder we provide scripts for training, inference and verification of audio samples.
In *diarization* folder we provide scripts for inference of speaker diarization using pretrained VAD (optional) and Speaker embedding extractor models
2 changes: 1 addition & 1 deletion examples/speaker_tasks/diarization/README.md
@@ -114,7 +114,7 @@ Mandatory fields are `audio_filepath`, `offset`, `duration`, `label:"infer"` and

Some of important options in config file:

- **`diarizer.vad.model_path`: voice activity detection modle name or path to the model**
- **`diarizer.vad.model_path`: voice activity detection model name or path to the model**

Specify the name of VAD model, then the script will download the model from NGC. Currently, we have 'vad_multilingual_marblenet', 'vad_marblenet' and 'vad_telephony_marblenet' as options for VAD models.

4 changes: 2 additions & 2 deletions tools/nemo_forced_aligner/README.md
@@ -13,7 +13,7 @@ NFA is a tool for generating token-, word- and segment-level timestamps of speec

## Quickstart
1. Install [NeMo](https://github.com/NVIDIA/NeMo#installation).
2. Prepare a NeMo-style manifest containing the paths of audio files you would like to proces, and (optionally) their text.
2. Prepare a NeMo-style manifest containing the paths of audio files you would like to process, and (optionally) their text.
3. Run NFA's `align.py` script with the desired config, e.g.:
``` bash
python <path_to_NeMo>/tools/nemo_forced_aligner/align.py \
@@ -27,4 +27,4 @@ NFA is a tool for generating token-, word- and segment-level timestamps of speec
</p>

## Documentation
More documentation is available [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/nemo_forced_aligner.html).
More documentation is available [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/nemo_forced_aligner.html).
4 changes: 2 additions & 2 deletions tools/nmt_grpc_service/README.md
@@ -10,7 +10,7 @@ python server.py --model models/en-es.nemo --model models/en-de.nemo --model mod

If working with the outputs of a speech recognition system without punctuation and capitalization, you can provide the path to a .nemo model file that performs punctuation and capitalization ex: https://ngc.nvidia.com/catalog/models/nvidia:nemo:punctuation_en_bert via the `--punctuation_model` flag.

NOTE: The server will throw an error if NMT models do not have have src_language and tgt_language attributes.
NOTE: The server will throw an error if NMT models do not have src_language and tgt_language attributes.

## Notes

@@ -50,7 +50,7 @@ This will start a Riva Speech Recognition service and `nvidia-smi` should show `

Start the NeMo translation server using instructions in the previous section (with or without a punctuation and capitalization model).

Run the cascade client using a single channel audio wav file specifying the target language to translate into. By default, Riva ASR is in Englisha and so we specify only the target language to translate into.
Run the cascade client using a single channel audio wav file specifying the target language to translate into. By default, Riva ASR is in English and so we specify only the target language to translate into.

```bash
python asr_nmt_client.py --audio-file recording.mono.wav --asr_punctuation --target_language de
8 changes: 4 additions & 4 deletions tools/rir_corpus_generator/README.md
@@ -67,7 +67,7 @@ Each directory, e.g, `{train, dev, test}`, corresponds to a subset of data and c
- `num_sources`: number of sources simulated in this room
- `rir_rt60_theory`: Theoretically calculated RT60
- `rir_rt60_measured`: RT60 calculated from the generated RIRs, list with `num_sources` elements
- `mic_positions`: microphone postions in the room, list with `num_mics` elements
- `mic_positions`: microphone positions in the room, list with `num_mics` elements
- `mic_center`: center of the microphone array
- `source_position`: position of each source, list with `num_source` elements
- `source_distance`: distance of each source to microphone center, list with `num_source` elements
@@ -124,7 +124,7 @@ flowchart TD
G --> |Statistics| J[*.png]
```

Microphone signals are constructed by mixing target, backgoround noise and interference signal. This is illustrated in the following diagram for an example with two interfering sources:
Microphone signals are constructed by mixing target, background noise and interference signal. This is illustrated in the following diagram for an example with two interfering sources:

```mermaid
flowchart LR
@@ -186,7 +186,7 @@ OUTPUT_DIR
+--{train, dev, test}_info.png
```

Each directory, e.g, `{train, dev, test}`, corresponds to a subset of data and contain the `*.wav` files with the generated audio signals. Corresponding `*_manifest.json` files contain metadata for each subset. Each row corresponds to a single example/set of `*.wav` files and includes the following fields
Each directory, e.g, `{train, dev, test}`, corresponds to a subset of data and contains the `*.wav` files with the generated audio signals. Corresponding `*_manifest.json` files contain metadata for each subset. Each row corresponds to a single example/set of `*.wav` files and includes the following fields

- `audio_filepath`: path to the mic file
- `{tag}_filepath`: path to the corresponding signal component, such as `noise` or `interference`
@@ -202,4 +202,4 @@ Each directory, e.g, `{train, dev, test}`, corresponds to a subset of data and c

1. R. Scheibler, E. Bezzam, I. Dokmanić, Pyroomacoustics: A Python package for audio room simulations and array processing algorithms, Proc. IEEE ICASSP, Calgary, CA, 2018.

2. J. Eaton, N. D. Gaubitch, A. H. Moore, P. A. Naylor, The ACE challenge: Corpus description and performance evaluation, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2015
2. J. Eaton, N. D. Gaubitch, A. H. Moore, P. A. Naylor, The ACE challenge: Corpus description and performance evaluation, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2015
8 changes: 4 additions & 4 deletions tutorials/asr/README.md
@@ -8,19 +8,19 @@ In this repository, you will find several tutorials discussing what is Automatic

# Automatic Speech Recognition

1) `ASR_with_NeMo`: Discussion of the task of ASR, handling of data, understanding the acoutic features, using an Acoustic Model and train on an ASR dataset, and finally evaluating the model's performance.
1) `ASR_with_NeMo`: Discussion of the task of ASR, handling of data, understanding the acoustic features, using an Acoustic Model and train on an ASR dataset, and finally evaluating the model's performance.

2) `ASR_with_Subword_Tokenization`: Modern ASR models benefit from several improvements in neural network design and data processing. In this tutorial we discuss how we can use Tokenizers (commonly found in NLP) to significantly improve the efficiency of ASR models without sacrificing any accuracy during transcription.

3) `ASR_CTC_Language_Finetuning`: Until now, we have discussed how to train ASR models from scratch. Once we get pretrained ASR models, we can then fine-tune them on domain specific use cases, or even other languages ! This notebook discusses how to fine-tune an English ASR model onto another language, and discusses several methods to improve the efficiency of tranfer learning.
3) `ASR_CTC_Language_Finetuning`: Until now, we have discussed how to train ASR models from scratch. Once we get pretrained ASR models, we can then fine-tune them on domain specific use cases, or even other languages! This notebook discusses how to fine-tune an English ASR model onto another language, and discusses several methods to improve the efficiency of transfer learning.

4) `Online_ASR_Microphone_Demo`: A short notebook that enables us to speak into a microphone and transcribe speech in an online manner. Note that this is not the most efficient way to perform streaming ASR, and it is more of a demo.

5) `Offline_ASR`: ASR models are able to transcribe speech to text, however that text might be inaccurate. Here, we discuss how to leverage external language models build with KenLM to improve the accuracy of ASR transcriptions. Further, we discuss how we can extract time stamps from an ASR model with some heuristics.

6) `ASR_for_telephony_speech`: Audio sources are not homogenous, nor are the ways to store large audio datasets. Here, we discuss our observations and recommendations when working with audio obtained from Telephony speech sources.

7) `Online_Noise_Augmentation`: While academic datasets are useful for training ASR model, there can often be cases where such datasets are prestine and dont really represent the use case in the real world. So we discuss how to make the model more noise robust with Online audio augmentation.
7) `Online_Noise_Augmentation`: While academic datasets are useful for training ASR model, there can often be cases where such datasets are pristine and don't really represent the use case in the real world. So we discuss how to make the model more noise robust with Online audio augmentation.

8) `Intro_to_Transducers`: Previous tutorials discuss ASR models in context of the Connectionist Temporal Classification Loss. In this tutorial, we introduce the Transducer loss, and the components of this loss function that are constructed in the config file. This tutorial is a prerequisite to the `ASR_with_Transducers` tutorial.

@@ -37,7 +37,7 @@ In this repository, you will find several tutorials discussing what is Automatic

----------------

# Automatic Speech Recogntion with Adapters
# Automatic Speech Recognition with Adapters

Please refer to the `asr_adapter` sub-folder which contains tutorials on the use of `Adapter` modules to perform domain adaptation on ASR models, as well as its sub-domains.


