update on Som comments
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Nithin Rao Koluguri committed Feb 9, 2024
1 parent c92549b commit bedf0d7
Showing 1 changed file with 9 additions and 9 deletions.
18 changes: 9 additions & 9 deletions tutorials/asr/Transducers_with_HF_Datasets.ipynb
@@ -43,12 +43,12 @@
"source": [
"# Automatic Speech Recognition with Transducer Models using HF Datasets\n",
"\n",
"This notebook is a basic tutorial for creating a Transducer ASR model and training it on a small dataset from HF. \n",
"We have discussed training various ASR models in NeMo using custom datasets, either for fine-tuning or for scratch-training. In this tutorial, we will showcase how to use Hugging Face datasets library in order to finetune a Transducer ASR model on a small dataset from for the Telugu language. \n",
"It includes discussions relevant to preparing datasets with HF and how to use them to finetune NeMo models. The same method applies to training from scratch. However, for training, we recommend using scripts directly from the `examples/asr/` folder.\n",
"\n",
"In this tutorial, we demonstrate the usage of HF datasets for the Telugu language, where we use the Fluers dataset for training, validation, and testing. However, the same procedure can be used for other languages or domains and finetuned for specific use cases accordingly. \n",
"\n",
"For scripts, refer to `speech_to_text_finetune.py` for training from scratch. \n",
"For scripts, refer to [speech_to_text_finetune.py]('https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_to_text_finetune.py') for training from scratch. \n",
"\n",
"--------\n",
"\n",
@@ -125,7 +125,7 @@
"source": [
"Since we are finetuning Parakeet model, which is an English language model, we need to update the tokenizer and update the decoder to support the new language. \n",
"\n",
"First, we will extract text transcriptions from the dataset and use them to train a tokenizer. We will use the scripts from NeMo first to get the data from HF dataset using `get_hf_dataset.py` script then secondly use `process_asr_text_tokenizer.py` to prepare the tokenizer from [scripts](https://github.com/NVIDIA/NeMo/tree/main/scripts/tokenizers) folder. \n"
"First, we will extract text transcriptions from the dataset and use them to train a tokenizer. We will use the scripts from NeMo first to get the data from HF dataset using `get_hf_dataset.py` script. Next we use `process_asr_text_tokenizer.py` script to prepare the tokenizer from [scripts](https://github.com/NVIDIA/NeMo/tree/main/scripts/tokenizers) folder. \n"
]
},
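As a rough sketch of what this first data-pull step produces (the corpus file name and the `transcription` field are illustrative assumptions; this is not the NeMo script itself):

```python
from datasets import load_dataset

ds = load_dataset("google/fleurs", "te_in", split="train")

# Write one transcript per line; this plain-text corpus is what the
# tokenizer-training script consumes.
with open("telugu_train_corpus.txt", "w", encoding="utf-8") as f:
    for example in ds:
        f.write(example["transcription"].strip() + "\n")
```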
{
@@ -152,7 +152,7 @@
"metadata": {},
"source": [
"Major difference from NeMo dataset configs to training using HF-datasets is for using hf-datasets, users need to mention the hf-dataset information through hf data config and pass to the script for downloading necessary data. Users can switch to another dataset by changing necessary fields in the hf data config. \n",
"Lets create that config here."
"Let's create that config here."
]
},
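A minimal sketch of such an HF data config, built with OmegaConf; the field names (`path`, `name`, `split`, `streaming`) are assumptions about the schema, so check the actual config file for the exact keys:

```python
from omegaconf import OmegaConf

# Field names are illustrative; the NeMo script defines the real schema.
hf_data_cfg = OmegaConf.create({
    "path": "google/fleurs",  # HF Hub dataset id
    "name": "te_in",          # dataset config: Telugu (India)
    "split": "train",
    "streaming": False,       # set True to avoid a full download
})
print(OmegaConf.to_yaml(hf_data_cfg))
```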
{
@@ -176,7 +176,7 @@
"source": [
"Since we need clean data for training tokenizer and models, we need to filter the data based on how dataset was constructed and how we would like the ASR model output to be. \n",
"\n",
"Based on prior analysis of text transcripts of the current hf dataset, we skip all non alphanumeric characters except . using `normalize_text` option to the `get_hf_text_data.py` script, based on `huggingface_data_tokenizer.yaml` config file. "
"Based on prior analysis of text transcripts of the current hf dataset, we skip all non-alphanumeric characters except `full-stop` using `normalize_text` option of the `get_hf_text_data.py` script, based on `huggingface_data_tokenizer.yaml` config file. "
]
},
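A rough Python approximation of this kind of normalization (the actual `normalize_text` behavior is defined by the NeMo script; this sketch just shows the idea):

```python
import re

def normalize_text(text: str) -> str:
    """Keep letters, digits, whitespace, and full stops; drop everything else."""
    text = re.sub(r"[^\w\s.]", "", text)  # \w matches Telugu script under Unicode
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("అది నిజం, కదా?"))  # -> "అది నిజం కదా"
```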
{
@@ -204,7 +204,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As can be seen from the above command, we have used the `huggingface_data_tokenizer.yaml` config file to download the data from HF dataset. The download data is saved to `telugu_train_corpus.txt` file, which we will use to train the tokenizer. Before that lets look at some utterances from the normalized (preprocessed) text file."
"From the above command, we were able to use the `huggingface_data_tokenizer.yaml` config file to download the data from HF dataset. The download data is saved to `telugu_train_corpus.txt` file, which we will use to train the tokenizer. Before that, let's look at some utterances from the normalized (preprocessed) text file."
]
},
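Roughly, inspecting the corpus and training the tokenizer could look like the sketch below; the direct `sentencepiece` call stands in for `process_asr_text_tokenizer.py` (which wraps it, among other things), and the vocabulary size is an arbitrary assumption:

```python
import sentencepiece as spm

# Peek at a few normalized utterances from the downloaded corpus.
with open("telugu_train_corpus.txt", encoding="utf-8") as f:
    for _, line in zip(range(5), f):
        print(line.strip())

# Train a BPE tokenizer on the corpus (stand-in for process_asr_text_tokenizer.py).
spm.SentencePieceTrainer.train(
    input="telugu_train_corpus.txt",
    model_prefix="telugu_tokenizer",  # writes telugu_tokenizer.model / .vocab
    vocab_size=1024,                  # illustrative size, not the tutorial's value
    model_type="bpe",
    character_coverage=1.0,           # keep every Telugu character
)
```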
{
@@ -307,9 +307,9 @@
},
"source": [
"## Prepare the config \n",
"For finetuning the model, we need to update the config file to include the tokenizer and the dataset information. We will use the `parakeet_rnnt_6b` model and update the config file to include the tokenizer and the dataset information. For this we use `speech_to_text_hf_finetune.yaml` config file, and training script `speech_to_text_finetune.py` from the `examples/asr` folder.\n",
"For finetuning the model, we need to update the config file to include the tokenizer and the dataset information. We will use the `parakeet-rnnt-6b` model and update the config file to include the tokenizer and the dataset information. For this we use `speech_to_text_hf_finetune.yaml` config file, and training script `speech_to_text_finetune.py` from the `examples/asr` folder.\n",
"\n",
"For this demo, we shall strip the script to only include the necessary components for training the model on single GPU, however we recommend users to use the scripts directly from the `examples/asr` folder for training the model on multiple GPUs."
"For this demo, we will replicate only a portion of the script - in order to show just the necessary components for training the model on single GPU. However, we recommend users to use the scripts directly from the `examples/asr` folder for training the model on multiple GPUs."
]
},
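Conceptually, the model-preparation part of that script does something like the sketch below; the checkpoint name is the one this tutorial mentions (its exact published id is not verified here), and the tokenizer directory is a placeholder:

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained English checkpoint named in this tutorial (name assumed).
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-rnnt-6b")

# Swap the English tokenizer for the newly trained Telugu one so the
# decoder vocabulary matches the new language.
asr_model.change_vocabulary(
    new_tokenizer_dir="telugu_tokenizer_dir",  # placeholder dir with tokenizer.model
    new_tokenizer_type="bpe",
)
```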
{
@@ -541,7 +541,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can reduce number of steps to 5000 for this smaller dataset.\n",
"We can reduce the number of steps to 5000 for this small dataset.\n",
"and also update precision to float16 for faster training. # Change this bf16 training on Ampere based GPUs. "
]
},
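A hedged sketch of the corresponding single-GPU trainer setup (PyTorch Lightning argument values shown are illustrative, not the tutorial's exact config):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=1,
    accelerator="gpu",
    max_steps=5000,        # reduced step count for the small dataset
    precision="16-mixed",  # use "bf16-mixed" on Ampere or newer GPUs
    logger=False,
    enable_checkpointing=False,
)
```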