Add SpeechLM to main #8741

Merged: 121 commits merged into main on May 11, 2024

Commits (121)
d0fd89b
update package info
ericharper Jul 3, 2023
9f117ee
fix the mpt chatbot (#6957)
yidong72 Jul 3, 2023
61ac56c
Remove `compute_on_step` from metrics (#6979)
titu1994 Jul 5, 2023
b5cb783
Hybrid conformer export (#6983)
borisfom Jul 7, 2023
f08cb21
Cache handling without input tensors mutation (#6980)
borisfom Jul 7, 2023
cdf354c
fixes for spellmapper (#6994)
bene-ges Jul 9, 2023
66a85c9
Fixing an issue with confidence ensembles (#6987)
Kipok Jul 10, 2023
de52d95
[TTS] Append pretrained FastPitch & SpectrogamEnhancer pair to availa…
racoiaws Jul 11, 2023
7ff8b08
Add ASR with TTS Tutorial. Fix enhancer usage. (#6955)
artbataev Jul 12, 2023
1647612
install_bs (#7019)
karpnv Jul 13, 2023
fac7103
fix tab text gen (#7022)
yidong72 Jul 13, 2023
49b1aea
TE bug fix (#7027)
dimapihtar Jul 14, 2023
6db4097
Add support for Numba FP16 RNNT Loss (#6991) (#7038)
titu1994 Jul 17, 2023
9af84fb
Remove pyyaml (#7052)
titu1994 Jul 17, 2023
08b26b5
Fix typo and branch in tutorial (#7048)
artbataev Jul 18, 2023
8f3957f
Refined export_config (#7053)
borisfom Jul 18, 2023
c90625e
fix pos id - hf update (#7075)
ekmb Jul 19, 2023
41678cc
Fix documentation for Numba (#7065)
titu1994 Jul 19, 2023
192aa06
small Bugfix (#7079)
fayejf Jul 20, 2023
9ba5277
Fix caching bug in causal convolutions for cache-aware ASR models (#7…
VahidooX Jul 20, 2023
6e9df3d
Adding docs and models for multiple lookahead cache-aware ASR (#7067)
VahidooX Jul 21, 2023
5807931
fix syntax error introduced in PR-7079 (#7102)
bene-ges Jul 25, 2023
511bb75
fix links for TN (#7117)
ekmb Jul 27, 2023
d88c137
Add updated fc ctc and rnnt xxl models (#7128)
nithinraok Jul 29, 2023
8961cdf
update branch (#7135)
ericharper Jul 31, 2023
a5e02a8
Fixed main and merging this to r1.20 (#7127)
tango4j Jul 31, 2023
31c5b3d
fix default attention size (#7141)
nithinraok Aug 1, 2023
d5d600d
Update evaluator.py (#7151)
stevehuang52 Aug 2, 2023
2baef81
Eagerly accumulate embedding grads into fp32 buffer (#6958)
timmoon10 Aug 2, 2023
215cb9d
Modular SpeechLLM implementation for Sept. 2023 submission (SALM) (#7…
zhehuaichen Oct 9, 2023
eed65b7
Add few-shot in-context learning and MLP modality adapter (#7705)
stevehuang52 Oct 13, 2023
2ce0592
update for mlp modality adapter (#7715)
stevehuang52 Oct 13, 2023
8a48480
fix speechllm few-shot inference (#7732)
stevehuang52 Oct 16, 2023
e0b88f2
Add training support for multiple audios in a sample (#7796)
stevehuang52 Oct 24, 2023
209f752
Create README.md
stevehuang52 Jan 12, 2024
01dd0d6
Update README.md
stevehuang52 Jan 12, 2024
528d1bf
Update README.md
stevehuang52 Jan 12, 2024
d94f9dd
update
stevehuang52 Jan 12, 2024
dbad4ac
rename
stevehuang52 Jan 12, 2024
73736ad
update and refactor
stevehuang52 Jan 15, 2024
94bd346
Update SpeechLLM code (#8475)
stevehuang52 Feb 21, 2024
8afd277
Update README.md
stevehuang52 Feb 21, 2024
78c1e8e
update speechllm (#8486)
stevehuang52 Feb 22, 2024
2e74cd1
clean up
stevehuang52 Feb 22, 2024
5ff28a1
update doc and infer
stevehuang52 Feb 23, 2024
e1e825f
update doc
stevehuang52 Feb 23, 2024
99fb448
update doc
stevehuang52 Feb 23, 2024
446c6d9
update doc
stevehuang52 Feb 23, 2024
3d78dd7
update doc
stevehuang52 Feb 23, 2024
0916850
minor update
stevehuang52 Mar 18, 2024
db542b4
fix import
stevehuang52 Mar 18, 2024
fe7214b
clean up
stevehuang52 Mar 18, 2024
916324e
clean up
stevehuang52 Mar 20, 2024
98f86b5
fix pretrained info
stevehuang52 Mar 20, 2024
555a007
Merge remote-tracking branch 'origin/main' into heh/modular_speechllm_pr
stevehuang52 Mar 22, 2024
8f524e3
Merge remote-tracking branch 'origin/main' into heh/modular_speechllm_pr
stevehuang52 Mar 22, 2024
619d75d
update dockerfile
stevehuang52 Mar 22, 2024
c3ca938
update for merging main
stevehuang52 Mar 25, 2024
76db149
fix for merge main
stevehuang52 Mar 25, 2024
f7afea1
Merge remote-tracking branch 'origin/main' into heh/modular_speechllm_pr
stevehuang52 Mar 25, 2024
c99ad43
clean up docs
stevehuang52 Mar 25, 2024
7c9ded7
clean up
stevehuang52 Mar 25, 2024
4c4ac20
clean up
stevehuang52 Mar 25, 2024
afbc212
clean up
stevehuang52 Mar 25, 2024
6bce450
refactor
stevehuang52 Mar 25, 2024
b3f6156
clean up
stevehuang52 Mar 25, 2024
f63b8b8
update
stevehuang52 Mar 25, 2024
9dd72b6
clean up
stevehuang52 Mar 26, 2024
11facc7
fix speechlm test
stevehuang52 Mar 26, 2024
3da8282
update doc
stevehuang52 Mar 26, 2024
179fafd
Merge branch 'main' into heh/modular_speechllm_pr
stevehuang52 Mar 26, 2024
14c1334
refactor
stevehuang52 Mar 26, 2024
98a0143
refactor
stevehuang52 Mar 27, 2024
7dbe84d
refactor
stevehuang52 Mar 27, 2024
3a039f5
fix multi-layer feat
stevehuang52 Mar 27, 2024
55c9e04
Merge remote-tracking branch 'origin/main' into heh/modular_speechllm_pr
stevehuang52 Mar 27, 2024
073212b
update for webdataset
stevehuang52 Mar 27, 2024
ba86fb9
refactor
stevehuang52 Apr 3, 2024
fdfe7b5
force str to avoid bugs with implicit conversion of str to bool type
stevehuang52 Apr 4, 2024
18b2921
Update examples/multimodal/speech_llm/README.md
stevehuang52 Apr 5, 2024
fef24dc
Update examples/multimodal/speech_llm/README.md
stevehuang52 Apr 5, 2024
c532150
refactor
stevehuang52 Apr 5, 2024
21d4261
Merge branch 'heh/modular_speechllm_pr' of https://github.com/NVIDIA/…
stevehuang52 Apr 5, 2024
c2f6b78
refactor
stevehuang52 Apr 5, 2024
647e184
update for saving nemo
stevehuang52 Apr 5, 2024
36df825
update eval and ngc ckpt
stevehuang52 Apr 5, 2024
f6a90d1
Update nemo/collections/multimodal/speech_llm/data/audio_text_qa_data…
stevehuang52 Apr 8, 2024
d73a684
Update nemo/collections/multimodal/speech_llm/modules/common/audio_te…
stevehuang52 Apr 8, 2024
3dea3ce
Update tests/collections/multimodal/test_speechllm_models.py
stevehuang52 Apr 8, 2024
aa4f85b
refactor and remove nlp adapter mixin assert
stevehuang52 Apr 8, 2024
9e10694
Merge branch 'heh/modular_speechllm_pr' of https://github.com/NVIDIA/…
stevehuang52 Apr 8, 2024
360acd4
remove random context augmentation
stevehuang52 Apr 8, 2024
6449924
fix docstring
stevehuang52 Apr 8, 2024
52617f9
add docstring
stevehuang52 Apr 8, 2024
7c78165
minor refactor
stevehuang52 Apr 11, 2024
ed29843
refactor
stevehuang52 Apr 11, 2024
5a4be92
refactor and fix missing import
stevehuang52 Apr 12, 2024
c991e5b
Merge branch 'main' into heh/modular_speechllm_pr
pablo-garay Apr 13, 2024
79156fc
major refactor on input format and minor update
stevehuang52 Apr 16, 2024
0268898
Merge branch 'heh/modular_speechllm_pr' of https://github.com/NVIDIA/…
stevehuang52 Apr 16, 2024
b6cac3d
fix codeQL
stevehuang52 Apr 17, 2024
8b19dc5
Merge branch 'main' into heh/modular_speechllm_pr
stevehuang52 Apr 17, 2024
960f958
clean up
stevehuang52 Apr 17, 2024
2e18366
Merge remote-tracking branch 'origin/main' into heh/modular_speechllm_pr
stevehuang52 Apr 24, 2024
8043262
Merge branch 'main' into heh/modular_speechllm_pr
stevehuang52 May 6, 2024
55f8231
update for NGC ckpt and refactor
stevehuang52 May 6, 2024
d9e2788
clean up
stevehuang52 May 6, 2024
3cd12e9
Merge branch 'main' into heh/modular_speechllm_pr
ericharper May 7, 2024
30a583a
Merge branch 'main' into heh/modular_speechllm_pr
ericharper May 7, 2024
55b270b
Merge branch 'main' into heh/modular_speechllm_pr
nithinraok May 8, 2024
1c4cbd7
Merge branch 'main' into heh/modular_speechllm_pr
stevehuang52 May 9, 2024
3e88457
skip speechlm test until data moved to CI machines
stevehuang52 May 9, 2024
17ab55b
Merge branch 'main' into heh/modular_speechllm_pr
stevehuang52 May 10, 2024
6cae145
Merge branch 'main' into heh/modular_speechllm_pr
stevehuang52 May 10, 2024
4cfaa30
refactor and update to avoid changing nlp_adapter_mixin
stevehuang52 May 10, 2024
27e33ee
Merge branch 'heh/modular_speechllm_pr' of https://github.com/NVIDIA/…
stevehuang52 May 10, 2024
67ecaa1
Merge branch 'main' into heh/modular_speechllm_pr
stevehuang52 May 10, 2024
89926fa
Apply isort and black reformatting
stevehuang52 May 10, 2024
5c7f18e
minor fix
stevehuang52 May 11, 2024
cdcb258
Merge branch 'heh/modular_speechllm_pr' of https://github.com/NVIDIA/…
stevehuang52 May 11, 2024
1afc6e1
Apply isort and black reformatting
stevehuang52 May 11, 2024
163 changes: 163 additions & 0 deletions examples/multimodal/speech_llm/README.md
@@ -0,0 +1,163 @@
# Modular SpeechLLM

This directory contains example scripts to train and evaluate modular SpeechLLMs (e.g., SALM [1]).

## Requirements
You will need to install this specific branch of NeMo, or use the provided Dockerfile in the root directory of this repository to build a Docker image with all the necessary dependencies.

## Architecture

In general, a modular SpeechLLM has three main components (a minimal sketch of how they are composed is shown below):
- An audio encoder that processes the input audio and produces a sequence of audio embeddings.
- A modality adapter that processes the audio embeddings and produces a sequence of embeddings in the same latent space as the token embeddings of a pretrained large language model (LLM).
- A pretrained LLM that processes the embeddings from the modality adapter as well as the token embeddings of the input prompt, and produces the text output. The audio embeddings and text token embeddings are concatenated along the time dimension before going into the LLM.
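
The following is a minimal, self-contained sketch of how these three components fit together. The class and argument names are illustrative (not the actual NeMo classes); it only shows the shape of the data flow.

```python
import torch
import torch.nn as nn

class ModularSpeechLLMSketch(nn.Module):
    """Illustrative composition of audio encoder, modality adapter, and LLM."""

    def __init__(self, audio_encoder: nn.Module, modality_adapter: nn.Module, llm: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder        # e.g. a (possibly frozen) Conformer encoder
        self.modality_adapter = modality_adapter  # maps encoder dim -> LLM hidden dim
        self.llm = llm                            # pretrained LLM that consumes input embeddings

    def forward(self, audio_features: torch.Tensor, prompt_embeddings: torch.Tensor) -> torch.Tensor:
        audio_emb = self.audio_encoder(audio_features)    # [B, T_audio, D_enc]
        audio_emb = self.modality_adapter(audio_emb)      # [B, T_audio, D_llm]
        # audio embeddings and text token embeddings are concatenated along the time dimension
        llm_inputs = torch.cat([audio_emb, prompt_embeddings], dim=1)  # [B, T_audio + T_text, D_llm]
        return self.llm(llm_inputs)                       # text logits
```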


## Usage

### Input Format

You'll need to prepare data in the NeMo manifest format, where each line is a JSON dictionary with fields such as:
```
{
"audio_filepath": "path/to/audio.wav",
"offset": 0.0, # offset of the audio in seconds, this is an optional field
"duration": 10.0 , # duration of the audio in seconds, can set to `None` to load the whole audio
"question": "what is the transcription of the audio?", # text prompt for the audio, see below for more details
"answer": "the transcription of the audio", # optional for inference, default to "na" in dataloader
stevehuang52 marked this conversation as resolved.
Show resolved Hide resolved
}
```
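
As a small illustration (file name and contents below are hypothetical), a manifest is just plain JSON lines and can be written like this:

```python
import json

samples = [
    {
        "audio_filepath": "path/to/audio1.wav",
        "duration": 10.0,
        "question": "what is the transcription of the audio?",
        "answer": "the transcription of the first audio",
    },
    {
        "audio_filepath": "path/to/audio2.wav",
        "duration": 7.5,
        "question": "what is the transcription of the audio?",
        "answer": "the transcription of the second audio",
    },
]

# one JSON dictionary per line, with no trailing commas or comments
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```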

The `question` field in the manifest is optional. You can instead put a list of questions in a question file (one question per line) and set `++model.data.train_ds.question_file=<path to question file>` so that the dataloader randomly picks a question from the file for each audio sample. This is useful for training with multiple prompts for the same task. If neither the `question` field nor `question_file` is provided, the dataloader will use the default question `what does the audio mean?` for all audio samples. During inference, it is recommended to include the `question` field in the manifest.


### Training

There are several configs for training a SpeechLLM:
- `conf/modular_audio_gpt_config_peft.yaml`: a config for training a SpeechLLM with PEFT (e.g., LoRA), where you don't want to tune the whole LLM but still want to adapt the LLM to your needs.
> **Reviewer:** Refactor into a more hierarchical structure inside `conf/*/**`.
>
> **Reviewer:** ++

- `conf/modular_audio_gpt_config_sft.yaml`: a config for training a SpeechLLM without PEFT, where you may want to tune the whole LLM, or simply freeze it and use it as is.
- `conf/modular_audio_gpt_multi_enc_config_peft.yaml`: a config for training a SpeechLLM with multiple audio encoders and PEFT, where you can add speaker embeddings to the audio embeddings. Currently only TitaNet is supported as the speaker encoder.

With any config, you can set the following flags to control which components to train or freeze:
- `model.freeze_llm`: generally set to `True` unless you want to fine-tune the whole LLM.
- `model.freeze_audio_encoder`: generally set to `False` unless you want to freeze the audio encoder.
- `model.freeze_modality_adapter`: generally set to `False`, since the modality adapter usually needs to be trained.

In addition to the config file, you will also need to prepare the audio encoder and the LLM as `*.nemo` files.

To train a SpeechLLM that uses LoRA, you can run the following script:
```bash
MEGATRON_MODEL=/path/to/megatron-model.nemo
ASR_MODEL=/path/to/audio-model.nemo # only the encoder part will be loaded. e.g, stt_en_fastconformer_transducer_large.nemo

TRAIN_MANIFESTS="[/data/train_1.json,/data/train_2.json]"
VAL_MANIFESTS="[/data/dev_1.json,/data/dev_2.json]"
VAL_NAMES="[dev-1,dev-2]" # names to display when logging validation results for each dataset

CUDA_VISIBLE_DEVICES="0,1" python modular_audio_gpt_train.py --config-path="./conf" --config-name "modular_audio_gpt_config_peft" \
trainer.devices=-1 \
model.freeze_audio_encoder=True \
model.freeze_llm=True \
model.global_batch_size=4 \ # global_batch_size = micro_batch_size * num_gpus_per_node * num_nodes * accumulate_grad_batches
model.micro_batch_size=2 \ # micro_batch_size = batch_size_per_gpu
model.pretrained_audio_model=$ASR_MODEL \
model.restore_from_path=$MEGATRON_MODEL \
Comment on lines +89 to +90
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this naming scheme seems a bit off. Both are pretrained but one has pretrained name and other has restore_from.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes...restore_from_path is a megatron default name and we shouldn't change that...

model.data.train_ds.manifest_filepath=$TRAIN_MANIFESTS \
model.data.validation_ds.manifest_filepath=$VAL_MANIFESTS \
++model.data.validation_ds.names=$VAL_NAMES \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does names correspond to here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names of the datasets, which will be used to display and log the results for each manifest. this is useful when the manifest name does not show the dataset's name. added clarification.

```

You can also use tarred datasets for faster training by converting normal NeMo datasets to tarred datasets using this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py) and follow the same dataset setting as shown in the script. Also, `accumulate_grad_batches` is automatically set by the model based on `global_batch_size` and `micro_batch_size`, so there's no need to manually calculate and set `trainer.accumulate_grad_batches`.
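
For example, using the values from the script above (and assuming a single node with 2 GPUs and no tensor/pipeline parallelism), the derived `accumulate_grad_batches` works out as follows; this is illustrative arithmetic only:

```python
global_batch_size = 4   # model.global_batch_size
micro_batch_size = 2    # model.micro_batch_size
num_gpus_per_node = 2   # e.g. CUDA_VISIBLE_DEVICES="0,1"
num_nodes = 1

data_parallel_size = num_gpus_per_node * num_nodes
accumulate_grad_batches = global_batch_size // (micro_batch_size * data_parallel_size)
print(accumulate_grad_batches)  # 1, i.e. no gradient accumulation needed in this setup
```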


#### **Multi-task Training**

To use a question file, set `++model.data.train_ds.question_file=<path to question file>` on the command line, or use multiple question files with `++model.data.train_ds.question_file=[<path to question file1>,<path to question file2>,...]`. If the number of question files equals the number of provided datasets, the dataloader will assign each question file to a dataset; otherwise, it will randomly pick a question file from all provided question files for each audio sample. Using multiple question files is useful for training on multiple tasks, where each task has its own set of prompts. Meanwhile, you can control the weights of different tasks/datasets by using concatenated tarred datasets, where you assign weights to datasets as follows:
```
++model.data.train_ds.is_tarred=True \
++model.data.train_ds.is_concat=True \
++model.data.train_ds.manifest_filepath=[/path/to/data1/tarred_audio_manifest.json,/path/to/data2/tarred_audio_manifest.json] \
++model.data.train_ds.tarred_audio_filepaths=[/path/to/data1/audio__OP_0..1023_CL_.tar,/path/to/data2/audio__OP_0..1023_CL_.tar] \
++model.data.train_ds.concat_sampling_technique='random' \
++model.data.train_ds.concat_sampling_probabilities=[0.4,0.6] \
```

> **Reviewer:** What is a question file?
>
> **Author:** it's defined in a previous paragraph: "The `question` field in the manifest is optional, and you can put a list of questions in a question file then set `++model.data.train_ds.question_file=<path to question file>` to ask the dataloader to randomly pick a question from the file for each audio sample."
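
For illustration, a question file is just a plain-text file with one prompt per line. The file names and prompts below are hypothetical:

```python
# write one hypothetical question file per task, one prompt per line
asr_questions = [
    "what is the transcription of the audio?",
    "transcribe the speech into written text.",
]
ast_questions = [
    "translate the speech into German.",
    "what is the German translation of the audio?",
]

for path, questions in [("asr_questions.txt", asr_questions), ("ast_questions.txt", ast_questions)]:
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(questions) + "\n")
```

These files could then be passed alongside the two datasets above, e.g. `++model.data.train_ds.question_file=[asr_questions.txt,ast_questions.txt]`.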

#### **Available Audio Encoders**

Currently all NeMo ASR models are supported; other models may also work if they have an `encoder` attribute that returns a sequence of audio embeddings, a `preprocessor` that takes raw audio and returns a sequence of features for the encoder, and a `cfg` attribute that returns an `omegaconf.DictConfig` of the model configuration. In addition to a local model, you can also set `pretrained_audio_model` to a model from NGC (e.g., `stt_en_fastconformer_transducer_large`) or Hugging Face (e.g., `nvidia/parakeet-rnnt-1.1b`), and the script will download the model and use it for training.
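
A quick sketch of that interface, assuming a standard NeMo ASR install; the attribute checks simply mirror the requirements listed above:

```python
import nemo.collections.asr as nemo_asr

# download a pretrained model from NGC/Hugging Face (or use restore_from() for a local .nemo file)
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

assert hasattr(asr_model, "preprocessor")  # raw audio -> features for the encoder
assert hasattr(asr_model, "encoder")       # features -> sequence of audio embeddings
assert hasattr(asr_model, "cfg")           # omegaconf.DictConfig with the model configuration
```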


### Inference

The script you need to perform inference is `modular_audio_gpt_eval.py`, and the corresponding config file is `conf/modular_audio_gpt_config_eval.yaml`, where you mainly need to set the `model.data.test_ds` fields as well as paths to the checkpoints.

#### **Inference with Intermediate Checkpoints**

If you want to perform inference with intermediate checkpoints, where there's no single NeMo checkpoint file that contains all the model parameters, you can use the following script to load each component from its own checkpoint file and perform inference:

```bash
MEGATRON_CKPT=/path/to/megatron-llm.nemo
ALM_DIR=/path/to/nemo_experiments/job_name
# below is the path to the config used during training
ALM_YAML=$ALM_DIR/version_0/hparams.yaml
# this checkpoint file only contains the trainable params; the backslash is used to avoid a hydra parsing error
ALM_CKPT="$ALM_DIR/checkpoints/AudioGPT--validation_wer\=0.2-step\=100000-epoch\=0-last.ckpt"

TEST_MANIFESTS="[/data/test_1.json,/data/test_2.json]"
TEST_NAMES="[test-1,test-2]"

CUDA_VISIBLE_DEVICES=0 python modular_audio_gpt_eval.py \
    model.restore_from_path=$MEGATRON_CKPT \
    model.peft.restore_from_path=$ALM_CKPT \
    model.peft.restore_from_hparams_path=$ALM_YAML \
    model.data.test_ds.manifest_filepath=$TEST_MANIFESTS \
    model.data.test_ds.names=$TEST_NAMES \
    model.data.test_ds.metric.name="bleu" \
    model.data.test_ds.global_batch_size=8 \
    model.data.test_ds.micro_batch_size=8 \
    model.data.test_ds.tokens_to_generate=256 \
    ++inference.greedy=False \
    ++inference.top_k=50 \
    ++inference.top_p=0.95 \
    ++inference.temperature=0.4 \
    ++inference.repetition_penalty=1.2 \
    ++model.data.test_ds.output_dir=${ALM_DIR}
```

> **Reviewer:** While the script is fine, in a future PR let's look into minimizing it as much as possible, reducing the unnecessary parts, and providing a clean Pythonic API for inference.

If you froze the audio encoder during training, you will also need to add the following line to the above script:
```bash
++model.pretrained_audio_model=/path/to/audio/model.nemo
```

If you want to save the intermediate checkpoints to a single NeMo checkpoint file, you can add the following line to the above script:
```bash
++save_to_nemo=/path/to/save/model.nemo
```

#### **Inference with Complete SpeechLLM Checkpoints**

If you want to load a trained SpeechLLM from the cloud, you can use the following script:
```bash
TEST_MANIFESTS="[/data/test_1.json,/data/test_2.json]"
TEST_NAMES="[test-1,test-2]"

CUDA_VISIBLE_DEVICES=0 python modular_audio_gpt_eval.py \
    model.from_pretrained="speechllm_fc_llama2_7b" \
    model.data.test_ds.manifest_filepath=$TEST_MANIFESTS \
    model.data.test_ds.names=$TEST_NAMES \
    model.data.test_ds.global_batch_size=8 \
    model.data.test_ds.micro_batch_size=8 \
    model.data.test_ds.tokens_to_generate=256 \
    ++inference.greedy=False \
    ++inference.top_k=50 \
    ++inference.top_p=0.95 \
    ++inference.temperature=0.4 \
    ++inference.repetition_penalty=1.2 \
    ++model.data.test_ds.output_dir="./test_outputs"
```

> **Reviewer (on `model.from_pretrained="speechllm_fc_llama2_7b"`):** is this published already?
>
> **Author:** not yet, but it will be soon, before the PR is merged.

If you have a local `.nemo` file, you can use `model.restore_from_path=/path/to/model.nemo` to replace the line `model.from_pretrained="speechllm_fc_llama2_7b"` in the above example.


## Reference
[1] Chen, Z.\*, Huang, H.\*, Andrusenko, A., Hrinchuk, O., Puvvada, K.C., Li, J., Ghosh, S., Balam, J. and Ginsburg, B., 2023. SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation. ICASSP'24.
142 changes: 142 additions & 0 deletions examples/multimodal/speech_llm/conf/modular_audio_gpt_config_eval.yaml
@@ -0,0 +1,142 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
> **Reviewer (on file placement):** Do not keep this config at the `conf/*` level; move it to the `conf/*/**` level.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

> **Reviewer:** @titu1994 do we need to add a LICENSE header for configs and notebooks?
>
> **Reviewer:** Not for configs usually. Notebooks - hmm, not needed there either (at least we haven't done it before).

# this config is used to perform inference on SpeechLLM checkpoints
name: megatron_audio_gpt_eval
> **Reviewer:** add some info on what this config is for.
>
> **Author:** added.


trainer:
devices: 1
accelerator: gpu
num_nodes: 1
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
max_epochs: 9999
> **Reviewer (on `max_epochs: 9999`):** minor: why not -1?
>
> **Author:** we still want to use epoch-based training and want wandb to log the epoch info, but Megatron does not work properly on LR scheduling with max_steps=-1, so we set a large max_steps to make sure the LR schedule is correct and also use max_epochs to log the epoch info.
>
> **Reviewer:** Are you saying that with epoch=-1 there is an issue while logging the epoch number to wandb?

max_steps: 20000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
log_every_n_steps: 10 # frequency with which training steps are logged
val_check_interval: 200 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
gradient_clip_val: 1.0

exp_manager:
explicit_log_dir: null
exp_dir: null
name: ${name}
create_wandb_logger: False
wandb_logger_kwargs:
project: null
name: null
resume_if_exists: True
resume_ignore_no_checkpoint: True
create_checkpoint_callback: True
checkpoint_callback_params:
monitor: validation_${model.data.validation_ds.metric.name}
save_top_k: 1
mode: min
save_nemo_on_train_end: True
filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}'
model_parallel_size: ${model.tensor_model_parallel_size}
always_save_nemo: True
save_best_model: False

model:
from_pretrained: null # pretrained model name on NGC or HF
restore_from_path: null # Path to an existing .nemo model you wish to add new tasks to or run inference with
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
pretrained_audio_model: null # Path to a .nemo model for audio encoder

seed: 1234
tensor_model_parallel_size: 1 # intra-layer model parallelism
pipeline_model_parallel_size: 1 # inter-layer model parallelism

global_batch_size: 1
micro_batch_size: 1
sync_batch_comm: False
megatron_amp_O2: False

## Sequence Parallelism
# Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
# See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
sequence_parallel: False

## Activation Checkpoint
activations_checkpoint_granularity: null # 'selective' or 'full'
activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
# 'uniform' divides the total number of transformer layers and checkpoints the input activation
# of each chunk at the specified granularity
# 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
activations_checkpoint_num_layers: null # not used with 'selective'
activations_checkpoint_layers_per_pipeline: null
answer_only_loss: False # not used right now
gradient_as_bucket_view: False

hidden_dropout: 0.0
attention_dropout: 0.0
ffn_dropout: 0.0

peft: # keep these basic params for reusing in both sft and peft SpeechLMs
restore_from_path: null
restore_from_hparams_path: null
restore_from_ckpt:
checkpoint_name: null
checkpoint_dir: null


data:
test_ds:
manifest_filepath: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
names: null # Names of the corresponding datasets used to log metrics.
global_batch_size: 1
micro_batch_size: 1
shuffle: False
num_workers: 0
pin_memory: True
max_seq_length: 2048
min_seq_length: 1
drop_last: False
end_string: ${data.train_ds.end_string} # don't change, let hydra resolve from saved config
context_key: ${data.train_ds.context_key} # don't change, let hydra resolve from saved config
label_key: ${data.train_ds.label_key} # don't change, let hydra resolve from saved config
add_eos: ${data.train_ds.add_eos} # don't change, let hydra resolve from saved config
add_sep: ${data.train_ds.add_sep} # don't change, let hydra resolve from saved config
add_bos: ${data.train_ds.add_bos} # don't change, let hydra resolve from saved config
separate_prompt_and_response_with_newline: ${data.train_ds.separate_prompt_and_response_with_newline}
write_predictions_to_file: True
output_file_path_prefix: "preds" # Prefix of the file to write predictions to.
truncation_field: ${data.train_ds.truncation_field} # don't change, let hydra resolve from saved config
index_mapping_dir: null # Path to a directory to write index mapping files.
prompt_template: ${data.train_ds.prompt_template} # don't change, let hydra resolve from saved config
tokens_to_generate: 512
log_every_n_steps: 1
sample_rate: ${data.train_ds.sample_rate} # don't change, let hydra resolve from saved config
audio_locator: null # set it to allow multiple audios in a sample, e.g. '|audio|', and use it in the context field of manifest to specify the locations of audios (`audio_filepath` is a list of audios).

metric:
name: "bleu" # Name of the evaluation metric to use. Options: ['exact_string_match', 'loss', 'wer', 'bleu', 'rouge']
average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported.
num_classes: null

save_as_nemo: null # optional string, set to save the whole model into a single nemo file
evaluate_metric: True # set 'true' to calculate metrics (must have labels in data), 'false' to use trainer.predict() to do inference only
inference:
greedy: True # Whether or not to use sampling ; use greedy decoding otherwise
top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: 1.0 # sampling temperature
all_probs: False # whether return the log prob for all the tokens in vocab
repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty.
min_tokens_to_generate: 0 # The minimum length of the sequence to be generated.
compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False
outfile_path: output.txt
compute_attention_mask: True
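
For reference, here is a minimal sketch (not NeMo's actual implementation) of how the sampling options above (`greedy`, `temperature`, `top_k`, `top_p`) typically combine when picking the next token; `repetition_penalty` is omitted for brevity:

```python
import torch

def sample_next_token(logits, greedy=False, temperature=1.0, top_k=0, top_p=0.9):
    """logits: 1-D tensor of unnormalized scores over the vocabulary."""
    if greedy:
        return int(torch.argmax(logits))
    logits = logits / max(temperature, 1e-5)                 # <1 sharpens, >1 flattens the distribution
    if top_k > 0:                                            # keep only the k most likely tokens
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    if top_p < 1.0:                                          # nucleus (top-p) filtering
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = torch.softmax(sorted_logits, dim=-1)
        cum_probs = torch.cumsum(sorted_probs, dim=-1)
        drop = (cum_probs - sorted_probs) > top_p            # drop tokens once top_p mass is already kept
        sorted_logits[drop] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```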