
Commit

Prompt learning of Huggingface T5v1.1 converted checkpoints (NVIDIA#4746)

* update branch

Signed-off-by: ericharper <[email protected]>

* update package info and dockerfile

Signed-off-by: ericharper <[email protected]>

* fix fastpitch export (NVIDIA#4676)

Signed-off-by: Jason <[email protected]>

* [TTS] fixed wrong pronunciations for r1.11. (NVIDIA#4677)

* [TTS] fixed wrong pronunciations.

Signed-off-by: Xuesong Yang <[email protected]>

* incremented the version number to 22.08 as @blisc suggested.

Signed-off-by: Xuesong Yang <[email protected]>

* correct cmudict versions in world-wide places.

Signed-off-by: Xuesong Yang <[email protected]>

* Fix for incorrect batch size issue while decoding (NVIDIA#4675)

Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Initial

Signed-off-by: MaximumEntropy <[email protected]>

* Fix for RPE

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Make megatron legacy configurable

Signed-off-by: MaximumEntropy <[email protected]>

* Enc-Dec checksum matching

Signed-off-by: MaximumEntropy <[email protected]>

* Add conversion script

Signed-off-by: MaximumEntropy <[email protected]>

* Reset files

Signed-off-by: MaximumEntropy <[email protected]>

* Reset docker and jenkinsfile

Signed-off-by: MaximumEntropy <[email protected]>

* Reset README

Signed-off-by: MaximumEntropy <[email protected]>

* Remove tts scripts files

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Update finetuning script

Signed-off-by: MaximumEntropy <[email protected]>

* add cloning

Signed-off-by: Abhinav Khattar <[email protected]>

* map to cpu

Signed-off-by: Abhinav Khattar <[email protected]>

* Fix TP change for HF exported models

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Update conversion script and style

Signed-off-by: MaximumEntropy <[email protected]>

* Add base config

Signed-off-by: MaximumEntropy <[email protected]>

* Add arg

Signed-off-by: MaximumEntropy <[email protected]>

* Change partition comment update

Signed-off-by: MaximumEntropy <[email protected]>

* Update base config

Signed-off-by: MaximumEntropy <[email protected]>

* Minor fix for prompt learning

Signed-off-by: MaximumEntropy <[email protected]>

* style

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix default

Signed-off-by: MaximumEntropy <[email protected]>

* Fix to latest ptl

Signed-off-by: MaximumEntropy <[email protected]>

* Add arg to perceiver

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Temporarily add

Signed-off-by: MaximumEntropy <[email protected]>

* Restore

Signed-off-by: MaximumEntropy <[email protected]>

* Move tokens head bias to cfg population

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Empty

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes to get decode to work.

Signed-off-by: MaximumEntropy <[email protected]>

* More changes

Signed-off-by: MaximumEntropy <[email protected]>

* Update base config

Signed-off-by: MaximumEntropy <[email protected]>

* Test

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Update config to 0 dropout

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Reset file

Signed-off-by: MaximumEntropy <[email protected]>

* Remove scheduler

Signed-off-by: MaximumEntropy <[email protected]>

* Changes

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Support generic bos id

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Minor changes

Signed-off-by: MaximumEntropy <[email protected]>

* Add embedding dropout

Signed-off-by: MaximumEntropy <[email protected]>

* Changes for ul2

Signed-off-by: MaximumEntropy <[email protected]>

* Fix for pad id

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Update models that can be converted

Signed-off-by: MaximumEntropy <[email protected]>

* Fix inference

Signed-off-by: MaximumEntropy <[email protected]>

* Remove ipdb

Signed-off-by: MaximumEntropy <[email protected]>

* Fix typo

Signed-off-by: MaximumEntropy <[email protected]>

* Load ul2 in bf16

Signed-off-by: MaximumEntropy <[email protected]>

* Add amp o2 arg

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Tmp

Signed-off-by: MaximumEntropy <[email protected]>

* Fix rmsnorm

Signed-off-by: MaximumEntropy <[email protected]>

* Reset config

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix eval for converted models

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Update predict step for adapters

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: ericharper <[email protected]>
Signed-off-by: Jason <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Abhinav Khattar <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: Jason <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Rajesh Ilango <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Hainan Xu <[email protected]>
9 people authored and Hainan Xu committed Nov 29, 2022
1 parent 0630726 commit d13c77e
Showing 31 changed files with 979 additions and 248 deletions.
3 changes: 3 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_bart_config.yaml
@@ -81,6 +81,9 @@ model:
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# token head
tokens_head_bias: True

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
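The new `tokens_head_bias` flag controls whether the decoder's projection from hidden states to vocabulary logits carries a bias term. A minimal sketch of the assumed semantics in plain PyTorch (not the actual NeMo head implementation; sizes are hypothetical):

```python
# Illustrative only: tokens_head_bias toggles the bias on the logits head.
# HF T5/T5v1.1 LM heads are bias-free, so HF-converted checkpoints are
# expected to set tokens_head_bias: False.
import torch.nn as nn

hidden_size, vocab_size = 768, 32128  # hypothetical sizes
tokens_head_bias = True               # config default shown above

tokens_head = nn.Linear(hidden_size, vocab_size, bias=tokens_head_bias)
print(tokens_head.bias is not None)   # True under the default config
```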
@@ -5,6 +5,7 @@ num_attention_heads: 12
init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.
hidden_dropout: 0.1 # Dropout probability for hidden state transformer.
attention_dropout: 0.1 # Dropout probability in the attention layer.
ffn_dropout: 0.0 # Dropout probability in the feed-forward layer.
position_embedding_type: 'learned_absolute' # Position embedding type. Options ['learned_absolute', 'relative']
relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias
relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets.
@@ -30,3 +31,5 @@ onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
fp32_residual_connection: False # Use FP32 for residual connections.
activations_checkpoint_method: null # 'uniform', 'block'
activations_checkpoint_num_layers: 1
megatron_legacy: False # Whether to use the legacy Megatron model. This affects the way q,k,v is partitioned from the mixed q,k,v layer in ParallelAttention. This needs to be True for models converted from HF.
normalize_attention_scores: True # Whether to scale the output Q * K^T by 1 / sqrt(hidden_size_per_head). This arg is provided as a configuration option mostly for compatibility with models that have been weight-converted from HF. You almost always want to set this to True.
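As the comments above note, `megatron_legacy` changes how the fused q,k,v weight is partitioned, and `normalize_attention_scores` exists because some converted checkpoints (HF T5 among them) were trained without the 1/sqrt(head_dim) scaling of attention scores. A short sketch of the assumed scaling behavior (illustrative PyTorch, not the actual `ParallelAttention` code; shapes are made up):

```python
# Sketch: when normalize_attention_scores is True, raw Q*K^T scores are
# divided by sqrt(head_dim) before the softmax; when False (e.g. for some
# HF-converted T5 weights) they are used unscaled.
import math
import torch

normalize_attention_scores = True
q = torch.randn(2, 8, 16, 64)  # [batch, heads, seq, head_dim] (hypothetical)
k = torch.randn(2, 8, 16, 64)

scores = torch.matmul(q, k.transpose(-1, -2))
if normalize_attention_scores:
    scores = scores / math.sqrt(q.size(-1))
attn = torch.softmax(scores, dim=-1)
```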
3 changes: 3 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_t5_config.yaml
@@ -81,6 +81,9 @@ model:
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# token head
tokens_head_bias: True

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
@@ -71,11 +71,15 @@ model:
data:
train_ds: ["data/squad_train.jsonl"]
validation_ds: ["data/squad_val.jsonl"]
add_eos: True
shuffle: True
num_workers: 8
pin_memory: True

add_eos: true
add_bos: false
decoder_starts_with_pad: False
add_eos_to_decoder_output: True
add_sentinel_to_input: True
ul2_prompt_token: null # <extra_id_s>, <extra_id_r>, <extra_id_x>
shuffle: true
num_workers: 4
pin_memory: true

optim:
name: fused_adam
@@ -90,6 +94,4 @@
constant_steps: 0
min_lr: 0.0
monitor: val_loss
reduce_on_plateau: false


reduce_on_plateau: false
3 changes: 3 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_ul2_config.yaml
@@ -80,6 +80,9 @@ model:
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# token head
tokens_head_bias: True

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
50 changes: 45 additions & 5 deletions examples/nlp/language_modeling/megatron_change_num_partitions.py
@@ -61,7 +61,7 @@ def merge_partition(model, partitions, write_path=None):
model.save_to(write_path)


def split_partition(model, partitions, tp_size, write_path=None):
def split_partition(model, partitions, tp_size, write_path=None, megatron_legacy=False):
if len(partitions) != 1:
raise ValueError(
"Can only split partitions of model with TP=1. For partitions of models with TP>1, merge first."
@@ -80,13 +80,38 @@ def split_partition(model, partitions, tp_size, write_path=None):

idx = 0
splits = []
for _, param in model.named_parameters():
for param_name, param in model.named_parameters():
if param.shape == partitions[0][idx].shape:
split = [partitions[0][idx].data] * tp_size
elif param.shape[0] == partitions[0][idx].shape[0]:
split = torch.split(partitions[0][idx].data, param.shape[-1], dim=-1)
else:
split = torch.split(partitions[0][idx].data, param.shape[0], dim=0)
# For T5-converted weights, the splitting needs to be strided such that q,k,v weights are bunched together on each tensor-parallel rank.
if 'query_key_value.weight' in param_name and megatron_legacy:
split_dim = partitions[0][idx].data.shape[0]
if split_dim % (tp_size * 3) != 0:
raise ValueError(
f"Can not split Q,K,V parameter {param_name} with shape {param.shape} into tensor parallel size {tp_size}. Not divisible by {tp_size * 3}."
)
tp_qkv_splits = torch.chunk(partitions[0][idx].data, tp_size * 3, dim=0)
split = []
for i in range(tp_size):
tp_qkv = torch.cat([tp_qkv_splits[item] for item in range(i, tp_size * 3, tp_size)])
split.append(tp_qkv)
elif 'key_value.weight' in param_name and megatron_legacy:
split_dim = partitions[0][idx].data.shape[0]
if split_dim % (tp_size * 2) != 0:
raise ValueError(
f"Can not split K,V parameter {param_name} with shape {param.shape} into tensor parallel size {tp_size}. Not divisible by {tp_size * 2}."
)
tp_qkv_splits = torch.chunk(partitions[0][idx].data, tp_size * 2, dim=0)
split = []
for i in range(tp_size):
tp_qkv = torch.cat([tp_qkv_splits[item] for item in range(i, tp_size * 2, tp_size)])
split.append(tp_qkv)
# Regular split for Megatron and NeMo-Megatron models.
else:
split = torch.split(partitions[0][idx].data, param.shape[0], dim=0)
splits.append(split)
idx += 1

@@ -134,6 +159,18 @@ def main():
help="NeMo model class. This script should support all NeMo megatron models that use Tensor Parallel",
)
parser.add_argument("--precision", default=16, help="PyTorch Lightning Trainer precision flag")
parser.add_argument(
"--megatron_legacy",
action="store_true",
help="Converter for legacy megatron modles that have different q,k,v weight splits",
)
parser.add_argument(
"--tokenizer_model_path",
type=str,
required=False,
default=None,
help="Path to the tokenizer model path if your model uses a tokenizer model as an artifact. This is needed if your model uses a sentencepiece tokenizer.",
)

args = parser.parse_args()

@@ -169,6 +206,8 @@
model.cfg.tensor_model_parallel_size = 1
app_state.model_parallel_size = 1
trainer = Trainer(devices=1, strategy=NLPDDPStrategy(), accelerator="cpu", precision=precision)
if args.tokenizer_model_path is not None:
model.cfg.tokenizer.model = args.tokenizer_model_path
model = cls(model.cfg, trainer).to('cpu')
model._save_restore_connector = NLPSaveRestoreConnector()

@@ -188,10 +227,11 @@
model.cfg.tensor_model_parallel_size = tgt_tp_size
app_state.model_parallel_size = tgt_tp_size
trainer = Trainer(devices=1, strategy=NLPDDPStrategy(), accelerator="cpu", precision=precision)
if args.tokenizer_model_path is not None:
model.cfg.tokenizer.model = args.tokenizer_model_path
model = cls(model.cfg, trainer).to('cpu')
model._save_restore_connector = NLPSaveRestoreConnector()

split_partition(model, partitions, tgt_tp_size, args.target_file)
split_partition(model, partitions, tgt_tp_size, args.target_file, args.megatron_legacy)

logging.info("Successfully finished changing partitions!")

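The strided split above is the heart of the `--megatron_legacy` path: a fused `[Q; K; V]` weight must be chunked into `tp_size * 3` pieces and re-gathered with stride `tp_size` so every tensor-parallel rank receives one shard each of Q, K and V. A self-contained sketch with toy sizes (outside NeMo, same chunk-and-cat logic as `split_partition` above):

```python
# Toy demonstration of the strided Q,K,V split for tp_size=2.
# A naive torch.split along dim 0 would hand all of Q to rank 0;
# the strided regather keeps Q, K and V shards together per rank.
import torch

tp_size, hidden = 2, 8
qkv_weight = torch.arange(3 * hidden * hidden, dtype=torch.float32).reshape(3 * hidden, hidden)

chunks = torch.chunk(qkv_weight, tp_size * 3, dim=0)  # [Q0, Q1, K0, K1, V0, V1]
split = [
    torch.cat([chunks[j] for j in range(i, tp_size * 3, tp_size)])  # rank i gets Qi, Ki, Vi
    for i in range(tp_size)
]

assert all(s.shape == (3 * hidden // tp_size, hidden) for s in split)
```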
21 changes: 20 additions & 1 deletion examples/nlp/language_modeling/megatron_t5_eval.py
@@ -16,6 +16,7 @@
from argparse import ArgumentParser

import torch
from omegaconf.omegaconf import OmegaConf, open_dict
from pytorch_lightning.trainer.trainer import Trainer
from torch.utils.data import DataLoader

@@ -47,6 +48,8 @@ def main():
"--pipeline_model_parallel_split_rank", type=int, default=0, required=False,
)
parser.add_argument("--precision", default="16", type=str, help="PyTorch Lightning Trainer precision flag")
parser.add_argument("--decoder_starts_with_pad", action="store_true", help="Decoder starts with pad token")
parser.add_argument("--add_eos_to_encoder_input", action="store_true", help="Encoder input ends with EOS token")
args = parser.parse_args()

# cast precision to int if 32 or 16
@@ -79,14 +82,30 @@
pipeline_model_parallel_split_rank_=args.pipeline_model_parallel_split_rank,
)

model_cfg = MegatronT5Model.restore_from(
restore_path=args.model_file,
trainer=trainer,
save_restore_connector=NLPSaveRestoreConnector(),
return_config=True,
)
OmegaConf.set_struct(model_cfg, True)
with open_dict(model_cfg):
model_cfg.precision = trainer.precision

model = MegatronT5Model.restore_from(
restore_path=args.model_file, trainer=trainer, save_restore_connector=NLPSaveRestoreConnector(),
restore_path=args.model_file,
trainer=trainer,
save_restore_connector=NLPSaveRestoreConnector(),
override_config_path=model_cfg,
)
model.freeze()
model.training = False

request = {
"prompt": args.prompt,
"tokens_to_generate": args.tokens_to_generate,
"bos_id": model.tokenizer.pad_id if args.decoder_starts_with_pad else model.tokenizer.bos_id,
"add_eos_to_encoder_input": args.add_eos_to_encoder_input,
}

dataset = T5RequestDataset(request, model.tokenizer)
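The two new flags matter for HF-converted models: HF T5 starts decoding from the pad token (its `decoder_start_token_id` is the pad id) and is trained with an EOS appended to the encoder input. A toy illustration of the `bos_id` selection above, using a stub tokenizer (real ids come from the restored model's tokenizer):

```python
# Stub tokenizer ids for illustration only.
class StubTokenizer:
    pad_id, bos_id, eos_id = 0, 1, 2

tokenizer = StubTokenizer()
decoder_starts_with_pad = True  # i.e. --decoder_starts_with_pad was passed

bos_id = tokenizer.pad_id if decoder_starts_with_pad else tokenizer.bos_id
assert bos_id == tokenizer.pad_id
```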
@@ -115,7 +115,7 @@ def dummy():
outputs = trainer.predict(model, test_dl)
with open(cfg.pred_file_path, "w", encoding="utf-8") as pred_file:
for batch in outputs:
preds = batch["predicted_token_ids"]
preds = batch["preds_text"]
for pred in preds:
pred = pred.strip().replace("\n", " ")
pred_file.write(pred + "\n")
16 changes: 13 additions & 3 deletions examples/nlp/language_modeling/megatron_t5_seq2seq_finetune.py
@@ -86,10 +86,20 @@ def main(cfg) -> None:
# Override the T5 configuration with the one from the config file.
OmegaConf.set_struct(t5_cfg, True)
with open_dict(t5_cfg):
t5_cfg.masked_softmax_fusion = False
t5_cfg.megatron_amp_O2 = cfg.model.get('megatron_amp_O2', False)
t5_cfg.hidden_dropout = cfg.model.get('hidden_dropout', 0.1)
t5_cfg.attention_dropout = cfg.model.get('attention_dropout', 0.1)
if hasattr(t5_cfg, 'encoder') and hasattr(t5_cfg, 'decoder'):
t5_cfg.encoder.masked_softmax_fusion = False
t5_cfg.decoder.masked_softmax_fusion = False
t5_cfg.encoder.hidden_dropout = cfg.model.get('hidden_dropout', 0.1)
t5_cfg.decoder.hidden_dropout = cfg.model.get('hidden_dropout', 0.1)
if hasattr(t5_cfg.encoder, 'ffn_dropout'):
t5_cfg.encoder.ffn_dropout = cfg.model.get('ffn_dropout', 0.1)
if hasattr(t5_cfg.decoder, 'ffn_dropout'):
t5_cfg.decoder.ffn_dropout = cfg.model.get('ffn_dropout', 0.1)
else:
t5_cfg.hidden_dropout = cfg.model.get('hidden_dropout', 0.1)
t5_cfg.attention_dropout = cfg.model.get('attention_dropout', 0.1)
t5_cfg.masked_softmax_fusion = False
t5_cfg.data = cfg.model.data
t5_cfg.precision = cfg.trainer.precision
t5_cfg.optim = cfg.model.optim
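The `hasattr` branches above handle both config layouts: newer NeMo T5 configs nest separate `encoder`/`decoder` sub-configs, while older checkpoints keep flat top-level keys. A minimal sketch of the OmegaConf pattern the script relies on, since struct-mode configs reject unknown keys unless opened:

```python
# Sketch of OmegaConf struct mode + open_dict, as used in the script above.
from omegaconf import OmegaConf, open_dict

t5_cfg = OmegaConf.create({'encoder': {'hidden_dropout': 0.1}})
OmegaConf.set_struct(t5_cfg, True)

with open_dict(t5_cfg):
    t5_cfg.encoder.ffn_dropout = 0.0  # adding a new key is allowed here

print(t5_cfg.encoder.ffn_dropout)  # 0.0
```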
3 changes: 3 additions & 0 deletions examples/nlp/machine_translation/conf/aayn_base_megatron.yaml
@@ -80,6 +80,9 @@ model:
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# token head
tokens_head_bias: True

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
Expand Down
@@ -81,6 +81,7 @@ def __init__(self, request: Dict, tokenizer) -> None:
super().__init__()
self.request = request
self.tokenizer = tokenizer
self.add_eos_to_encoder_input = self.request['add_eos_to_encoder_input']

# tokenize prompt
self.request['tokenized_prompt'] = ' '.join(self.tokenizer.text_to_tokens(request['prompt']))
@@ -96,7 +97,10 @@ def mask_prompt(self, sample):
sample[i] = f'<extra_id_{sentinel_idx}>'
sentinel_idx += 1
sample = ' '.join(sample)
sample = torch.LongTensor(self.tokenizer.text_to_ids(sample))
sample = self.tokenizer.text_to_ids(sample)
if self.add_eos_to_encoder_input:
sample = sample + [self.tokenizer.eos_id]
sample = torch.LongTensor(sample)
self.request['masked_sample'] = sample

def __len__(self):
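A stand-in illustration of the change to `T5RequestDataset` above: the EOS id is now appended to the masked encoder input only when the request asks for it (stub tokenizer; a real SentencePiece tokenizer returns subword ids):

```python
# Stub illustration of optional EOS on the encoder input.
import torch

class StubTokenizer:
    eos_id = 2
    def text_to_ids(self, text):
        return [(len(tok) % 97) + 3 for tok in text.split()]  # toy ids

tokenizer = StubTokenizer()
add_eos_to_encoder_input = True  # from request['add_eos_to_encoder_input']

sample = tokenizer.text_to_ids("summarize : the cat sat <extra_id_0>")
if add_eos_to_encoder_input:
    sample = sample + [tokenizer.eos_id]
sample = torch.LongTensor(sample)
```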
@@ -58,7 +58,6 @@ def __init__(
documents=None,
):
super().__init__()

# Params to store.
self.name = name
self.seed = seed
@@ -19,24 +19,55 @@
from tqdm.auto import tqdm

from nemo.collections.nlp.data.language_modeling.megatron.base_prompt_learning_dataset import BasePromptLearningDataset
from nemo.collections.nlp.models.language_modeling.megatron_t5_model import T5Sentinel
from nemo.collections.nlp.modules.common import VirtualPromptSource
from nemo.collections.nlp.modules.common.megatron.utils import build_position_ids
from nemo.utils import logging

__all__ = ['T5PromptLearningDataset']


class T5Sentinel(enum.Enum):
FIRST = '<extra_id_0>'


class T5PromptLearningDataset(BasePromptLearningDataset):
"""
The dataset class for prompt-tuning or p-tuning pretrained T5 models.
"""

def __init__(self, **kwargs):
super().__init__(**kwargs)
def __init__(
self,
datasets,
tokenizer,
virtual_prompt_source: VirtualPromptSource,
task_templates: dict,
pseudo_tokens,
pad_token_id: int,
max_seq_length: int,
min_seq_length: int = 1,
add_bos: bool = False,
add_eos: bool = True,
for_train: bool = True,
decoder_starts_with_pad: bool = False,
add_eos_to_decoder_output: bool = True,
add_sentinel_to_input: bool = True,
ul2_prompt_token: str = None,
):
# These attributes need to be set before calling super().__init__() because the parent class calls `load_data()`, which requires them.
self.decoder_starts_with_pad = decoder_starts_with_pad
self.add_eos_to_decoder_output = add_eos_to_decoder_output
self.add_sentinel_to_input = add_sentinel_to_input
self.ul2_prompt_token = ul2_prompt_token
super().__init__(
datasets=datasets,
tokenizer=tokenizer,
virtual_prompt_source=virtual_prompt_source,
task_templates=task_templates,
pseudo_tokens=pseudo_tokens,
pad_token_id=pad_token_id,
max_seq_length=max_seq_length,
min_seq_length=min_seq_length,
add_bos=add_bos,
add_eos=add_eos,
for_train=for_train,
)

def load_data(self, dataset):
"""
@@ -50,7 +81,6 @@ def load_data(self, dataset):
containing the information needed for a training example
"""
skipped = 0

for json_line in tqdm(dataset):

# Read example dict or load the information for a single example from .json file
@@ -84,9 +114,15 @@
input_example = self._insert_virtual_token_placeholders(input_example, virtual_token_splits)

# a trick to align with the data format in t5 pretraining
input_ids = self.tokenizer.text_to_ids(input_example) + self.tokenizer.text_to_ids(T5Sentinel.FIRST.value)
input_ids = self.tokenizer.text_to_ids(input_example)
if self.add_sentinel_to_input:
input_ids = input_ids + self.tokenizer.text_to_ids(T5Sentinel.FIRST.value)

# Add BOS/EOS to the input of encoder if desired, adds EOS by default
if self.ul2_prompt_token is not None:
ul2_prompt_token_id = self.tokenizer.text_to_ids(self.ul2_prompt_token)
assert len(ul2_prompt_token_id) == 1
input_ids = ul2_prompt_token_id + input_ids
if self.add_bos:
input_ids = [self.tokenizer.bos_id] + input_ids
if self.add_eos:
@@ -100,13 +136,18 @@
if answer_field in doc.keys(): # training and validation
answer_text = doc[answer_field]

if self.decoder_starts_with_pad:
answer_text_ids = [self.tokenizer.pad_id]
else:
answer_text_ids = [self.tokenizer.bos_id]
# a trick to align with the data format in t5 pretraining
answer_text_ids = (
[self.tokenizer.bos_id]
+ self.tokenizer.text_to_ids(T5Sentinel.FIRST.value)
+ self.tokenizer.text_to_ids(answer_text)
+ [self.tokenizer.eos_id]
)
if self.add_sentinel_to_input:
answer_text_ids += self.tokenizer.text_to_ids(T5Sentinel.FIRST.value)
answer_text_ids += self.tokenizer.text_to_ids(answer_text)
if self.add_eos_to_decoder_output:
answer_text_ids += [self.tokenizer.eos_id]
else:
answer_text_ids += self.tokenizer.text_to_ids(T5Sentinel.END.value)

# Skip example if the final length doesn't fit length requirements even after truncation
if self.min_seq_length <= len(input_ids) <= self.max_seq_length:
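Pulling the decoder-side branches above together: the flags decide whether targets start with pad or BOS, whether the sentinel trick is applied, and whether the target ends with EOS or the end sentinel. A hedged, self-contained sketch with a stub tokenizer (flag values are hypothetical; `<extra_id_1>` stands in for `T5Sentinel.END`):

```python
# Stub sketch of decoder target construction under the new flags.
class StubTokenizer:
    pad_id, bos_id, eos_id = 0, 1, 2
    def text_to_ids(self, text):
        return [(len(tok) % 97) + 3 for tok in text.split()]  # toy ids

tok = StubTokenizer()
decoder_starts_with_pad = True    # HF T5 convention: decoder starts with pad
add_sentinel_to_input = False     # hypothetical setting
add_eos_to_decoder_output = True  # hypothetical setting

answer_text_ids = [tok.pad_id] if decoder_starts_with_pad else [tok.bos_id]
if add_sentinel_to_input:
    answer_text_ids += tok.text_to_ids('<extra_id_0>')
answer_text_ids += tok.text_to_ids('Paris')
if add_eos_to_decoder_output:
    answer_text_ids += [tok.eos_id]
else:
    answer_text_ids += tok.text_to_ids('<extra_id_1>')  # stand-in for T5Sentinel.END
```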