
Commit

Prompt learning of Huggingface T5v1.1 converted checkpoints (NVIDIA#4746)

* update branch

Signed-off-by: ericharper <[email protected]>

* update package info and dockerfile

Signed-off-by: ericharper <[email protected]>

* fix fastpitch export (NVIDIA#4676)

Signed-off-by: Jason <[email protected]>

* [TTS] fixed wrong pronunciations for r1.11. (NVIDIA#4677)

* [TTS] fixed wrong pronunciations.

Signed-off-by: Xuesong Yang <[email protected]>

* incremented the version number to 22.08 as @blisc suggested.

Signed-off-by: Xuesong Yang <[email protected]>

* correct cmudict versions in world-wide places.

Signed-off-by: Xuesong Yang <[email protected]>

* Fix for incorrect batch size issue while decoding (NVIDIA#4675)

Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Initial

Signed-off-by: MaximumEntropy <[email protected]>

* Fix for RPE

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Make megatron legacy configurable

Signed-off-by: MaximumEntropy <[email protected]>

* Enc-Dec checksum matching

Signed-off-by: MaximumEntropy <[email protected]>

* Add conversion script

Signed-off-by: MaximumEntropy <[email protected]>

* Reset files

Signed-off-by: MaximumEntropy <[email protected]>

* Reset docker and jenkinsfile

Signed-off-by: MaximumEntropy <[email protected]>

* Reset README

Signed-off-by: MaximumEntropy <[email protected]>

* Remove tts scripts files

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Update finetuning script

Signed-off-by: MaximumEntropy <[email protected]>

* add cloning

Signed-off-by: Abhinav Khattar <[email protected]>

* map to cpu

Signed-off-by: Abhinav Khattar <[email protected]>

* Fix TP change for HF exported models

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Update conversion script and style

Signed-off-by: MaximumEntropy <[email protected]>

* Add base config

Signed-off-by: MaximumEntropy <[email protected]>

* Add arg

Signed-off-by: MaximumEntropy <[email protected]>

* Change partition comment update

Signed-off-by: MaximumEntropy <[email protected]>

* Update base config

Signed-off-by: MaximumEntropy <[email protected]>

* Minor fix for prompt learning

Signed-off-by: MaximumEntropy <[email protected]>

* style

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix default

Signed-off-by: MaximumEntropy <[email protected]>

* Fix to latest ptl

Signed-off-by: MaximumEntropy <[email protected]>

* Add arg to perceiver

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Temporarily add

Signed-off-by: MaximumEntropy <[email protected]>

* Restore

Signed-off-by: MaximumEntropy <[email protected]>

* Move tokens head bias to cfg population

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Empty

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes to get decode to work.

Signed-off-by: MaximumEntropy <[email protected]>

* More changes

Signed-off-by: MaximumEntropy <[email protected]>

* Update base config

Signed-off-by: MaximumEntropy <[email protected]>

* Test

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Update config to 0 dropout

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Reset file

Signed-off-by: MaximumEntropy <[email protected]>

* Remove scheduler

Signed-off-by: MaximumEntropy <[email protected]>

* Changes

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Support generic bos id

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Minor changes

Signed-off-by: MaximumEntropy <[email protected]>

* Add embedding dropout

Signed-off-by: MaximumEntropy <[email protected]>

* Changes for ul2

Signed-off-by: MaximumEntropy <[email protected]>

* Fix for pad id

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Update models that can be converted

Signed-off-by: MaximumEntropy <[email protected]>

* Fix inference

Signed-off-by: MaximumEntropy <[email protected]>

* Remove ipdb

Signed-off-by: MaximumEntropy <[email protected]>

* Fix typo

Signed-off-by: MaximumEntropy <[email protected]>

* Load ul2 in bf16

Signed-off-by: MaximumEntropy <[email protected]>

* Add amp o2 arg

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Tmp

Signed-off-by: MaximumEntropy <[email protected]>

* Fix rmsnorm

Signed-off-by: MaximumEntropy <[email protected]>

* Reset config

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix eval for converted models

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Update predict step for adapters

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: ericharper <[email protected]>
Signed-off-by: Jason <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Abhinav Khattar <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: Jason <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Rajesh Ilango <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Hainan Xu <[email protected]>
9 people authored and Hainan Xu committed Nov 29, 2022
1 parent 0630726 commit d13c77e
Showing 31 changed files with 979 additions and 248 deletions.
3 changes: 3 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_bart_config.yaml
@@ -81,6 +81,9 @@ model:
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# token head
tokens_head_bias: True

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
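The new `tokens_head_bias` flag controls whether the decoder's projection from hidden states to vocabulary logits carries a bias term. A minimal sketch of the assumed semantics in plain PyTorch (not the actual NeMo head implementation; sizes are hypothetical):

```python
# Illustrative only: tokens_head_bias toggles the bias on the logits head.
# HF T5/T5v1.1 LM heads are bias-free, so HF-converted checkpoints are
# expected to set tokens_head_bias: False.
import torch.nn as nn

hidden_size, vocab_size = 768, 32128  # hypothetical sizes
tokens_head_bias = True               # config default shown above

tokens_head = nn.Linear(hidden_size, vocab_size, bias=tokens_head_bias)
print(tokens_head.bias is not None)   # True under the default config
```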
@@ -5,6 +5,7 @@ num_attention_heads: 12
init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.
hidden_dropout: 0.1 # Dropout probability for hidden state transformer.
attention_dropout: 0.1 # Dropout probability in the attention layer.
ffn_dropout: 0.0 # Dropout probability in the feed-forward layer.
position_embedding_type: 'learned_absolute' # Position embedding type. Options ['learned_absolute', 'relative']
relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias
relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets.
@@ -30,3 +31,5 @@ onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
fp32_residual_connection: False # Use FP32 for residual connections.
activations_checkpoint_method: null # 'uniform', 'block'
activations_checkpoint_num_layers: 1
megatron_legacy: False # Whether to use the legacy Megatron model. This affects the way q,k,v is partitioned from the mixed q,k,v layer in ParallelAttention. This needs to be True for models converted from HF.
normalize_attention_scores: True # Whether to scale the output Q * K^T by 1 / sqrt(hidden_size_per_head). This arg is provided as a configuration option mostly for compatibility with models that have been weight-converted from HF. You almost always want to set this to True.
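As the comments above note, `megatron_legacy` changes how the fused q,k,v weight is partitioned, and `normalize_attention_scores` exists because some converted checkpoints (HF T5 among them) were trained without the 1/sqrt(head_dim) scaling of attention scores. A short sketch of the assumed scaling behavior (illustrative PyTorch, not the actual `ParallelAttention` code; shapes are made up):

```python
# Sketch: when normalize_attention_scores is True, raw Q*K^T scores are
# divided by sqrt(head_dim) before the softmax; when False (e.g. for some
# HF-converted T5 weights) they are used unscaled.
import math
import torch

normalize_attention_scores = True
q = torch.randn(2, 8, 16, 64)  # [batch, heads, seq, head_dim] (hypothetical)
k = torch.randn(2, 8, 16, 64)

scores = torch.matmul(q, k.transpose(-1, -2))
if normalize_attention_scores:
    scores = scores / math.sqrt(q.size(-1))
attn = torch.softmax(scores, dim=-1)
```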
3 changes: 3 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_t5_config.yaml
@@ -81,6 +81,9 @@ model:
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# token head
tokens_head_bias: True

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
@@ -71,11 +71,15 @@ model:
data:
train_ds: ["data/squad_train.jsonl"]
validation_ds: ["data/squad_val.jsonl"]
add_eos: True
shuffle: True
num_workers: 8
pin_memory: True

add_eos: true
add_bos: false
decoder_starts_with_pad: False
add_eos_to_decoder_output: True
add_sentinel_to_input: True
ul2_prompt_token: null # <extra_id_s>, <extra_id_r>, <extra_id_x>
shuffle: true
num_workers: 4
pin_memory: true

optim:
name: fused_adam
@@ -90,6 +94,4 @@
constant_steps: 0
min_lr: 0.0
monitor: val_loss
reduce_on_plateau: false


reduce_on_plateau: false
3 changes: 3 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_ul2_config.yaml
@@ -80,6 +80,9 @@ model:
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# token head
tokens_head_bias: True

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
50 changes: 45 additions & 5 deletions examples/nlp/language_modeling/megatron_change_num_partitions.py
@@ -61,7 +61,7 @@ def merge_partition(model, partitions, write_path=None):
model.save_to(write_path)


def split_partition(model, partitions, tp_size, write_path=None):
def split_partition(model, partitions, tp_size, write_path=None, megatron_legacy=False):
if len(partitions) != 1:
raise ValueError(
"Can only split partitions of model with TP=1. For partitions of models with TP>1, merge first."
@@ -80,13 +80,38 @@ def split_partition(model, partitions, tp_size, write_path=None):

idx = 0
splits = []
for _, param in model.named_parameters():
for param_name, param in model.named_parameters():
if param.shape == partitions[0][idx].shape:
split = [partitions[0][idx].data] * tp_size
elif param.shape[0] == partitions[0][idx].shape[0]:
split = torch.split(partitions[0][idx].data, param.shape[-1], dim=-1)
else:
split = torch.split(partitions[0][idx].data, param.shape[0], dim=0)
# For T5-converted weights, the splitting needs to be strided such that q,k,v weights are bunched together on each tensor-parallel rank.
if 'query_key_value.weight' in param_name and megatron_legacy:
split_dim = partitions[0][idx].data.shape[0]
if split_dim % (tp_size * 3) != 0:
raise ValueError(
f"Can not split Q,K,V parameter {param_name} with shape {param.shape} into tensor parallel size {tp_size}. Not divisible by {tp_size * 3}."
)
tp_qkv_splits = torch.chunk(partitions[0][idx].data, tp_size * 3, dim=0)
split = []
for i in range(tp_size):
tp_qkv = torch.cat([tp_qkv_splits[item] for item in range(i, tp_size * 3, tp_size)])
split.append(tp_qkv)
elif 'key_value.weight' in param_name and megatron_legacy:
split_dim = partitions[0][idx].data.shape[0]
if split_dim % (tp_size * 2) != 0:
raise ValueError(
f"Can not split K,V parameter {param_name} with shape {param.shape} into tensor parallel size {tp_size}. Not divisible by {tp_size * 2}."
)
tp_qkv_splits = torch.chunk(partitions[0][idx].data, tp_size * 2, dim=0)
split = []
for i in range(tp_size):
tp_qkv = torch.cat([tp_qkv_splits[item] for item in range(i, tp_size * 2, tp_size)])
split.append(tp_qkv)
# Regular split for Megatron and NeMo-Megatron models.
else:
split = torch.split(partitions[0][idx].data, param.shape[0], dim=0)
splits.append(split)
idx += 1

@@ -134,6 +159,18 @@ def main():
help="NeMo model class. This script should support all NeMo megatron models that use Tensor Parallel",
)
parser.add_argument("--precision", default=16, help="PyTorch Lightning Trainer precision flag")
parser.add_argument(
"--megatron_legacy",
action="store_true",
help="Converter for legacy megatron modles that have different q,k,v weight splits",
)
parser.add_argument(
"--tokenizer_model_path",
type=str,
required=False,
default=None,
help="Path to the tokenizer model path if your model uses a tokenizer model as an artifact. This is needed if your model uses a sentencepiece tokenizer.",
)

args = parser.parse_args()

@@ -169,6 +206,8 @@
model.cfg.tensor_model_parallel_size = 1
app_state.model_parallel_size = 1
trainer = Trainer(devices=1, strategy=NLPDDPStrategy(), accelerator="cpu", precision=precision)
if args.tokenizer_model_path is not None:
model.cfg.tokenizer.model = args.tokenizer_model_path
model = cls(model.cfg, trainer).to('cpu')
model._save_restore_connector = NLPSaveRestoreConnector()

@@ -188,10 +227,11 @@
model.cfg.tensor_model_parallel_size = tgt_tp_size
app_state.model_parallel_size = tgt_tp_size
trainer = Trainer(devices=1, strategy=NLPDDPStrategy(), accelerator="cpu", precision=precision)
if args.tokenizer_model_path is not None:
model.cfg.tokenizer.model = args.tokenizer_model_path
model = cls(model.cfg, trainer).to('cpu')
model._save_restore_connector = NLPSaveRestoreConnector()

split_partition(model, partitions, tgt_tp_size, args.target_file)
split_partition(model, partitions, tgt_tp_size, args.target_file, args.megatron_legacy)

logging.info("Successfully finished changing partitions!")

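The strided split above is the heart of the `--megatron_legacy` path: a fused `[Q; K; V]` weight must be chunked into `tp_size * 3` pieces and re-gathered with stride `tp_size` so every tensor-parallel rank receives one shard each of Q, K and V. A self-contained sketch with toy sizes (outside NeMo, same chunk-and-cat logic as `split_partition` above):

```python
# Toy demonstration of the strided Q,K,V split for tp_size=2.
# A naive torch.split along dim 0 would hand all of Q to rank 0;
# the strided regather keeps Q, K and V shards together per rank.
import torch

tp_size, hidden = 2, 8
qkv_weight = torch.arange(3 * hidden * hidden, dtype=torch.float32).reshape(3 * hidden, hidden)

chunks = torch.chunk(qkv_weight, tp_size * 3, dim=0)  # [Q0, Q1, K0, K1, V0, V1]
split = [
    torch.cat([chunks[j] for j in range(i, tp_size * 3, tp_size)])  # rank i gets Qi, Ki, Vi
    for i in range(tp_size)
]

assert all(s.shape == (3 * hidden // tp_size, hidden) for s in split)
```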
21 changes: 20 additions & 1 deletion examples/nlp/language_modeling/megatron_t5_eval.py
@@ -16,6 +16,7 @@
from argparse import ArgumentParser

import torch
from omegaconf.omegaconf import OmegaConf, open_dict
from pytorch_lightning.trainer.trainer import Trainer
from torch.utils.data import DataLoader

@@ -47,6 +48,8 @@ def main():
"--pipeline_model_parallel_split_rank", type=int, default=0, required=False,
)
parser.add_argument("--precision", default="16", type=str, help="PyTorch Lightning Trainer precision flag")
parser.add_argument("--decoder_starts_with_pad", action="store_true", help="Decoder starts with pad token")
parser.add_argument("--add_eos_to_encoder_input", action="store_true", help="Encoder input ends with EOS token")
args = parser.parse_args()

# cast precision to int if 32 or 16
@@ -79,14 +82,30 @@
pipeline_model_parallel_split_rank_=args.pipeline_model_parallel_split_rank,
)

model_cfg = MegatronT5Model.restore_from(
restore_path=args.model_file,
trainer=trainer,
save_restore_connector=NLPSaveRestoreConnector(),
return_config=True,
)
OmegaConf.set_struct(model_cfg, True)
with open_dict(model_cfg):
model_cfg.precision = trainer.precision

model = MegatronT5Model.restore_from(
restore_path=args.model_file, trainer=trainer, save_restore_connector=NLPSaveRestoreConnector(),
restore_path=args.model_file,
trainer=trainer,
save_restore_connector=NLPSaveRestoreConnector(),
override_config_path=model_cfg,
)
model.freeze()
model.training = False

request = {
"prompt": args.prompt,
"tokens_to_generate": args.tokens_to_generate,
"bos_id": model.tokenizer.pad_id if args.decoder_starts_with_pad else model.tokenizer.bos_id,
"add_eos_to_encoder_input": args.add_eos_to_encoder_input,
}

dataset = T5RequestDataset(request, model.tokenizer)
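The two new flags matter for HF-converted models: HF T5 starts decoding from the pad token (its `decoder_start_token_id` is the pad id) and is trained with an EOS appended to the encoder input. A toy illustration of the `bos_id` selection above, using a stub tokenizer (real ids come from the restored model's tokenizer):

```python
# Stub tokenizer ids for illustration only.
class StubTokenizer:
    pad_id, bos_id, eos_id = 0, 1, 2

tokenizer = StubTokenizer()
decoder_starts_with_pad = True  # i.e. --decoder_starts_with_pad was passed

bos_id = tokenizer.pad_id if decoder_starts_with_pad else tokenizer.bos_id
assert bos_id == tokenizer.pad_id
```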
@@ -115,7 +115,7 @@ def dummy():
outputs = trainer.predict(model, test_dl)
with open(cfg.pred_file_path, "w", encoding="utf-8") as pred_file:
for batch in outputs:
preds = batch["predicted_token_ids"]
preds = batch["preds_text"]
for pred in preds:
pred = pred.strip().replace("\n", " ")
pred_file.write(pred + "\n")
16 changes: 13 additions & 3 deletions examples/nlp/language_modeling/megatron_t5_seq2seq_finetune.py
@@ -86,10 +86,20 @@ def main(cfg) -> None:
# Override the T5 configuration with the one from the config file.
OmegaConf.set_struct(t5_cfg, True)
with open_dict(t5_cfg):
t5_cfg.masked_softmax_fusion = False
t5_cfg.megatron_amp_O2 = cfg.model.get('megatron_amp_O2', False)
t5_cfg.hidden_dropout = cfg.model.get('hidden_dropout', 0.1)
t5_cfg.attention_dropout = cfg.model.get('attention_dropout', 0.1)
if hasattr(t5_cfg, 'encoder') and hasattr(t5_cfg, 'decoder'):
t5_cfg.encoder.masked_softmax_fusion = False
t5_cfg.decoder.masked_softmax_fusion = False
t5_cfg.encoder.hidden_dropout = cfg.model.get('hidden_dropout', 0.1)
t5_cfg.decoder.hidden_dropout = cfg.model.get('hidden_dropout', 0.1)
if hasattr(t5_cfg.encoder, 'ffn_dropout'):
t5_cfg.encoder.ffn_dropout = cfg.model.get('ffn_dropout', 0.1)
if hasattr(t5_cfg.decoder, 'ffn_dropout'):
t5_cfg.decoder.ffn_dropout = cfg.model.get('ffn_dropout', 0.1)
else:
t5_cfg.hidden_dropout = cfg.model.get('hidden_dropout', 0.1)
t5_cfg.attention_dropout = cfg.model.get('attention_dropout', 0.1)
t5_cfg.masked_softmax_fusion = False
t5_cfg.data = cfg.model.data
t5_cfg.precision = cfg.trainer.precision
t5_cfg.optim = cfg.model.optim
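The `hasattr` branches above handle both config layouts: newer NeMo T5 configs nest separate `encoder`/`decoder` sub-configs, while older checkpoints keep flat top-level keys. A minimal sketch of the OmegaConf pattern the script relies on, since struct-mode configs reject unknown keys unless opened:

```python
# Sketch of OmegaConf struct mode + open_dict, as used in the script above.
from omegaconf import OmegaConf, open_dict

t5_cfg = OmegaConf.create({'encoder': {'hidden_dropout': 0.1}})
OmegaConf.set_struct(t5_cfg, True)

with open_dict(t5_cfg):
    t5_cfg.encoder.ffn_dropout = 0.0  # adding a new key is allowed here

print(t5_cfg.encoder.ffn_dropout)  # 0.0
```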
3 changes: 3 additions & 0 deletions examples/nlp/machine_translation/conf/aayn_base_megatron.yaml
@@ -80,6 +80,9 @@ model:
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# token head
tokens_head_bias: True

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
Expand Down
@@ -81,6 +81,7 @@ def __init__(self, request: Dict, tokenizer) -> None:
super().__init__()
self.request = request
self.tokenizer = tokenizer
self.add_eos_to_encoder_input = self.request['add_eos_to_encoder_input']

# tokenize prompt
self.request['tokenized_prompt'] = ' '.join(self.tokenizer.text_to_tokens(request['prompt']))
@@ -96,7 +97,10 @@ def mask_prompt(self, sample):
sample[i] = f'<extra_id_{sentinel_idx}>'
sentinel_idx += 1
sample = ' '.join(sample)
sample = torch.LongTensor(self.tokenizer.text_to_ids(sample))
sample = self.tokenizer.text_to_ids(sample)
if self.add_eos_to_encoder_input:
sample = sample + [self.tokenizer.eos_id]
sample = torch.LongTensor(sample)
self.request['masked_sample'] = sample

def __len__(self):
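A stand-in illustration of the change to `T5RequestDataset` above: the EOS id is now appended to the masked encoder input only when the request asks for it (stub tokenizer; a real SentencePiece tokenizer returns subword ids):

```python
# Stub illustration of optional EOS on the encoder input.
import torch

class StubTokenizer:
    eos_id = 2
    def text_to_ids(self, text):
        return [(len(tok) % 97) + 3 for tok in text.split()]  # toy ids

tokenizer = StubTokenizer()
add_eos_to_encoder_input = True  # from request['add_eos_to_encoder_input']

sample = tokenizer.text_to_ids("summarize : the cat sat <extra_id_0>")
if add_eos_to_encoder_input:
    sample = sample + [tokenizer.eos_id]
sample = torch.LongTensor(sample)
```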
@@ -58,7 +58,6 @@ def __init__(
documents=None,
):
super().__init__()

# Params to store.
self.name = name
self.seed = seed
@@ -19,24 +19,55 @@
from tqdm.auto import tqdm

from nemo.collections.nlp.data.language_modeling.megatron.base_prompt_learning_dataset import BasePromptLearningDataset
from nemo.collections.nlp.models.language_modeling.megatron_t5_model import T5Sentinel
from nemo.collections.nlp.modules.common import VirtualPromptSource
from nemo.collections.nlp.modules.common.megatron.utils import build_position_ids
from nemo.utils import logging

__all__ = ['T5PromptLearningDataset']


class T5Sentinel(enum.Enum):
FIRST = '<extra_id_0>'


class T5PromptLearningDataset(BasePromptLearningDataset):
"""
The dataset class for prompt-tuning or p-tuning pretrained T5 models.
"""

def __init__(self, **kwargs):
super().__init__(**kwargs)
def __init__(
self,
datasets,
tokenizer,
virtual_prompt_source: VirtualPromptSource,
task_templates: dict,
pseudo_tokens,
pad_token_id: int,
max_seq_length: int,
min_seq_length: int = 1,
add_bos: bool = False,
add_eos: bool = True,
for_train: bool = True,
decoder_starts_with_pad: bool = False,
add_eos_to_decoder_output: bool = True,
add_sentinel_to_input: bool = True,
ul2_prompt_token: str = None,
):
# These attributes need to be set before calling super().__init__() because the parent class calls `load_data()`, which requires them.
self.decoder_starts_with_pad = decoder_starts_with_pad
self.add_eos_to_decoder_output = add_eos_to_decoder_output
self.add_sentinel_to_input = add_sentinel_to_input
self.ul2_prompt_token = ul2_prompt_token
super().__init__(
datasets=datasets,
tokenizer=tokenizer,
virtual_prompt_source=virtual_prompt_source,
task_templates=task_templates,
pseudo_tokens=pseudo_tokens,
pad_token_id=pad_token_id,
max_seq_length=max_seq_length,
min_seq_length=min_seq_length,
add_bos=add_bos,
add_eos=add_eos,
for_train=for_train,
)

def load_data(self, dataset):
"""
@@ -50,7 +81,6 @@ def load_data(self, dataset):
containing the information needed for a training example
"""
skipped = 0

for json_line in tqdm(dataset):

# Read example dict or load the information for a single example from .json file
@@ -84,9 +114,15 @@
input_example = self._insert_virtual_token_placeholders(input_example, virtual_token_splits)

# a trick to align with the data format in t5 pretraining
input_ids = self.tokenizer.text_to_ids(input_example) + self.tokenizer.text_to_ids(T5Sentinel.FIRST.value)
input_ids = self.tokenizer.text_to_ids(input_example)
if self.add_sentinel_to_input:
input_ids = input_ids + self.tokenizer.text_to_ids(T5Sentinel.FIRST.value)

# Add BOS/EOS to the input of encoder if desired, adds EOS by default
if self.ul2_prompt_token is not None:
ul2_prompt_token_id = self.tokenizer.text_to_ids(self.ul2_prompt_token)
assert len(ul2_prompt_token_id) == 1
input_ids = ul2_prompt_token_id + input_ids
if self.add_bos:
input_ids = [self.tokenizer.bos_id] + input_ids
if self.add_eos:
@@ -100,13 +136,18 @@
if answer_field in doc.keys(): # training and validation
answer_text = doc[answer_field]

if self.decoder_starts_with_pad:
answer_text_ids = [self.tokenizer.pad_id]
else:
answer_text_ids = [self.tokenizer.bos_id]
# a trick to align with the data format in t5 pretraining
answer_text_ids = (
[self.tokenizer.bos_id]
+ self.tokenizer.text_to_ids(T5Sentinel.FIRST.value)
+ self.tokenizer.text_to_ids(answer_text)
+ [self.tokenizer.eos_id]
)
if self.add_sentinel_to_input:
answer_text_ids += self.tokenizer.text_to_ids(T5Sentinel.FIRST.value)
answer_text_ids += self.tokenizer.text_to_ids(answer_text)
if self.add_eos_to_decoder_output:
answer_text_ids += [self.tokenizer.eos_id]
else:
answer_text_ids += self.tokenizer.text_to_ids(T5Sentinel.END.value)

# Skip example if the final length doesn't fit length requirements even after truncation
if self.min_seq_length <= len(input_ids) <= self.max_seq_length:
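Pulling the decoder-side branches above together: the flags decide whether targets start with pad or BOS, whether the sentinel trick is applied, and whether the target ends with EOS or the end sentinel. A hedged, self-contained sketch with a stub tokenizer (flag values are hypothetical; `<extra_id_1>` stands in for `T5Sentinel.END`):

```python
# Stub sketch of decoder target construction under the new flags.
class StubTokenizer:
    pad_id, bos_id, eos_id = 0, 1, 2
    def text_to_ids(self, text):
        return [(len(tok) % 97) + 3 for tok in text.split()]  # toy ids

tok = StubTokenizer()
decoder_starts_with_pad = True    # HF T5 convention: decoder starts with pad
add_sentinel_to_input = False     # hypothetical setting
add_eos_to_decoder_output = True  # hypothetical setting

answer_text_ids = [tok.pad_id] if decoder_starts_with_pad else [tok.bos_id]
if add_sentinel_to_input:
    answer_text_ids += tok.text_to_ids('<extra_id_0>')
answer_text_ids += tok.text_to_ids('Paris')
if add_eos_to_decoder_output:
    answer_text_ids += [tok.eos_id]
else:
    answer_text_ids += tok.text_to_ids('<extra_id_1>')  # stand-in for T5Sentinel.END
```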