[BUG] Fix build train valid test datasets (PaddlePaddle#8823)
JunnYu authored and DesmonDay committed Sep 5, 2024
1 parent c18d7b7 commit dd31fe4
Showing 1 changed file with 3 additions and 1 deletion.
paddlenlp/data/causal_dataset.py: 3 additions & 1 deletion
@@ -146,7 +146,9 @@ def build_train_valid_test_datasets(
     # Parse the values.
     output = get_datasets_weights_and_num_samples(data_prefix, train_val_test_num_samples)
     prefixes, weights, datasets_train_valid_test_num_samples = output
-    train_num_samples, valid_num_samples, test_num_samples = map(sum, zip(*datasets_train_valid_test_num_samples))
+    # NOTE: megatron/gpt_dataset.py has been updated. When creating BlendableDataset, we will use the raw train_val_test_num_samples instead of the expanded ones.
+    # Please refer to https://github.com/NVIDIA/NeMo/blob/72f630d087d45655b1a069dc72debf01dfdbdb2d/nemo/collections/nlp/data/language_modeling/megatron/gpt_dataset.py#L74-L80 for more information
+    train_num_samples, valid_num_samples, test_num_samples = datasets_train_valid_test_num_samples

     # Build individual datasets.
     train_datasets = []
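For readers unfamiliar with the data blending path, the sketch below illustrates why the two assignments differ. It is not part of the commit: mock_weights_and_num_samples is a hypothetical stand-in for get_datasets_weights_and_num_samples, and the ~0.5% padding factor is assumed for illustration only. The point, as the NOTE in the diff explains, is that the expanded per-dataset counts overshoot the requested train_val_test_num_samples once summed, so the fix keeps the raw totals when building the blended dataset.

# Minimal, self-contained sketch (not part of the commit).
import math


def mock_weights_and_num_samples(weights, train_val_test_num_samples):
    # Hypothetical stand-in for get_datasets_weights_and_num_samples: expand the
    # requested totals into per-dataset counts, padded slightly so that weighted
    # sampling does not run out of examples.
    per_dataset = []
    for weight in weights:
        per_dataset.append(
            [int(math.ceil(weight * num * 1.005)) for num in train_val_test_num_samples]
        )
    return per_dataset


weights = [0.7, 0.3]
train_val_test_num_samples = [1000, 100, 10]
datasets_train_valid_test_num_samples = mock_weights_and_num_samples(
    weights, train_val_test_num_samples
)

# Old behavior: summing the padded per-dataset counts overshoots the requested totals.
expanded_totals = list(map(sum, zip(*datasets_train_valid_test_num_samples)))

# New behavior per the NOTE above: keep the raw requested totals.
raw_totals = train_val_test_num_samples

print(expanded_totals)  # slightly larger than [1000, 100, 10]
print(raw_totals)       # [1000, 100, 10]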
