Conversation

@TevenLeScao (Collaborator)

Re-creating the pull request at #113 so I can push directly

@TevenLeScao (Collaborator Author)

I've taken over the PR from @sbmaruf to change the behaviour of the validation argument when the same dataset is passed to both train and validation. In that case, I re-use the train/validation split that was created previously, instead of putting 100% of the dataset into the validation split (which would create overlap between train and valid).
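
A minimal sketch of that rule (my own illustration, not the PR's code: pick_valid_datasets and build_full_valid are hypothetical names, while ds_shared_with_train mirrors a variable quoted later in this thread):

    # Hedged sketch of the behaviour described above: a prefix shared with
    # --data-path re-uses the validation part of its existing split, while a
    # validation-only prefix is consumed entirely as validation.
    def pick_valid_datasets(valid_prefixes, ds_shared_with_train, build_full_valid):
        valid_datasets = []
        for prefix in valid_prefixes:
            if prefix in ds_shared_with_train:
                # same dataset was already split for training: re-use its valid split
                valid_datasets.append(ds_shared_with_train[prefix])
            else:
                # dataset only passed for validation: take 100% of it
                valid_datasets.append(build_full_valid(prefix))
        return valid_datasets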

@stas00 (Contributor) left a comment:

Thank you for working on it, Teven.

I left a few minor suggestions.

My main concern is why a feature is being added w/o tests. The included test scripts are good for someone developing a feature, but are terrible at catching a later change that breaks it down the road.

We have a test suite, let's use it. If you need help figuring out how to accomplish that, I'm here to help.

Comment on lines 63 to 71:

    print(f"split: {splits_string}")
    if train_ds:
        train_datasets.append(train_ds)
        print(f"train_ds size: {len(train_ds)}")
    if valid_ds:
        valid_datasets.append(valid_ds)
        valid_datasets_dict[prefix] = valid_ds
        print(f"valid_ds size: {len(valid_ds)}")
    if test_ds:
        test_datasets.append(test_ds)
Contributor:

Please be careful with adding direct prints, as some of these will be replicated 512 times or more. You want print_rank_0 instead.

Also, should the test_ds branch have one for consistency?
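
For context, Megatron's print_rank_0 helper is roughly the following (paraphrased, not copied from this repo): it prints only on global rank 0 when torch.distributed is initialized, so the message is emitted once rather than once per process.

    import torch

    def print_rank_0(message):
        """Print only on rank 0 so the message is not replicated across processes."""
        if torch.distributed.is_initialized():
            if torch.distributed.get_rank() == 0:
                print(message, flush=True)
        else:
            print(message, flush=True)

Note that it accepts only the message argument, which becomes relevant later in this thread.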

Contributor:

same in the rest of this PR

Collaborator Author:

Changed the prints! I'm half thinking we should replace every single print in the repo with print_rank_0 now.

Collaborator Author:

I was on the fence about the test_ds one - I didn't want to add more code for something we don't currently use. I'll try to think of a way to factor it.

print(f"prefix: {prefix} not found in {valid_datasets_dict.keys()}")
# create the ones from the arguments that are missing
train_ds, valid_ds, test_ds = _build_train_valid_test_datasets(
valid_prefixes[i], data_impl, '0,100,0',
Contributor:

should the presence of this new flag require the user to pass the normal data-path split so that the 2nd split is 0, e.g. 100,0,0 or 90,0,10?

Will this make things simpler?

@stas00 (Contributor), Oct 18, 2021:

Or I'd go even simpler and require no split argument then: if a separate validation data path is passed, do no splitting at all, i.e. data-path => Train, valid-data-path => Valid (and we can add test-data-path if needed).

Yet another possible solution:

Leave the --data-path feature as is, add --(train|valid|test)-data-path args, and make these mutually exclusive.

So it's one of the two sets:

  1. --data-path and --split pair
  2. --(train|valid|test)-data-path (3 args) and no --split and test is optional.

The originally proposed solution feels too much like a "patch".
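
A rough argparse sketch of the two mutually exclusive modes proposed above (the mutual-exclusion checks and the function name are my own illustration, not code from this PR):

    import argparse

    def parse_data_args(argv=None):
        parser = argparse.ArgumentParser()
        # mode 1: a single data path plus a train/valid/test --split string
        parser.add_argument('--data-path', nargs='*', default=None)
        parser.add_argument('--split', type=str, default=None)
        # mode 2: explicit per-split paths, no --split, test optional
        parser.add_argument('--train-data-path', nargs='*', default=None)
        parser.add_argument('--valid-data-path', nargs='*', default=None)
        parser.add_argument('--test-data-path', nargs='*', default=None)
        args = parser.parse_args(argv)

        explicit = args.train_data_path or args.valid_data_path or args.test_data_path
        if args.data_path and explicit:
            parser.error('--data-path/--split and --(train|valid|test)-data-path are mutually exclusive')
        if explicit and args.split:
            parser.error('--split is not allowed when explicit per-split data paths are given')
        return args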

Collaborator Author:

Something I feel we need with this feature is to deal with datasets that only have a train split and no validation, including OSCAR, which is why I wrote it like this. In this case you still want to be able to act on your validation mix even though you don't have a separate file to point to: for example, you want to use only one of the validation splits you created in the preceding step rather than the mix of all of them.

@TevenLeScao (Collaborator Author), Oct 18, 2021:

If you point to a file that's not already in the training set, e.g. an external validation set, then it feels natural to use it all. We can easily add another flag to split it, but I cannot see a use case for it. (edit: now I can see a use case for it - making the validation split not too big - although you could probably fix that with the valid_weights)

Contributor:

If I understand you correctly, that's why I suggested having 2 different sets of APIs:

  1. --data-path and --split pair - normal use
  2. --(train|valid|test)-data-path (3 args) and no --split and test is optional. For when you want to take full control over the splits.

I think this would make things easier to use.

but perhaps I'm not seeing some nuance here.

Collaborator Author:

It wasn't load-bearing - the code needed to be clearer anyway. The weights of valid-data-path apply, which is clear since we're also using the data from valid-data-path!

I am not sure what you call a wrong use - in the OSCAR multilingual experiments, we 1. do not have a separate validation set and 2. need to restrict the validation to only one language or set of languages. This naturally leads to passing a path that overrides the original validation mix but also overlaps with it.

@sbmaruf (Collaborator), Oct 19, 2021:

I agree with @stas00

Why are you selecting validation data from training data? You can just create data in --valid-path and take it from there. To me it seems redundant.

The implementation was:

You only take validation data from training when you don't have any data path in the --valid-data-path argument. When you are taking data from both --data-path and --valid-data-path, that's so confusing to me.

I understand this is needed for OSCAR. But I am against it because this way we cannot track which samples are selected for training and which ones for development. IMO, if OSCAR doesn't have a development set, we should create our own split and share that split for reproducibility. This is also important because we are doing comparisons across multiple datasets.

Another problem:

Let's assume this case:

TRAIN_DATASET=DATASET_0
VALID_DATASET="0.1 ${DATASET_0} 0.25 ${DATASET_1} 0.2 ${DATASET_2} 0.15 ${DATASET_3} 0.3 ${DATASET_4}"

When you run this with --split 80,20,0, then in addition to the validation data, 20% of the training data will be sampled with a relative probability of 0.1. For a normal user, this seems overly complicated.
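
To make the arithmetic concrete (my own illustration of how I read the example above, assuming the listed weights are simply normalised):

    # The weights in VALID_DATASET already sum to 1.0, so the 20% of DATASET_0
    # held out by --split 80,20,0 supplies only 10% of the validation samples,
    # with the other four datasets providing the rest.
    valid_mix = {"DATASET_0": 0.10, "DATASET_1": 0.25, "DATASET_2": 0.20,
                 "DATASET_3": 0.15, "DATASET_4": 0.30}
    total = sum(valid_mix.values())
    for name, weight in valid_mix.items():
        print(f"{name}: {weight / total:.0%} of the validation mix")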

@TevenLeScao (Collaborator Author):

Yes, this is different from your implementation - as discussed above, this only takes data from --valid-data-path and not from --data-path. We can talk about the need for a separate development set for OSCAR, but it is not the only dataset in this case, as language modeling generally doesn't need validation sets. I'd rather give the user (i.e. us!) this flexibility rather than forcing them to make their own splits.

Collaborator Author:

Also, the case you've shown does exactly what the user wants - it samples dataset_0 exactly as much as they expect, while ensuring it picks the parts that don't overlap with training. If the user feels that is too complicated, they can also just split before the processing and not have overlapping data-paths and valid-data-paths, so there's no harm.

@stas00 (Contributor), Oct 19, 2021:

I'm glad to hear that it's not just me who finds the proposed API confusing, and the fact that we are still trying to figure it out is an indication of that.

May I re-iterate a different proposal I suggested yesterday, where we have two mutually exclusive modes:

  1. --data-path and --split pair - normal use
  2. --(train|valid|test)-data-path (3 args) and no --split and test is optional. For when you want to take full control over the splits.

I think this would make things easier to use, since the 2nd approach is very explicit.

@TevenLeScao (Collaborator Author)

Yes, I also feel that we need a test - I kept the original debugging suite where it was for now, as I wasn't sure how to integrate it into the existing tests (and I wanted to launch a training tonight), but I would like to move it.

@stas00 (Contributor) commented Oct 18, 2021

  • You can run the training from the PR branch, just don't update it once started.
  • You can merge this and make issues that you'll follow through to complete things at a later time. Since we are our own and sole users, we have that flexibility.

@stas00 (Contributor) commented Oct 18, 2021

re: testing - we have 2 types of tests:

  1. does it run
  2. does it do the right thing.

the first type of test is trivial: just clone one of the existing main training tests (test_training.py) and add the new options, done (we will then refactor the repetition as tests add up).

the second group requires thinking about how to validate this; you can also leave it as a separate Issue and perhaps someone will contribute the missing part of the test.

but at the very least let's make sure the framework runs with the new features.

for example you can test the generation of the correct index files as a very simple validation that it does the right thing.
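
A rough sketch of what such a "does it run" test could look like, modelled on cloning an existing test_training.py test as suggested above. The command line is heavily abbreviated and the data paths are placeholders; only the --data-path/--valid-data-path flags and the index-file check come from this thread, everything else is an assumption rather than the repo's actual test code.

    import glob
    import subprocess
    import unittest

    class TestValidDataPath(unittest.TestCase):
        def test_training_with_valid_data_path(self):
            cmd = [
                "python", "pretrain_gpt.py",
                "--data-path", "tests/data/some_text_document",        # placeholder path
                "--valid-data-path", "tests/data/some_text_document",  # same prefix: re-use its valid split
                # ... plus the usual minimal model/optimizer/launcher args
                #     carried by the existing test_training.py tests
            ]
            subprocess.run(cmd, check=True)  # type-1 check: does it run at all
            # type-2 check suggested above: the validation index files were generated
            self.assertTrue(glob.glob("tests/data/*valid*_indexmap_*"))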

@stas00 (Contributor) commented Oct 18, 2021

And as you can see this PR breaks the test suite ;) test_training.py

           Traceback (most recent call last):
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_prefix_lm.py", line 225, in <module>
E               pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 153, in pretrain
E               = build_train_valid_test_data_iterators(
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 961, in build_train_valid_test_data_iterators
E               train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_prefix_lm.py", line 185, in train_valid_test_datasets_provider
E               train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 40, in build_train_valid_test_datasets
E               return _build_train_valid_test_datasets(data_prefix[0],
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 153, in _build_train_valid_test_datasets
E               train_dataset = build_dataset(0, 'train')
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 147, in build_dataset
E               dataset = GPTDataset(name, data_prefix,
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 189, in __init__
E               self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
E             File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 267, in _build_index_mappings
E               print_rank_0(' > only one epoch required, setting '
E           TypeError: print_rank_0() got an unexpected keyword argument 'flush'
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_0_base - ...
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_1_cl - Ru...
FAILED tests/test_training.py::MegDSTestTraining::test_training_prefix_lm_all

@TevenLeScao (Collaborator Author) commented Oct 18, 2021

Looks like I was too careless when replacing prints with print_rank_0s (fixed now); the rest is codecarbon! (although codecarbon seems to shut down gracefully)

@stas00 (Contributor) commented Oct 18, 2021

yes, CC continues to be a problem, but these particular exceptions are just noise. So as long as the test suite passes, you can ignore those for now. I already reported those to CC.

    valid_prefixes, valid_weights, valid_datasets_samples = valid_output
    for i, prefix in enumerate(valid_prefixes):
        if prefix not in ds_shared_with_train:
            print_rank_0(f"prefix: {prefix} not found in {ds_shared_with_train.keys()}")
Contributor:

Suggested change:

    print_rank_0(f"prefix: {prefix} not found in {ds_shared_with_train.keys()}")

This looks more like a debug print and won't be of any added value to the user, IMHO, and just adds to the noise.

What do you think?

Collaborator Author:

agreed, I'll remove some of the prints

@TevenLeScao (Collaborator Author)

Superseded by #97.
