Add valid data (+TVN fixes) #143

Conversation
I've taken over the PR from @sbmaruf to change the behaviour of the validation argument when the same dataset is passed to both train and validation. In this case, I re-use the pre-existing train/validation split that was created previously, instead of putting 100% of the dataset into a validation split (which would create overlap between train and valid).
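For reviewers skimming the thread, here is a minimal sketch of the intended control flow; the helper names (build_valid_datasets, build_split_fn) are placeholders, not the actual functions in gpt_dataset.py:

def build_valid_datasets(valid_prefixes, ds_shared_with_train, build_split_fn, data_impl):
    """Sketch: reuse an existing valid split when the prefix was already split for training."""
    valid_datasets = []
    for prefix in valid_prefixes:
        if prefix in ds_shared_with_train:
            # The prefix was already split via --split as part of --data-path:
            # reuse its valid split, so train and valid cannot overlap.
            valid_datasets.append(ds_shared_with_train[prefix])
        else:
            # A dataset only used for validation: take all of it (0,100,0).
            _, valid_ds, _ = build_split_fn(prefix, data_impl, '0,100,0')
            valid_datasets.append(valid_ds)
    return valid_datasets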
stas00 left a comment
Thank you for working on it, Teven.
I left a few minor suggestions.
My main concern is that a feature is being added without tests. The included test scripts are fine for someone developing the feature, but they won't catch a change that breaks it down the road.
We have a test suite, let's use it. If you need help with figuring out how to accomplish that I'm here to help.
megatron/data/gpt_dataset.py
Outdated
| print(f"split: {splits_string}") | ||
| if train_ds: | ||
| train_datasets.append(train_ds) | ||
| print(f"train_ds size: {len(train_ds)}") | ||
| if valid_ds: | ||
| valid_datasets.append(valid_ds) | ||
| valid_datasets_dict[prefix] = valid_ds | ||
| print(f"valid_ds size: {len(valid_ds)}") | ||
| if test_ds: | ||
| test_datasets.append(test_ds) |
Please be careful with adding direct prints, as some of these will be replicated 512 times or more. You want print_rank_0 instead.
Also, should the test_ds branch have one for consistency?
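For context, this is roughly what the rank-0 print helper in this codebase does, so a bare print in the data-loading path is emitted once per rank while print_rank_0 logs only once:

import torch

def print_rank_0(message):
    """Roughly the repo's helper: print only on global rank 0, falling back to
    a plain print when torch.distributed is not initialized."""
    if torch.distributed.is_initialized():
        if torch.distributed.get_rank() == 0:
            print(message, flush=True)
    else:
        print(message, flush=True)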
same in the rest of this PR
Changed the prints! I'm half thinking we should replace every single print in the repo with print_rank_0 now.
I was on the fence about the test_ds one - I didn't want to add more code for something we don't currently use. I'll try to think of a way to factor it.
| print(f"prefix: {prefix} not found in {valid_datasets_dict.keys()}") | ||
| # create the ones from the arguments that are missing | ||
| train_ds, valid_ds, test_ds = _build_train_valid_test_datasets( | ||
| valid_prefixes[i], data_impl, '0,100,0', |
Should the presence of this new flag require the user to pass the normal data-path --split with its 2nd value set to 0, e.g. 100,0,0 or 90,0,10?
Would this make things simpler?
Or I'd go even simpler and require no split argument at all: if a separate validation data path is passed, do no splitting, i.e. data-path => Train, valid-data-path => Valid (and we can add test-data-path if needed).
Yet another possible solution: leave the --data-path feature as is, add --(train|valid|test)-data-path args, and make the two mutually exclusive.
So it's one of the two sets:
- the --data-path and --split pair
- --(train|valid|test)-data-path (3 args) with no --split; test is optional.
The originally proposed solution feels too much like a "patch".
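A sketch of the argument validation this proposal implies; this is hypothetical code, not the current megatron/arguments.py:

import argparse

parser = argparse.ArgumentParser()
# Set 1: combined path plus --split (normal use).
parser.add_argument('--data-path', nargs='*')
parser.add_argument('--split', default=None)
# Set 2: explicit per-split paths, no --split, test optional.
parser.add_argument('--train-data-path', nargs='*')
parser.add_argument('--valid-data-path', nargs='*')
parser.add_argument('--test-data-path', nargs='*')
args = parser.parse_args()

uses_combined = args.data_path is not None
uses_explicit = any(p is not None for p in
                    (args.train_data_path, args.valid_data_path, args.test_data_path))
if uses_combined and uses_explicit:
    parser.error('--data-path/--split and --(train|valid|test)-data-path are mutually exclusive')
if uses_explicit and args.split is not None:
    parser.error('--split only applies to --data-path')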
Something I feel we need from this feature is to deal with datasets that only have a train split and no validation, including OSCAR, which is why I wrote it like this. In this case you still want to be able to act on your validation mix even though you don't have a separate file to point to: for example, you want to use only one of the validation splits you created in the preceding step rather than the mix of all of them.
If you point to a file that's not already in the training set, e.g. an external validation set, then it feels natural to use all of it. We can easily add another flag to split it, but I cannot see a use case for it. (edit: now I can see a use case - keeping the validation split from getting too big - although you could probably fix that with the valid weights)
If I understand you correctly, that's why I suggested having 2 different sets of APIs:
- the --data-path and --split pair - normal use
- --(train|valid|test)-data-path (3 args) with no --split; test is optional - for when you want to take full control over the splits.
I think this would make things easier to use, but perhaps I'm not seeing some nuance here.
It wasn't a bother - the code needed to be clearer anyway. The weights of valid-data-path apply, which is clear since we're also using the data from valid-data-path!
I am not sure what you mean by a wrong use - in the OSCAR multilingual experiments, we 1. do not have a separate validation set and 2. need to restrict the validation to only one language or a set of languages. This naturally leads to passing a path that overrides the original validation mix but also overlaps with it.
I agree with @stas00
Why are you selecting validation data from training data? You can just create data for --valid-data-path and take it from there. To me it seems redundant.
The implementation was: you only take validation data from training when you don't pass any data path in the --valid-data-path argument. Taking data from both --data-path and --valid-data-path is confusing to me.
I understand this is needed for OSCAR, but I am against it, because this way we cannot track which samples were selected for training and which ones for development. IMO, if OSCAR doesn't have a development set, we should create our own split and share it for reproducibility. This is also important because we are comparing across multiple datasets.
Another problem - let's assume this case:
TRAIN_DATASET=DATASET_0
VALID_DATASET="0.1 ${DATASET_0} 0.25 ${DATASET_1} 0.2 ${DATASET_2} 0.15 ${DATASET_3} 0.3 ${DATASET_4}"
When you run this with --split 80,20,0, then in addition to the validation data, 20% of the training data will be sampled with a relative probability of 0.1. For a normal user, this seems overly complicated.
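To make the arithmetic in this example concrete, here is a small illustration of how the weights normalize, assuming the same weighted blending used for --data-path also applies to the validation mix:

valid_weights = {'DATASET_0': 0.1, 'DATASET_1': 0.25, 'DATASET_2': 0.2,
                 'DATASET_3': 0.15, 'DATASET_4': 0.3}
total = sum(valid_weights.values())  # 1.0 here, so the weights are already normalized
for name, weight in valid_weights.items():
    print(f"{name}: {weight / total:.0%} of validation samples")
# With --split 80,20,0, the DATASET_0 share (10%) is drawn only from the 20%
# of DATASET_0 that was held out of training.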
Yes, this is different from your implementation - as discussed above, this only takes data from --valid-data-path and not from --data-path. We can talk about the need for a separate development set for OSCAR, but it is not the only dataset in this situation, as language modeling datasets generally don't ship with validation sets. I'd rather give the user (i.e. us!) this flexibility than force them to make their own splits.
Also, the case you've shown does exactly what the user wants - it samples DATASET_0 exactly as much as they expect, while ensuring it picks the parts that don't overlap with training. If the user feels that is too complicated, they can also just split before preprocessing and avoid overlapping data-paths and valid-data-paths, so there's no harm.
I'm glad to hear that it's not just me who finds the proposed API confusing, and the fact that we are still trying to figure it out is an indication of that.
May I re-iterate the different proposal I suggested yesterday, where we have two mutually exclusive modes:
- the --data-path and --split pair - normal use
- --(train|valid|test)-data-path (3 args) with no --split; test is optional - for when you want to take full control over the splits.
I think this would make things easier to use, since the 2nd approach is very explicit.
Yes, I also feel that we need a test - I kept the original debugging suite where it was for now, as I wasn't sure how to integrate it into the existing tests (and I wanted to launch a training run tonight), but I would like to move it.
re: testing - we have 2 types of tests:
1. The first type is trivial: just clone one of the existing main training tests (test_training.py) and add the new options - done (we will then refactor the repetition as tests add up).
2. The second type is about how to validate the behaviour. You can also leave that as a separate issue and perhaps someone will contribute the missing part of the test.
But at the very least let's make sure the framework runs with the new features. For example, you can test the generation of the correct index files as a very simple validation that it does the right thing.
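A sketch of the "trivial" flavour of test - clone an existing training test and add the new flags. The script invocation, data paths, and remaining arguments below are placeholders; a real test would reuse the helpers and fixtures already in test_training.py:

import subprocess
import sys

def test_train_with_valid_data_path(tmp_path):
    # Placeholder command line: a real test would copy the full argument list
    # from an existing training test and only add the new flags.
    cmd = [
        sys.executable, 'pretrain_gpt.py',
        '--data-path', 'tests/data/some_text_document',               # placeholder path
        '--valid-data-path', '1.0', 'tests/data/some_text_document',  # placeholder weight + path
        '--split', '90,10,0',
        # ... plus the usual minimal model/optimizer/checkpoint args
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # At minimum: the run should complete, and the index files should come from
    # the pre-existing valid split rather than a fresh 0,100,0 split.
    assert result.returncode == 0, result.stderr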
And as you can see, this PR breaks the test suite ;)
Looks like I was too careless with my replacing of print with print_rank_0.
Yes, CC continues to be a problem, but these particular exceptions are just noise. So as long as the test suite passes, you can ignore those for now. I already reported them to CC.
valid_prefixes, valid_weights, valid_datasets_samples = valid_output
for i, prefix in enumerate(valid_prefixes):
    if prefix not in ds_shared_with_train:
        print_rank_0(f"prefix: {prefix} not found in {ds_shared_with_train.keys()}")
This looks more like a debug print and won't add any value for the user, IMHO - it just adds to the noise.
What do you think?
agreed, I'll remove some of the prints
Superseded by #97.
Re-creating the pull request at #113 so I can push directly.