Add valid data (+TVN fixes) #143

Conversation
I've taken over the PR from @sbmaruf to change the behaviour of the validation argument when the same dataset is passed to both train and validation. In this case, I re-use the pre-existing train/validation split that was created previously, instead of putting 100% of the dataset into a validation split (which would create overlap between train and valid).
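For reviewers skimming the thread, here is a minimal sketch of the intended control flow; the helper names (build_valid_datasets, build_split_fn) are placeholders, not the actual functions in gpt_dataset.py:

def build_valid_datasets(valid_prefixes, ds_shared_with_train, build_split_fn, data_impl):
    """Sketch: reuse an existing valid split when the prefix was already split for training."""
    valid_datasets = []
    for prefix in valid_prefixes:
        if prefix in ds_shared_with_train:
            # The prefix was already split via --split as part of --data-path:
            # reuse its valid split, so train and valid cannot overlap.
            valid_datasets.append(ds_shared_with_train[prefix])
        else:
            # A dataset only used for validation: take all of it (0,100,0).
            _, valid_ds, _ = build_split_fn(prefix, data_impl, '0,100,0')
            valid_datasets.append(valid_ds)
    return valid_datasets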
stas00 left a comment
Thank you for working on it, Teven.
I left a few minor suggestions.
My main concern is that a feature is being added without tests. The included test scripts are fine for someone developing the feature, but they won't catch a change that breaks it down the road.
We have a test suite, let's use it. If you need help with figuring out how to accomplish that I'm here to help.
megatron/data/gpt_dataset.py
Outdated
| print(f"split: {splits_string}") | ||
| if train_ds: | ||
| train_datasets.append(train_ds) | ||
| print(f"train_ds size: {len(train_ds)}") | ||
| if valid_ds: | ||
| valid_datasets.append(valid_ds) | ||
| valid_datasets_dict[prefix] = valid_ds | ||
| print(f"valid_ds size: {len(valid_ds)}") | ||
| if test_ds: | ||
| test_datasets.append(test_ds) |
Please be careful with adding direct prints, as some of these will be replicated 512 times or more. You want print_rank_0 instead.
Also, should the test_ds branch have one for consistency?
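For context, this is roughly what the rank-0 print helper in this codebase does, so a bare print in the data-loading path is emitted once per rank while print_rank_0 logs only once:

import torch

def print_rank_0(message):
    """Roughly the repo's helper: print only on global rank 0, falling back to
    a plain print when torch.distributed is not initialized."""
    if torch.distributed.is_initialized():
        if torch.distributed.get_rank() == 0:
            print(message, flush=True)
    else:
        print(message, flush=True)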
same in the rest of this PR
Changed the prints! I'm half thinking we should replace every single print in the repo with print_rank_0 now.
I was on the fence about the test_ds one - I didn't want to add more code for something we don't currently use. I'll try to think of a way to factor it.
| print(f"prefix: {prefix} not found in {valid_datasets_dict.keys()}") | ||
| # create the ones from the arguments that are missing | ||
| train_ds, valid_ds, test_ds = _build_train_valid_test_datasets( | ||
| valid_prefixes[i], data_impl, '0,100,0', |
Should the presence of this new flag require the user to pass the normal data-path --split with its 2nd value set to 0, e.g. 100,0,0 or 90,0,10?
Would this make things simpler?
Or I'd go even simpler and require no split argument at all: if a separate validation data path is passed, do no splitting, i.e. data-path => Train, valid-data-path => Valid (and we can add test-data-path if needed).
Yet another possible solution: leave the --data-path feature as is, add --(train|valid|test)-data-path args, and make the two mutually exclusive.
So it's one of the two sets:
- the --data-path and --split pair
- --(train|valid|test)-data-path (3 args) with no --split; test is optional.
The originally proposed solution feels too much like a "patch".
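A sketch of the argument validation this proposal implies; this is hypothetical code, not the current megatron/arguments.py:

import argparse

parser = argparse.ArgumentParser()
# Set 1: combined path plus --split (normal use).
parser.add_argument('--data-path', nargs='*')
parser.add_argument('--split', default=None)
# Set 2: explicit per-split paths, no --split, test optional.
parser.add_argument('--train-data-path', nargs='*')
parser.add_argument('--valid-data-path', nargs='*')
parser.add_argument('--test-data-path', nargs='*')
args = parser.parse_args()

uses_combined = args.data_path is not None
uses_explicit = any(p is not None for p in
                    (args.train_data_path, args.valid_data_path, args.test_data_path))
if uses_combined and uses_explicit:
    parser.error('--data-path/--split and --(train|valid|test)-data-path are mutually exclusive')
if uses_explicit and args.split is not None:
    parser.error('--split only applies to --data-path')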
Something I feel we need from this feature is to deal with datasets that only have a train split and no validation, including OSCAR, which is why I wrote it like this. In this case you still want to be able to act on your validation mix even though you don't have a separate file to point to: for example, you want to use only one of the validation splits you created in the preceding step rather than the mix of all of them.
If you point to a file that's not already in the training set, e.g. an external validation set, then it feels natural to use all of it. We can easily add another flag to split it, but I cannot see a use case for it. (edit: now I can see a use case - keeping the validation split from getting too big - although you could probably fix that with the valid weights)
If I understand you correctly, that's why I suggested having 2 different sets of APIs:
- the --data-path and --split pair - normal use
- --(train|valid|test)-data-path (3 args) with no --split; test is optional - for when you want to take full control over the splits.
I think this would make things easier to use, but perhaps I'm not seeing some nuance here.
It wasn't a bother - the code needed to be clearer anyway. The weights of valid-data-path apply, which is clear since we're also using the data from valid-data-path!
I am not sure what you mean by a wrong use - in the OSCAR multilingual experiments, we 1. do not have a separate validation set and 2. need to restrict the validation to only one language or a set of languages. This naturally leads to passing a path that overrides the original validation mix but also overlaps with it.
I agree with @stas00
Why are you selecting validation data from training data? You can just create data for --valid-data-path and take it from there. To me it seems redundant.
The implementation was: you only take validation data from training when you don't pass any data path in the --valid-data-path argument. Taking data from both --data-path and --valid-data-path is confusing to me.
I understand this is needed for OSCAR, but I am against it, because this way we cannot track which samples were selected for training and which ones for development. IMO, if OSCAR doesn't have a development set, we should create our own split and share it for reproducibility. This is also important because we are comparing across multiple datasets.
Another problem - let's assume this case:
TRAIN_DATASET=DATASET_0
VALID_DATASET="0.1 ${DATASET_0} 0.25 ${DATASET_1} 0.2 ${DATASET_2} 0.15 ${DATASET_3} 0.3 ${DATASET_4}"
When you run this with --split 80,20,0, then in addition to the validation data, 20% of the training data will be sampled with a relative probability of 0.1. For a normal user, this seems overly complicated.
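To make the arithmetic in this example concrete, here is a small illustration of how the weights normalize, assuming the same weighted blending used for --data-path also applies to the validation mix:

valid_weights = {'DATASET_0': 0.1, 'DATASET_1': 0.25, 'DATASET_2': 0.2,
                 'DATASET_3': 0.15, 'DATASET_4': 0.3}
total = sum(valid_weights.values())  # 1.0 here, so the weights are already normalized
for name, weight in valid_weights.items():
    print(f"{name}: {weight / total:.0%} of validation samples")
# With --split 80,20,0, the DATASET_0 share (10%) is drawn only from the 20%
# of DATASET_0 that was held out of training.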
Yes, this is different from your implementation - as discussed above, this only takes data from --valid-data-path and not from --data-path. We can talk about the need for a separate development set for OSCAR, but it is not the only dataset in this situation, as language modeling datasets generally don't ship with validation sets. I'd rather give the user (i.e. us!) this flexibility than force them to make their own splits.
Also, the case you've shown does exactly what the user wants - it samples DATASET_0 exactly as much as they expect, while ensuring it picks the parts that don't overlap with training. If the user feels that is too complicated, they can also just split before preprocessing and avoid overlapping data-paths and valid-data-paths, so there's no harm.
I'm glad to hear that it's not just me who finds the proposed API confusing, and the fact that we are still trying to figure it out is an indication of that.
May I re-iterate the different proposal I suggested yesterday, where we have two mutually exclusive modes:
- the --data-path and --split pair - normal use
- --(train|valid|test)-data-path (3 args) with no --split; test is optional - for when you want to take full control over the splits.
I think this would make things easier to use, since the 2nd approach is very explicit.
Yes, I also feel that we need a test - I kept the original debugging suite where it was for now, as I wasn't sure how to integrate it into the existing tests (and I wanted to launch a training run tonight), but I would like to move it.
re: testing - we have 2 types of tests:
1. The first type is trivial: just clone one of the existing main training tests (test_training.py) and add the new options - done (we will then refactor the repetition as tests add up).
2. The second type is about how to validate the behaviour. You can also leave that as a separate issue and perhaps someone will contribute the missing part of the test.
But at the very least let's make sure the framework runs with the new features. For example, you can test the generation of the correct index files as a very simple validation that it does the right thing.
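A sketch of the "trivial" flavour of test - clone an existing training test and add the new flags. The script invocation, data paths, and remaining arguments below are placeholders; a real test would reuse the helpers and fixtures already in test_training.py:

import subprocess
import sys

def test_train_with_valid_data_path(tmp_path):
    # Placeholder command line: a real test would copy the full argument list
    # from an existing training test and only add the new flags.
    cmd = [
        sys.executable, 'pretrain_gpt.py',
        '--data-path', 'tests/data/some_text_document',               # placeholder path
        '--valid-data-path', '1.0', 'tests/data/some_text_document',  # placeholder weight + path
        '--split', '90,10,0',
        # ... plus the usual minimal model/optimizer/checkpoint args
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # At minimum: the run should complete, and the index files should come from
    # the pre-existing valid split rather than a fresh 0,100,0 split.
    assert result.returncode == 0, result.stderr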
And as you can see, this PR breaks the test suite ;)
Looks like I was too careless with my replacing of print with print_rank_0.
Yes, CC continues to be a problem, but these particular exceptions are just noise. So as long as the test suite passes, you can ignore those for now. I already reported them to CC.
valid_prefixes, valid_weights, valid_datasets_samples = valid_output
for i, prefix in enumerate(valid_prefixes):
    if prefix not in ds_shared_with_train:
        print_rank_0(f"prefix: {prefix} not found in {ds_shared_with_train.keys()}")
This looks more like a debug print and won't add any value for the user, IMHO - it just adds to the noise.
What do you think?
agreed, I'll remove some of the prints
Superseded by #97.
Re-creating the pull request at #113 so I can push directly.