[WIP] Enable reproducibility for distributed trainings #16907
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for investing your time to implement this PR! 😊
I have mostly small changes related to documentation and naming, but otherwise looks good 👍
EDIT: To enable support for TensorFlow models, you could use `enable_op_determinism` in the TensorFlow case.
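For reference, a minimal sketch of what the TensorFlow branch could look like (the helper name `enable_tf_determinism` is made up for illustration; `tf.config.experimental.enable_op_determinism` requires TF 2.8 or newer):

```python
import random

import numpy as np
import tensorflow as tf


def enable_tf_determinism(seed: int):
    """Hypothetical helper: seed the Python, NumPy and TF RNGs, then ask
    TensorFlow to pick deterministic op implementations (TF >= 2.8)."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    tf.config.experimental.enable_op_determinism()
```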
src/transformers/trainer_utils.py (outdated)

    torch.backends.cudnn.benchmark = False
    ...
    def set_seed(seed: int, set_seed_for_cuda: bool = True):
Related to the function name above, I'd argue that the argument here should be changed to something like `enable_determinism`. Further, I'd make the default `False`, as enabling it can cause weird errors if one uses algorithms that don't have a deterministic variant yet.
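For context, a rough sketch of what that signature could look like, with the determinism switches gated behind a default-`False` flag (illustrative only, not the code in the PR; the exact set of switches may differ):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int, enable_determinism: bool = False):
    """Seed all RNGs; optionally force deterministic CUDA/cuDNN behaviour.

    Defaulting to False avoids RuntimeErrors from ops that have no
    deterministic implementation yet.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    if enable_determinism:
        # cuBLAS needs this workspace config to be deterministic on CUDA >= 10.2.
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
```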
sgugger
left a comment
Thanks for your work on this!
src/transformers/trainer_utils.py (outdated)

    tf.config.experimental.enable_op_determinism()
    ...
    def set_seed(seed: int, enable_determinism: bool = True):
Suggested change:

    - def set_seed(seed: int, enable_determinism: bool = True):
    + def set_seed(seed: int, full_determinism: bool = False):
I like `full_determinism` a bit better. Since this is a new addition, the default should be set to `False`. That said, it does fix what one might consider a bug, so I'm not sure on this one. @LysandreJik do you have an opinion?
LysandreJik
left a comment
Thanks for working on this, that's an important feature! So as to not introduce a breaking change, and for clarity of the API, I'd personally vouch for not adding the enable_determinism flag to the set_seed method.
From the title of the method I understand it should set the seed, and that's it. I don't think it should do anything else. However, the enable_determinism_for_distributed_training method likely needs the seed to be set in order to benefit from full determinism, so I'd even push to have the set_seed method called inside the enable_determinism_for_distributed_training, adding a seed argument to that last method.
What do you think?
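Concretely, the resulting call pattern would look something like this (using `set_seed` and `enable_full_determinism`, the names the thread converges on below; both end up in `src/transformers/trainer_utils.py`):

```python
from transformers.trainer_utils import enable_full_determinism, set_seed

# Plain seeding: set_seed only seeds the RNGs, nothing else.
set_seed(42)

# Full determinism: the dedicated helper seeds internally and flips the
# deterministic switches, so a single call covers both concerns.
enable_full_determinism(42)
```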
I like this idea. I can implement it once we reach a conclusion on it; however, it is not clear to me how. Could you point me to which parts of the code I need to change, and what I should pay attention to so as not to break anything, if we decide to go for this idea?
sgugger
left a comment
Here are some pointers on what @LysandreJik suggests.
src/transformers/trainer_utils.py (outdated)

    set_seed(worker_seed)
    ...
    def enable_determinism_for_distributed_training():
The idea would be for this function to take the seed here:
Suggested change:

    - def enable_determinism_for_distributed_training():
    + def enable_full_determinism(seed: int):
and then call set_seed inside (instead of set_seed calling this function).
(Also changing the name to be a bit shorter.)
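Putting those pointers together, the helper could end up shaped roughly like this (a sketch under the assumptions discussed above; the environment variables and torch switches shown are the usual determinism knobs, not necessarily the exact final set):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int):
    """Seed the Python, NumPy and PyTorch RNGs (unchanged: it does one thing)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def enable_full_determinism(seed: int):
    """Seed everything via set_seed, then turn on the determinism switches.

    Expect a slowdown, and RuntimeErrors from ops that have no
    deterministic implementation.
    """
    set_seed(seed)

    # Environment variables consulted by CUDA/cuBLAS for deterministic kernels.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```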
src/transformers/trainer_utils.py (outdated)

    if enable_determinism:
        enable_determinism_for_distributed_training()
And this part here would then disappear; it would be the other way around.
@sgugger Thanks for the pointers and sorry for not being so clear. I would like to know in which places the new function should be called. With the latest commits, I have already addressed your pointers; now I am waiting for your feedback on where to call `enable_full_determinism`.
There can be an added flag in the `TrainingArguments`, which the `Trainer` would then use to decide whether to call `enable_full_determinism`.
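For illustration, how such a flag could be consumed once training starts (the `full_determinism` field is the one the commit history above ends up adding; the wiring inside the `Trainer` is simplified here):

```python
from transformers import TrainingArguments
from transformers.trainer_utils import enable_full_determinism, set_seed

args = TrainingArguments(output_dir="out", seed=42, full_determinism=True)

# Simplified version of what the Trainer can do with the flag at init time:
if args.full_determinism:
    enable_full_determinism(args.seed)
else:
    set_seed(args.seed)
```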
@sgugger I think I have addressed all your comments. Is there anything left to be done for this PR?
sgugger
left a comment
Thanks, just one last comment on the doc!
Co-authored-by: Sylvain Gugger <[email protected]>
Is it normal that 3 tests suddenly fail after a commit that only changes a docstring? I couldn't understand why the tests are failing.
Those tests are just flaky and have no link to your PR. Thanks again for all your work on this!
Enable reproducibility for distributed trainings (#16907)

* add seed worker and set_deterministic_seed_for_cuda function to enforce reproducability
* change function name to enable determinism, add docstrings, reproducability support for tf
* change function name to enable_determinism_for_distributed_training
* revert changes in set_seed and call set_seed within enable_full_determinism
* add one position argument for seed_worker function
* add full_determinism flag in training args and call enable_full_determinism when it is true
* add enable_full_determinism to documentation
* apply make fixup after the last commit
* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
@sgugger @hasansalimkanmaz I had a question about this PR: why is it necessary to set `CUDA_LAUNCH_BLOCKING=1`?
@alexcoca It's required to make some CUDA algorithms deterministic if the CUDA version is older than 10.2. I suppose it could be replaced by a CUDA version check somehow, only applying it if it's an old version?
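A sketch of what such a version check could look like (the helper name is made up; it assumes `torch.version.cuda` reports the CUDA toolkit version the wheel was built against):

```python
import os

import torch
from packaging import version


def configure_cuda_determinism_env():
    """Only force blocking kernel launches on CUDA < 10.2, where it is needed
    for determinism; newer toolkits can rely on CUBLAS_WORKSPACE_CONFIG."""
    cuda = torch.version.cuda  # e.g. "11.3", or None on CPU-only builds
    if cuda is not None and version.parse(cuda) < version.parse("10.2"):
        os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    else:
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
```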
@saattrupdan I would go for this approach, because running the CUDA programs in blocking (synchronous) mode will definitely slow things down beyond belief. I implemented this PR myself without it.
I experimented with training a dialogue state tracking model on the SGD corpus, starting from Google's v1.1 T5 (220M parameters). I allowed the model to train for roughly two epochs and evaluated task-oriented performance every 2k steps (max train steps was 12k). I ran 4 experiments: 2 in which I set the seed, and an additional 2 where I did roughly the same but without the blocking flag. I guess the moral of the story here is that one could:
@sgugger?
Agreed for the first one. For the second one, we could avoid overriding an existing value the user may already have set.
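If that second point is about the environment variables above, `os.environ.setdefault` would do it (a sketch, not the actual follow-up patch):

```python
import os

# Only fill in a default; a value the user already exported wins.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":16:8")
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")
```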
Yes, I agree with the above! I'm at ACL next week, but I'll try to open a small PR to address this the week after!
Thanks, @alexcoca, for noticing this and for your time.


What does this PR do?
This PR ensures reproducibility for distributed trainings by setting the seed for dataloader workers and setting environment variables for CUDA.
This PR is motivated by this issue.
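In sketch form, the two pieces the description mentions: a `worker_init_fn` that reseeds each dataloader worker, and the CUDA environment variables (simplified relative to what lands in `src/transformers/trainer_utils.py`):

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int):
    """Derive a per-worker seed from the main process RNG so that dataloader
    workers shuffle/augment data reproducibly across runs."""
    worker_seed = torch.initial_seed() % 2**32
    torch.manual_seed(worker_seed)  # the library reseeds everything here via set_seed


# Environment variables that make CUDA/cuBLAS kernels deterministic.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

# Usage: hand the worker hook (and a seeded generator) to the DataLoader.
generator = torch.Generator().manual_seed(42)
dataset = TensorDataset(torch.arange(100).float())
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=2,
                    worker_init_fn=seed_worker, generator=generator)
```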
Who can review?
@saattrupdan @sgugger I am looking forward to your feedback.