add --precision option to ketos train and ketos segtrain #453

Merged: 7 commits merged into mittagessen:master on Feb 23, 2023

Conversation

colibrisson
Contributor

Add a --precision option to ketos train and ketos segtrain to choose the numerical precision to use during training, as discussed in #451. It can be set to '32', 'bf16', '16', '16-mixed', or 'bf16-mixed'; the default is '16-mixed'.
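For illustration, a minimal sketch of how such an option could be declared in ketos' click-based CLI; the decorator placement, help text, and command body are assumptions, not the PR's actual diff:

```python
import click

@click.command()
@click.option('--precision',
              default='16-mixed',
              type=click.Choice(['32', 'bf16', '16', '16-mixed', 'bf16-mixed']),
              help='Numerical precision to use during training.')
def train(precision):
    # Placeholder body: the real command forwards the value to the trainer setup.
    click.echo(f'training with precision={precision}')

if __name__ == '__main__':
    train()
```

Once wired into the real commands, this would be invoked as e.g. `ketos train --precision bf16-mixed`.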

@mittagessen
Owner

You're running pytorch-lightning master, right? Because the values supported by the latest stable release are ('64', '32', '16', 'bf16'). I'd change them to the current values and pin PTL to >=1.9.0,<2.0.

@PonteIneptique
Contributor

As I understand it, the *-mixed values are split before being passed to the Trainer, which indeed only accepts 64, 32, bf16 and 16.

Mixed is used to configure AMP via the plugin:

```python
if 'mixed' in kwargs['precision']:
    precision = kwargs['precision'].split('-')[0]
    kwargs['precision'] = precision
    kwargs['plugins'] = [pl.plugins.precision.MixedPrecisionPlugin(precision, 'cuda')]
```

This is the behaviour as of the last stable release.

Maybe a sanity check on the device should be done though.

@colibrisson
Contributor Author

The precision argument in Trainer only sets the numerical precision: precision=16 is just half precision. To use AMP you need to provide a plugin to the Trainer object. That's why I added 16-mixed.

@colibrisson
Contributor Author

colibrisson commented Feb 23, 2023

Maybe a sanity check on the device should be done though.

You are right. Should I limit mixed precision to Ada and later GPUs? Maybe PL already has a fallback mechanism.

@PonteIneptique
Contributor

No, I actually meant it would be great to check that the device used is CUDA (in case someone does something weird such as mixed precision on CPU).

@PonteIneptique
Contributor

Adding to what I just said: actually, mixed should be the default only if you use CUDA, no?

@colibrisson
Contributor Author

colibrisson commented Feb 23, 2023

Adding to what I just said: actually, mixed should be the default only if you use CUDA, no?

You are right. I will add it.

@mittagessen
Owner

mittagessen commented Feb 23, 2023

The precision argument in Trainer only sets the numerical precision: precision=16 is just half precision. To use AMP you need to provide a plugin to the Trainer object. That's why I added 16-mixed.

Not in stable: precision=16 is enough to enable AMP (pure half precision training isn't supported). Master/2.0 changes (or will change) the behavior to what you describe. See Lightning-AI/pytorch-lightning#9956 (comment).

EDIT: Pure half precision training on master is still not possible. The semantics are explained here. There's no 16-true value.
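In other words, on the stable 1.9 line something like the following already trains with native AMP and no extra plugin (a minimal sketch assuming a CUDA device; model and data omitted):

```python
import pytorch_lightning as pl

# On PTL 1.9.x, precision=16 alone enables native automatic mixed precision
# for CUDA training; there is no separate "true" fp16 mode to select.
trainer = pl.Trainer(accelerator='gpu', devices=1, precision=16, max_epochs=1)
```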

@mittagessen
Owner

By the way, mixed precision also works on CPU, so it can be left enabled without CUDA as well. The question is whether other accelerators like MPS support it, so it might be best to filter it out for any device that isn't cuda/cpu.
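A sketch of that filter, assuming the kwargs dict passed to the Trainer as in the snippet above (the function name and device string are illustrative, not kraken's actual code):

```python
def sanitize_precision(kwargs: dict, device: str) -> dict:
    # Fall back to full precision on accelerators other than CUDA or CPU
    # (e.g. MPS), where mixed-precision support is unclear.
    if 'mixed' in str(kwargs.get('precision', '')) and not device.startswith(('cuda', 'cpu')):
        kwargs['precision'] = '32'
    return kwargs
```

For example, `sanitize_precision({'precision': '16-mixed'}, 'mps')` would drop back to '32' while leaving cuda/cpu setups untouched.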

@colibrisson
Contributor Author

colibrisson commented Feb 23, 2023

I know, but the semantics you are referring to are only implemented in master: Lightning-AI/pytorch-lightning#16783 (comment). With PL<=1.9, if you set precision=16, CUDA will issue the following warning:

Using 16bit None Automatic Mixed Precision (AMP)

It sounds like true half-precision to me.

@colibrisson
Contributor Author

As soon as PL 2.0 gets released, we can get rid of:

```python
if 'mixed' in kwargs['precision']:
    precision = kwargs['precision'].split('-')[0]
    kwargs['precision'] = precision
    kwargs['plugins'] = [pl.plugins.precision.MixedPrecisionPlugin(precision, 'cuda')]
```
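Under the 2.0 semantics referenced above, the mixed values could then be passed straight through (a sketch, not verified against a released 2.0):

```python
import pytorch_lightning as pl

# With the new precision semantics, strings like '16-mixed' or 'bf16-mixed'
# are accepted directly, so no manual MixedPrecisionPlugin is required.
trainer = pl.Trainer(accelerator='gpu', devices=1, precision='16-mixed')
```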

@mittagessen
Owner

mittagessen commented Feb 23, 2023

As I said, there's no true 16bit precision training in PTL, neither stable nor master. The plugin is completely unnecessary.

@colibrisson
Contributor Author

As I said, there's no true 16bit precision training in PTL, neither stable nor master. The plugin is completely unnecessary.

So why does CUDA say "Using 16bit None Automatic Mixed Precision (AMP)"?

@PonteIneptique
Contributor

The blame for this specific print could be https://github.com/Lightning-AI/lightning/blame/5fafe10a2598bb455aa387f0f123b328b9be7177/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L745

We used to have to provide an AMP mode in Lightning, I think: https://pytorch-lightning.readthedocs.io/en/1.8.1/common/trainer.html#amp-backend

Try setting Trainer(amp_backend="native") just to see if this is the issue :)
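For instance (a sketch; the amp_backend argument exists on the 1.x Trainer only, and this is purely to see whether the None in the message goes away):

```python
import pytorch_lightning as pl

# Explicitly select the native AMP backend alongside 16-bit precision.
trainer = pl.Trainer(accelerator='gpu', devices=1, precision=16, amp_backend='native')
```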

@mittagessen
Owner

mittagessen commented Feb 23, 2023

The format string is f"Using 16bit {self._amp_type_flag} Automatic Mixed Precision (AMP)". The None refers to the AMP implementation flag that can optionally be given to the trainer (apex or native). It defaults to native if none is given. It isn't a warning, just an info message.

@colibrisson
Contributor Author

My bad, I thought it was a CUDA warning.

@colibrisson
Contributor Author

Any suggestions?

@mittagessen
Owner

If you could add it to the pretraining command as well, I'd merge it today.

@mittagessen
Owner

Thanks!

@mittagessen mittagessen merged commit 50d7860 into mittagessen:master Feb 23, 2023