
Add BF16 stochastic rounding option for optimizers #1124

Merged

gau-nernst merged 7 commits into pytorch:main from bf16_optim_sr on Oct 23, 2024

Conversation

gau-nernst (Collaborator) commented:

Stochastic rounding for BF16 weight

BF16 only has around 3 decimal digits of precision. This means that if a weight update is smaller than roughly 1e-3 of the weight's magnitude, the weight will not change at all (under round-to-nearest). This is highly problematic for full BF16 training, where we don't keep an FP32 copy of the model weights.

Note that our optimizer step calculations are always done in FP32 to ensure accurate results. The "underflow" only happens when we copy the new weight value (in FP32) into the existing BF16 weight. One way to combat this problem is to perform stochastic rounding when casting FP32 -> BF16.
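A quick way to see the underflow (plain PyTorch, nothing specific to this PR):

```python
import torch

w = torch.tensor(1.0, dtype=torch.bfloat16)
update = torch.tensor(1e-3, dtype=torch.bfloat16)
print(w + update)        # tensor(1., dtype=torch.bfloat16) -- the update is rounded away
print(w.float() + 1e-3)  # tensor(1.0010) -- an FP32 master weight would keep it
```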

  • In stochastic rounding, we round up with probability (x - round_down(x)) / (round_up(x) - round_down(x)), and round down otherwise.
  • It follows that successive weight updates with stochastic rounding correctly approximate high-precision weight updates in expectation.
  • Since BF16 is simply a truncation of FP32, there is an efficient implementation of FP32->BF16 stochastic rounding (the same is not true for FP32->FP16); see the sketch after this list.
  • A more detailed discussion can be found at https://arxiv.org/abs/2010.06192. llm.c also implements this approach.
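For intuition, here is a minimal sketch of the bit trick (illustrative only, not the exact kernel used in this PR): because the upper 16 bits of an FP32 value are exactly its BF16 representation, adding a uniformly random 16-bit integer to the low bits and then truncating reproduces the rounding probabilities described above.

```python
import torch

def fp32_to_bf16_stochastic_round(x: torch.Tensor) -> torch.Tensor:
    # illustrative sketch: stochastic FP32 -> BF16 rounding via the truncation trick
    bits = x.view(torch.int32)  # reinterpret the FP32 bit pattern as int32
    # adding uniform noise in [0, 2^16) carries into the upper 16 bits with
    # probability (x - round_down(x)) / (round_up(x) - round_down(x))
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    bits = (bits + noise) & -65536  # truncate: zero the low 16 bits (mask 0xFFFF0000)
    return bits.view(torch.float32).to(torch.bfloat16)
```

In the prototype, this is exposed through the bf16_stochastic_round flag on the optimizer: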
```python
# a clone of torch.optim.AdamW with extra features
from torchao.prototype.low_bit_optim import _AdamW

model = ...
model_bf16 = model.bfloat16()
optim = _AdamW(model_bf16.parameters(), bf16_stochastic_round=True)
```

All of the low-bit optimizers mentioned above also support the bf16_stochastic_round flag. Note that this flag only applies to BF16 weights.
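For example, with the 8-bit optimizer (assuming AdamW8bit from the same prototype module; the flag is passed the same way):

```python
from torchao.prototype.low_bit_optim import AdamW8bit

# 8-bit optimizer states + stochastic rounding for the BF16 weights
optim = AdamW8bit(model_bf16.parameters(), bf16_stochastic_round=True)
```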

Experimental results

I purposely use a small LR (1e-5) to exaggerate the problem.

```shell
python benchmarks/quantized_training/pretrain_llama2.py --seed 2024 --bf16_model --compile --lr 1e-5  # full BF16
python benchmarks/quantized_training/pretrain_llama2.py --seed 2024 --bf16_model --compile --lr 1e-5 --optim_kwargs '{"bf16_stochastic_round":true}'  # with stochastic rounding
python benchmarks/quantized_training/pretrain_llama2.py --seed 2024 --bf16_amp --compile --lr 1e-5  # BF16 AMP (FP32 weight)
```

[image: training loss curves for full BF16, BF16 with stochastic rounding, and BF16 AMP]

BF16 with stochastic rounding matches the BF16 AMP loss curve, while having the same memory footprint and speed as plain full-BF16 training (BF16 AMP is slower due to AMP overhead).


pytorch-bot bot commented Oct 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1124

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 935d198 with merge base 85ec209:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Oct 21, 2024
@msaroufim (Member) left a comment:

LGTM. Wanna start moving low-bit optim out of prototype? As far as I can tell you've been keeping BC guarantees. Might just need to forward-fix with Hugging Face transformers, or keep the calling functions in prototype but make them throw a warning and call the non-prototype code.

@gau-nernst (Collaborator, Author) commented:

Once CI is green, can I merge? I just changed the llm.c ref to a permalink instead of pointing to the main branch.

> wanna start moving low bit optim out of prototype

Sounds good. We can re-import the optim under torchao.prototype.low_bit_optim (and raise a warning) for BC like you suggested.

Should we also take this chance to rename the folder to just optim, since I added some non-low-bit features: CPU offload, support for tensor subclass params (for quantized training), and this BF16 stochastic rounding? 😄

@gau-nernst gau-nernst merged commit a31e15d into pytorch:main Oct 23, 2024
17 checks passed
@gau-nernst gau-nernst deleted the bf16_optim_sr branch October 23, 2024 01:26
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request on Dec 9, 2024:
add SingletonLogger with custom formatting options including file name: line_number (pytorch#1124)

* add SingletonLogger with custom formatting options
* ruff formatting
Labels: CLA Signed
3 participants