[low-bit optim] Upcast everything to FP32 for internal calculations #1068
Fixes #1067.
Previously, it seems that torch.compile did not check for dtype mismatches when a tensor subclass was involved (e.g. `tensor_subclass_fp32.lerp(plain_tensor_bf16, weight)`). Now it does and raises an error. To fix it, I simply cast everything to FP32.

The dtype mismatch comes from the fact that my tensor subclasses for the optim state have always used an FP32 appearance dtype, even when the param is BF16. This results in FP32 calculations, which is correct, though not originally intentional. Now I have made it explicit and intentional. This also means the BF16 param + BF16 optim state combination is now more accurate.
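To illustrate the intent, here is a minimal sketch (not the PR's exact code) of what "upcast everything to FP32 for internal calculations" looks like for a single-tensor Adam step. The function name and hyperparameter arguments (`lr`, `beta1`, `beta2`, `eps`) are assumptions for illustration, not this repo's actual signature.

```python
import torch

def single_param_adam_step_sketch(p, grad, exp_avg, exp_avg_sq, step,
                                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Upcast everything to FP32 before the lerp/EMA math. This avoids the
    # dtype mismatch torch.compile now flags when the optim-state subclass
    # exposes FP32 while the param/grad are BF16, and it makes the FP32
    # internal arithmetic explicit.
    grad_f32 = grad.float()
    new_exp_avg = exp_avg.float().lerp(grad_f32, 1 - beta1)
    new_exp_avg_sq = exp_avg_sq.float().lerp(grad_f32.square(), 1 - beta2)

    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    denom = (new_exp_avg_sq / bias_correction2).sqrt() + eps
    update = (new_exp_avg / bias_correction1) / denom

    # Write back: the optim-state subclass handles (re-)quantization in
    # copy_(), while the param update is cast back to the param's own dtype
    # (e.g. BF16) only at the very end.
    exp_avg.copy_(new_exp_avg)
    exp_avg_sq.copy_(new_exp_avg_sq)
    p.add_(update.to(p.dtype), alpha=-lr)
```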
If I have time, I will re-run some of the benchmarks to make sure things are alright.