
Enable CPU Offload for Intel GPU #1324

Merged 11 commits on Nov 26, 2024

Conversation

dbyoung18
Contributor

Background

The current CPU offload in torchao only supports the CUDA backend. We would like to add support for Intel GPU via the device option "xpu".

Details

  • add "device" attribute to CPUOffloadOptimizer, default setting to "cuda"
  • enhance and verify UT test_optim_cpu_offload_correctness & test_optim_cpu_offload_save_load pass on Intel GPU
  • add "device" argument to benchmark_low_bit_adam.py. Users can use "--device xpu" to benchmark CPU Offload on Intel GPU. Currently it supports both full BF16 and BF16 AMP training w/ eager and compiled mode. Verified workloads on Intel GPU achieve memory saving and interleaving as expected as the description in reference PR:ao#584


pytorch-bot bot commented Nov 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1324

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 03ac00f with merge base 478d15b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 22, 2024
@dbyoung18 dbyoung18 marked this pull request as draft November 22, 2024 06:02
@dbyoung18 dbyoung18 changed the title Enable CPU Offload for Intel GPU [WIP] Enable CPU Offload for Intel GPU Nov 22, 2024
@dbyoung18 dbyoung18 changed the title [WIP] Enable CPU Offload for Intel GPU Enable CPU Offload for Intel GPU Nov 22, 2024
@dbyoung18 dbyoung18 marked this pull request as ready for review November 22, 2024 07:07
Collaborator

@gau-nernst gau-nernst left a comment


Thanks for the feature addition! Hopefully once the device-agnostic API support arrives, we can eliminate the if-else checks 😆
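To illustrate the kind of per-backend branching being referred to (the function and names below are illustrative, not the exact torchao code), and what a device-agnostic accelerator API would collapse it to:

```python
import torch

def current_stream(device_type: str):
    # Explicit per-backend branching: the pattern this PR extends to cover XPU.
    if device_type == "cuda":
        return torch.cuda.current_stream()
    elif device_type == "xpu":
        return torch.xpu.current_stream()
    raise ValueError(f"unsupported device: {device_type}")

# With a device-agnostic API, the same lookup reduces to roughly
#   torch.get_device_module(device_type).current_stream()
# on recent PyTorch versions.
```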

Review threads:
  • benchmarks/benchmark_low_bit_adam.py (3 threads)
  • torchao/prototype/low_bit_optim/cpu_offload.py
  • test/prototype/test_low_bit_optim.py
@gau-nernst gau-nernst added the topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories) label Nov 23, 2024
@gau-nernst
Collaborator

gau-nernst commented Nov 24, 2024

@dbyoung18 Can you run ruff format and push the formatted code? CUDA nightly is failing because bitsandbytes calls triton.ops (I think later versions of triton don't have triton.ops anymore, see bitsandbytes-foundation/bitsandbytes#1413). It's not related to this PR, but I'm not sure we can merge until that is fixed 😢. I think other PRs will be affected too.

Otherwise, everything else looks good already!

@dbyoung18
Contributor Author


Done with ruff format. Hope the bnb issue can be resolved soon. Thanks again for your review and quick feedback. :)

@gau-nernst
Collaborator

@dbyoung18 Can you merge from main? #1343 should fix the bnb issue.

Also, can you update the doc here? https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload

After that we are good to merge 😃

@dbyoung18
Contributor Author


Done for both. We plan to gradually support torchao and PyTorch core on Intel GPU. This PR covers CPU offload only, and I will look into the remaining low-bit optimizers as a next step. Since we are also in the process of upstreaming the FlashAttention backend to PyTorch core (targeting v2.6 or v2.7), I would like to add benchmark data to the README once that is ready. For now, I have only modified the README so that the CPU offload section covers XPU. Thanks for the review, and I look forward to making further contributions soon. 😃

@gau-nernst
Collaborator

Sounds good! The low-bit optimizers rely entirely on the tensor subclass + torch.compile() stack, so as long as there is a triton build that supports the XPU backend, they should work out of the box!
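For context, a minimal sketch of what that out-of-the-box usage would look like, following the low_bit_optim README; the "xpu" placement is the assumption here, contingent on a triton build with XPU support as noted above:

```python
# Sketch: AdamW8bit is one of the 8-bit optimizers from torchao/prototype/low_bit_optim.
# Its quantized state is held in tensor subclasses and the step is compiled with
# torch.compile, so it only needs a working triton backend for the target device.
import torch
from torchao.prototype.low_bit_optim import AdamW8bit

model = torch.nn.Linear(1024, 1024).to("xpu")
optim = AdamW8bit(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 1024, device="xpu")).sum()
loss.backward()
optim.step()
optim.zero_grad()
```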

@msaroufim msaroufim self-requested a review November 26, 2024 03:21
@msaroufim msaroufim merged commit 6ff3904 into pytorch:main Nov 26, 2024
18 checks passed
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024