[WIP] functional autograd + compiled autograd #139098
base: gh/zou3519/1081/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139098
Note: Links to docs will display an error until the docs builds have been completed.
❌ 36 New Failures as of commit 368259d with merge base 5c88a9f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Stack from ghstack (oldest at bottom):
This commit refactors autograd so that nodes can be called in a
functional way. Furthermore, it refactors compiled autograd to use
the new functional autograd, without any behavior changes.
This is on the way to getting compiled autograd to stop tracing into
autograd nodes when it constructs an FX graph out of the autograd graph.
We also implement some very basic support for that, which can be toggled
via `old_inline_behavior=False` in compiled_autograd.py.

Functional autograd works like the following:
- All torch::autograd::Node must define a `retrieve_saved(SwapSavedVariables) -> ivalue_list` API. This function takes compiled autograd's SwapSavedVariables and packs the state that is relevant to the current Node into an ivalue_list.
- All torch::autograd::Node must define a `get_functional() -> std::function`. This returns a new stateless function that accepts the gradients and saved values as an ivalue_list and returns new gradients (see the sketch just after this list).
- We developed a mechanism to bind arbitrary C++ functions that take ivalue_list to Python. This is really similar to how we bind custom ops to Python and was done in consideration of the Windows symbol limit (otherwise, we'd be binding one symbol per Node into Python). A hypothetical sketch of this idea appears further below, after the gist links.
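
To make the first two bullets concrete, here is a small illustrative sketch rather than code from this PR. It assumes `ivalue_list` is a `std::vector<c10::IValue>`, uses a two-argument `(grads, saved)` signature for the stateless function, and elides the `SwapSavedVariables` plumbing; the class name `MulBackwardSketch` is invented.

```cpp
// Illustrative sketch only; names and signatures here are assumptions, not the
// actual PR implementation. Assumes ivalue_list is std::vector<c10::IValue>.
#include <torch/torch.h>

#include <functional>
#include <iostream>
#include <vector>

using ivalue_list = std::vector<c10::IValue>;

// Simplified stand-in for an autograd node whose backward multiplies the
// incoming gradient by a tensor saved during the forward pass.
struct MulBackwardSketch {
  at::Tensor saved_other;  // state captured at forward time

  // Pack everything the backward computation needs into an ivalue_list.
  // (The real API threads compiled autograd's SwapSavedVariables through here.)
  ivalue_list retrieve_saved() const {
    return {c10::IValue(saved_other)};
  }

  // Return a stateless function: it reads gradients and saved values only from
  // its arguments, never from the node object itself.
  std::function<ivalue_list(const ivalue_list&, const ivalue_list&)>
  get_functional() const {
    return [](const ivalue_list& grads, const ivalue_list& saved) -> ivalue_list {
      at::Tensor grad_output = grads.at(0).toTensor();
      at::Tensor other = saved.at(0).toTensor();
      // d(x * other)/dx = other, so grad_input = grad_output * other.
      return {c10::IValue(grad_output * other)};
    };
  }
};

int main() {
  MulBackwardSketch node{torch::tensor({2.0, 3.0})};

  // Instead of tracing into the node, compiled autograd could record a call to
  // this stateless function with (grads, saved) as explicit graph inputs.
  auto fn = node.get_functional();
  ivalue_list saved = node.retrieve_saved();
  ivalue_list grads = {c10::IValue(torch::ones({2}))};
  ivalue_list new_grads = fn(grads, saved);

  std::cout << new_grads.at(0).toTensor() << "\n";  // expect [2., 3.]
}
```

The key property is that the returned function closes over nothing on the node itself, so compiled autograd can record a call to it with the packed saved values as explicit inputs rather than tracing through the node's `apply()`.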
Here's an example of the new autograd generated code: https://gist.github.com/zou3519/09bb98bb0f11445bc3da063201adb818
Here's an example of the FX graph compiled autograd produces (with `old_inline_behavior=False`): https://gist.github.com/zou3519/43e8106176d15d623e1377850f585c97
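
For the binding mechanism mentioned in the third bullet above, the PR only describes the motivation (one bound symbol instead of one per Node). The sketch below is a hypothetical illustration of that general shape, with invented names and a single-ivalue_list calling convention: all stateless functions live in a registry, and only one dispatch entry point would ever need to be exposed to Python.

```cpp
// Hypothetical illustration of the "one symbol, many functions" idea; none of
// these names come from the PR. Each Node's stateless functional is stored in
// a registry keyed by name, and Python would only need a single bound entry
// point that dispatches by name.
#include <ATen/core/ivalue.h>

#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

using ivalue_list = std::vector<c10::IValue>;
using functional_fn = std::function<ivalue_list(const ivalue_list&)>;

// Registry of stateless backward functions, keyed by a per-Node name.
std::unordered_map<std::string, functional_fn>& registry() {
  static std::unordered_map<std::string, functional_fn> r;
  return r;
}

void register_functional(const std::string& name, functional_fn fn) {
  registry().emplace(name, std::move(fn));
}

// The single entry point that would be bound to Python (e.g. via pybind11),
// instead of one binding per Node subclass.
ivalue_list call_functional(const std::string& name, const ivalue_list& args) {
  auto it = registry().find(name);
  if (it == registry().end()) {
    throw std::runtime_error("unknown functional: " + name);
  }
  return it->second(args);
}
```

Registering a per-Node name once and binding only `call_functional` keeps the number of exported symbols constant, which is the Windows symbol-limit concern the bullet refers to.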
cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @voznesenskym @penguinwu @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @rec @xmfan