
Switching to Torch 2.0 by default. #1922

Merged 16 commits on May 17, 2023
Conversation

jkulhanek
Contributor

This PR drops support for torch 1.12 and adds support for torch 2.0. Can we open a discussion on whether this is a good idea?

Note: we can try torch 2.0 compile to see if there are any speed improvements to be gained.

@SauravMaheshkar added the enhancement, speedup, dependencies, and python labels on May 15, 2023
LRScheduler,
)
except ImportError:
# Backward compatibility for PyTorch 1.x
jkulhanek (Contributor Author)

This offers backward compatibility for PyTorch 1.x. Should we keep it here or drop it?

Contributor

Let's keep it for now, but print a warning telling the user that they should update to PyTorch 2.0.
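A minimal sketch of what this fallback could look like, assuming the `try`/`except` wraps the `lr_scheduler` import; the warning text here is illustrative, not nerfstudio's actual wording:

```python
import warnings

try:
    # PyTorch >= 2.0 exposes LRScheduler as a public name.
    from torch.optim.lr_scheduler import LRScheduler
except ImportError:
    # Backward compatibility for PyTorch 1.x, where only the
    # underscore-prefixed class exists.
    from torch.optim.lr_scheduler import _LRScheduler as LRScheduler

    warnings.warn(
        "Support for PyTorch 1.x is deprecated; please upgrade to PyTorch 2.0.",
        DeprecationWarning,
    )
```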

@SauravMaheshkar SauravMaheshkar requested a review from tancik May 16, 2023 08:48
@igozali
Copy link

igozali commented May 16, 2023

Curious if this PR would include support for CUDA 12 as well?

@jkulhanek
Contributor Author
Contributor Author

Sure. For now only if you compile PyTorch from source: https://discuss.pytorch.org/t/pytorch-for-cuda-12/169447

@tancik
Contributor
Contributor

tancik commented May 16, 2023

> Sure. For now only if you compile PyTorch from source: https://discuss.pytorch.org/t/pytorch-for-cuda-12/169447

Not sure if tiny-cuda-nn and nerfacc would work, though.

@tancik
Contributor
Contributor

tancik commented May 16, 2023

The sphinx error is /home/docs/checkouts/readthedocs.org/user_builds/plenoptix-nerfstudio/checkouts/1922/docs/quickstart/installation.md:199: WARNING: 'myst' reference target not found: tiny-cuda-syntax-error

@jkulhanek force-pushed the jkulhanek/switch-to-newer-torch branch from a1f88d4 to 40775c5 on May 16, 2023 18:38
@tancik
Contributor
Contributor

tancik commented May 16, 2023

There appears to be an issue with instant-ngp: it runs much slower, but the quality is equivalent. Maybe @liruilong940607 has an idea?

x-axis seconds: [plot]

x-axis step: [plot]

@liruilong940607
Contributor
Contributor

Interesting, is this with torch2.0 + torch.compile() or simply switching to torch2.0?

@tancik
Contributor
Contributor

tancik commented May 16, 2023

> Interesting, is this with torch2.0 + torch.compile() or simply switching to torch2.0?

Just switching to torch 2.0

@jkulhanek
Contributor Author
Contributor Author

I am installing a fresh environment to test it.

@jkulhanek
Contributor Author
Contributor Author

torch.compile breaks inside the @time_function wrapper.
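For reference, one common way to make a timing decorator coexist with torch.compile is to keep it transparent via functools.wraps (so Dynamo sees the wrapped function's metadata) and to synchronize CUDA around the call so GPU timings are meaningful. This is a hypothetical sketch, not nerfstudio's actual `time_function`:

```python
import functools
import time

import torch


def time_function(fn):
    """Hypothetical timing wrapper; functools.wraps preserves the wrapped
    function's metadata so tracing tools can see through the decorator."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # flush pending GPU work before timing
        start = time.perf_counter()
        out = fn(*args, **kwargs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for the GPU work we just launched
        wrapper.last_elapsed = time.perf_counter() - start
        return out

    wrapper.last_elapsed = None
    return wrapper


@time_function
def add(a, b):
    return a + b
```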

@liruilong940607
Contributor
Contributor

liruilong940607 commented May 17, 2023

The poster scene I was testing on also has the distortion parameters. Which scene were you running on? I can also test that out.

@liruilong940607
Contributor
Contributor

It is possible that the JIT of that function makes the timing weird. What if you just disable the JIT (and don't use torch.compile either)? I'm not familiar with either, but at least we would know whether it has something to do with them.

@tancik
Contributor
Contributor

tancik commented May 17, 2023

> The poster scene I was testing on also has the distortion parameters. Which scene were you running on? I can also test that out.

https://data.nerf.studio/nerfstudio/floating-tree.zip

@liruilong940607
Contributor
Contributor

> > The poster scene I was testing on also has the distortion parameters. Which scene were you running on? I can also test that out.
>
> https://data.nerf.studio/nerfstudio/floating-tree.zip

Ok, I can reproduce this slowness with floating-tree. The GPU stats are very different between the two runs.

[screenshots: GPU stats for the two runs]

@tancik
Contributor
Contributor

tancik commented May 17, 2023

I did some rough timings for 2K iters

| model | opt method | 1.13 time (sec) | 2.0 time (sec) |
| --- | --- | --- | --- |
| nerfacto | jit script | 66 | 73 |
| nerfacto | compile | - | 74 |
| nerfacto | none | 64 | 71 |
| ingp | jit script | 66 | 🐢 350 iters after 60 sec |
| ingp | compile | - | 🐢 10 iters after 60 sec |
| ingp | none | 58 | 60 |

Interestingly, this suggests that we shouldn't use any JIT, which is odd since adding JIT was a big speed boost at the time. Maybe it is because my training is CPU-limited? Can someone else test how the speed compares when not using JIT?

@jkulhanek
Contributor Author
Contributor Author

jkulhanek commented May 17, 2023

I have tested 2K iters on A100, CUDA 11.8, PyTorch 2.0.1:

| configuration | train time |
| --- | --- |
| torch.jit.script | 300s |
| eager mode | 105s |
| torch.compile | 490s |
| torch.compile(dynamic=True) | 101s |
| torch.compile(dynamic=True, mode="reduce-overhead") | 93s |

The initial compilation takes very long, though.
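For readers following along, the configurations in the table correspond to calls like the sketch below. The tiny module is a stand-in for a nerfstudio model component, and `backend="eager"` is used here only so the sketch runs without a GPU or C++ toolchain; the timings above used the default inductor backend:

```python
import torch

# Stand-in module; the real benchmarks compiled nerfstudio model components.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 3),
)

# Variants benchmarked above (default inductor backend, on CUDA):
#   torch.compile(model)
#   torch.compile(model, dynamic=True)
#   torch.compile(model, dynamic=True, mode="reduce-overhead")
compiled = torch.compile(model, dynamic=True, backend="eager")

x = torch.randn(1024, 16)
y = compiled(x)  # the first call triggers compilation; later calls reuse it
```

With `dynamic=True`, Dynamo compiles a shape-generic graph instead of recompiling for every new ray-batch size, which is why it helps here where batch sizes vary between iterations.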

@jkulhanek
Contributor Author
Contributor Author

jkulhanek commented May 17, 2023

Actually for me torch.compile is the fastest. But the compilation takes a long time at the beginning:
[screenshot: training curve with compilation overhead]

...but first iterations:
[screenshot: first-iteration timings]

@tancik
Contributor
Contributor

tancik commented May 17, 2023

So is eager the way to go for now? @jkulhanek were these tests with the ingp and the floating tree dataset?

@jkulhanek
Contributor Author
Contributor Author

@tancik, can you please try @torch.compile(dynamic=True)? This is the fastest for me now. I will still play with the configuration a bit, but I believe this could be it.

@jkulhanek
Contributor Author
Contributor Author

> So is eager the way to go for now? @jkulhanek were these tests with the ingp and the floating tree dataset?

Yes, ingp with the floating-tree dataset. I wonder whether the A100 is really that much slower than the 4090, or what the problem is.

@jkulhanek
Contributor Author
Contributor Author

jkulhanek commented May 17, 2023

Alternatively, there is also mode="reduce-overhead". I would need to do more precise profiling; there is a tradeoff between spending more time on compilation at the beginning of training and having slower iterations.

I suggest we stick with torch.compile(dynamic=True, mode="reduce-overhead") for now (see the table).

@liruilong940607
Contributor
Contributor

I can confirm that torch.compile(dynamic=True, mode="reduce-overhead") is the best option:

[screenshot: comparison plot]

Additionally, I tested replacing radial_and_tangential_undistort() with nerfacc.cameras.opencv_lens_undistortion(), which fuses all the computation into a single CUDA kernel. It seems torch.compile(dynamic=True, mode="reduce-overhead") gives similar speed to nerfacc's explicit fusion, with only a bit of compilation overhead at the start (nerfacc needs compiling too, but that doesn't show here):

[screenshot: comparison plot]

Tested with INGP, TITAN RTX, CUDA 11.7, 5000 iters, floating-tree scene.
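For context on what is being fused: radial undistortion is typically solved by a small fixed-point iteration, so eager mode launches many tiny kernels per step, which is exactly what torch.compile or nerfacc's fused kernel collapses. A minimal sketch with only two radial coefficients (the real nerfstudio function also handles tangential terms):

```python
import torch


def radial_undistort(xd, yd, k1, k2, iters=10):
    """Fixed-point iteration: find (x, y) whose radial distortion maps
    back to the observed distorted coordinates (xd, yd)."""
    x, y = xd.clone(), yd.clone()
    for _ in range(iters):
        r2 = x * x + y * y
        d = 1.0 + r2 * (k1 + r2 * k2)  # radial distortion factor
        # Each of these small elementwise ops is a separate CUDA kernel in
        # eager mode; compilation can fuse the whole loop body.
        x = xd / d
        y = yd / d
    return x, y
```

With zero distortion coefficients the iteration is the identity, which makes the function easy to sanity-check.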

@jkulhanek
Contributor Author
Contributor Author

Cool! @liruilong940607, can you please also try with an additional backend="eager"? That setup is the fastest for me.

@liruilong940607
Contributor
Contributor

liruilong940607 commented May 17, 2023

torch.compile(dynamic=True, mode="reduce-overhead", backend="eager") is not that good for me (the compile-eager line in the plot).

[screenshots: training curves including the compile-eager line]

@tancik
Contributor
Contributor

tancik commented May 17, 2023

Here are some results on nerfacto:
[plot: nerfacto training curves]

- @torch.compile(dynamic=True, mode="reduce-overhead", backend="eager")
- @torch.compile(dynamic=True, mode="reduce-overhead")
- @torch.compile(dynamic=True, backend="eager")
- @torch.compile(dynamic=False, mode="reduce-overhead", backend="eager")
- main branch, pytorch 1.13

They are all basically the same, with the exception of @torch.compile(dynamic=True, mode="reduce-overhead"), which is worse.

@liruilong940607
Contributor
Contributor

Oh, I didn't know torch 1.13 also supports torch.compile. Isn't it a feature introduced in torch 2.0?

@tancik
Contributor
Contributor

tancik commented May 17, 2023

> Oh, I didn't know torch 1.13 also supports torch.compile. Isn't it a feature introduced in torch 2.0?

It doesn't; I'm just plotting the main branch for reference (it uses @torch.jit.script).

@tancik
Contributor
Contributor

tancik commented May 17, 2023

Here are the same plots for ingp

[plot: ingp training curves]

In this case, @torch.compile(dynamic=True, mode="reduce-overhead") starts off slow but ends up being the fastest.

@liruilong940607
Contributor
Contributor

I see. Does this mean torch.compile behaves basically the same as jit.script for static-shape input, and they only behave differently for dynamic shapes (which is when all these arguments start to matter)?

I'm curious what the logic is behind the different choices. Some toy examples might be helpful. (Or is there a tutorial somewhere that explains these things?)

It seems like we are ending up in a situation where different models (or even data?) need different optimization strategies, which kind of makes sense to me.

@jkulhanek
Contributor Author
Contributor Author

It would be cool if we could then torch.compile the whole model to see if we can speed it up. Currently, it breaks in several places for nerfacto and other models. I expected the compilation to take much longer, but if it is too slow we can cache the compiled model to disk (the API is currently quite experimental: https://pytorch.org/get-started/pytorch-2.0/#user-experience), so this would only happen once, same as nerfacc kernels, I guess.

@tancik (Contributor) left a comment:

LGTM

@tancik tancik merged commit e898f56 into main May 17, 2023
@tancik tancik deleted the jkulhanek/switch-to-newer-torch branch May 17, 2023 23:03