Performance degradation for CPU NUFFT with PyTorch 1.8 #25

Closed
mmuckley opened this issue Mar 10, 2021 · 1 comment
mmuckley (Owner) commented Mar 10, 2021

I am noticing a performance degradation with PyTorch 1.8 for the CPU NUFFT on my home system (Windows 10, i5 8400, GTX 1660, torchkbnufft version 1.1.0). The GPU looks relatively unaffected. Details are below. I'm not sure yet why this is happening, but I will try to look into it. If anyone has any information, feel free to post on this issue.

PyTorch 1.8:

running profiler...
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: False, toep_mat: False, size_3d: None
forward average time: 2.0657340599999996, backward average time: 3.4234444799999992
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: False, toep_mat: True, size_3d: None
toeplitz forward/backward average time: 0.13343995500000005
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: True, toep_mat: False, size_3d: None
forward average time: 1.0262545000000016, backward average time: 1.0705226799999992
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: False, toep_mat: False, size_3d: None
GPU forward max memory: 0.159003136 GB, forward average time: 0.074529785, GPU adjoint max memory: 0.152530432 GB, backward average time: 0.0685140699999998
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: False, toep_mat: True, size_3d: None
GPU forward max memory: 0.114505216 GB, toeplitz forward/backward average time: 0.006467924000000096
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: True, toep_mat: False, size_3d: None
GPU forward max memory: 0.77268992 GB, forward average time: 0.20692490499999963, GPU adjoint max memory: 1.035167232 GB, backward average time: 0.2132972450000004

PyTorch 1.7.1:

running profiler...
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: False, toep_mat: False, size_3d: None
forward average time: 1.8955573599999997, backward average time: 1.6387825
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: False, toep_mat: True, size_3d: None
toeplitz forward/backward average time: 0.12237997000000007
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: True, toep_mat: False, size_3d: None
forward average time: 0.8352743000000004, backward average time: 1.01682184
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: False, toep_mat: False, size_3d: None
GPU forward max memory: 0.158736896 GB, forward average time: 0.07951689000000002, GPU adjoint max memory: 0.152530432 GB, backward average time: 0.06854967499999987
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: False, toep_mat: True, size_3d: None
GPU forward max memory: 0.114505216 GB, toeplitz forward/backward average time: 0.006591889999999978
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: True, toep_mat: False, size_3d: None
GPU forward max memory: 0.77268992 GB, forward average time: 0.2121914199999999, GPU adjoint max memory: 1.035167232 GB, backward average time: 0.21654677000000006
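
For reference, below is a minimal sketch of the kind of timing loop behind numbers like these. It is not the actual profiling script: it assumes the torchkbnufft 1.x interface (tkbn.KbNufft / tkbn.KbNufftAdjoint), uses an illustrative golden-angle radial trajectory, and times the adjoint operator directly rather than the autograd backward pass.

```python
import math
import time

import torch
import torchkbnufft as tkbn

# Problem size matching the runs above: 256 x 256 image, 405 spokes of
# length 512, 15 coils, batch size 1, CPU, no sparse/Toeplitz matrices.
im_size = (256, 256)
spokelength, nspokes, ncoil = 512, 405, 15

# Illustrative golden-angle radial trajectory, scaled to [-pi, pi].
ga = math.pi * 111.246117975 / 180
angles = torch.arange(nspokes, dtype=torch.float32) * ga
kr = torch.linspace(-math.pi, math.pi, spokelength)
kx = kr[:, None] * torch.cos(angles)[None, :]
ky = kr[:, None] * torch.sin(angles)[None, :]
ktraj = torch.stack((kx.flatten(), ky.flatten()), dim=0)

image = torch.randn(1, ncoil, *im_size, dtype=torch.complex64)

nufft_ob = tkbn.KbNufft(im_size=im_size)
adjnufft_ob = tkbn.KbNufftAdjoint(im_size=im_size)

num_trials = 5

# forward (image -> k-space)
start = time.perf_counter()
for _ in range(num_trials):
    kdata = nufft_ob(image, ktraj)
print("forward average time:", (time.perf_counter() - start) / num_trials)

# adjoint (k-space -> image)
start = time.perf_counter()
for _ in range(num_trials):
    _ = adjnufft_ob(kdata, ktraj)
print("adjoint average time:", (time.perf_counter() - start) / num_trials)
```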
mmuckley (Owner, Author) commented Apr 9, 2021

I have identified the cause of this issue: the overhead of repeated calls to torch.set_num_threads. Apparently these calls are more expensive in PyTorch 1.8.

I previously added these lines because of an observation that torchkbnufft wouldn't respect the OMP_NUM_THREADS environment variable. For example, if you had 8 threads on your system and set OMP_NUM_THREADS, torchkbnufft would still use all 8 threads unless torch.set_num_threads was called during the process forks. After removing the lines, the performance of the adjoint on CPU is much better. I don't see the oversubscription issue for the forward operation, but it remains for the adjoint, so we may need to do further work on adjoint threading.
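
For illustration only, here is a rough sketch of the threading pattern being discussed, not the library's actual code: per-coil work is launched as inter-op tasks with torch.jit.fork, and the OMP_NUM_THREADS limit is applied once at startup with a single torch.set_num_threads call rather than repeatedly inside the hot loop (the repeated calls are what became expensive under PyTorch 1.8). Function names here are hypothetical.

```python
import os

import torch


def _per_coil_work(coil_image: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the per-coil interpolation work.
    return coil_image * 2


def parallel_over_coils(coil_data: torch.Tensor) -> torch.Tensor:
    # One inter-op task per coil.  All tasks share the global intra-op
    # thread pool, so the thread count does not need to be reset inside
    # each task on every call.
    futures = [torch.jit.fork(_per_coil_work, coil) for coil in coil_data]
    return torch.stack([torch.jit.wait(fut) for fut in futures])


if __name__ == "__main__":
    # Honor OMP_NUM_THREADS a single time at startup instead of calling
    # torch.set_num_threads on every forward/adjoint invocation.
    omp_threads = os.environ.get("OMP_NUM_THREADS")
    if omp_threads is not None:
        torch.set_num_threads(int(omp_threads))

    out = parallel_over_coils(torch.randn(15, 256, 256))
    print(out.shape, "intra-op threads:", torch.get_num_threads())
```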

I think I'm going to release version 1.2.0 of torchkbnufft for now to handle PyTorch 1.8 without regressions, and we can try to do more threading work for the adjoint in the future.
