group convolution error when using DDP training #1992

Unanswered

AnonymousAccount6688 asked this question in Contributing
Replies: 1 comment · 2 replies
-

@AnonymousAccount6688 DW models work fine for me; probably some modifications in the train script, or added special cases for rank 0, are breaking things.
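One way such a rank-0 special case can trip DDP's parameter check (a hypothetical illustration, not a diagnosis of this exact script): converting the model to `channels_last` on one rank but not the others changes parameter *strides* while leaving *sizes* identical, which is precisely the mismatch `_verify_param_shape_across_processes` reports. The stride difference is visible in a single process, no DDP required:

```python
import torch

# A regular 64 -> 64 conv weight: same sizes, different strides
# depending on memory format. If one rank applies a channels_last
# conversion that the other ranks skip, DDP's parameter verification
# sees matching sizes but mismatched strides and raises.
w_nchw = torch.empty(64, 64, 3, 3)                             # default NCHW-contiguous layout
w_nhwc = w_nchw.contiguous(memory_format=torch.channels_last)  # channels_last (NHWC) layout

print(w_nchw.shape == w_nhwc.shape)  # True  -- sizes match
print(w_nchw.stride())               # (576, 9, 3, 1)
print(w_nhwc.stride())               # (576, 1, 192, 64)
```

The fix under that assumption is to apply any memory-format conversion identically on every rank, before wrapping the model in `DistributedDataParallel`.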
  
-

I tried to use group convolution with the following line of code: `dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=64)`, but got the following error:
```
Using native Torch AMP. Training in mixed precision.
Using native Torch DistributedDataParallel.
Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 1024, in <module>
    main(args)
  File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 514, in main
    model = NativeDDP(model, device_ids=[args.local_rank], broadcast_buffers=not args.no_ddp_bb)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: params[6] in this process with sizes [64, 1, 3, 3] appears not to match strides of the same param in process 0.

[... identical tracebacks from the other ranks, interleaved in the original output, omitted ...]

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022927 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022928 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022929 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022930 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022931 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 1022932) of binary: /scratch365/ypeng4/software/bin/anaconda/envs/python311/bin/python
```
When I changed it to `dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=1)`, everything works fine.
Is there anything wrong with the DDP training of GroupConv?
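For reference, the depthwise layer itself (with the `Conv2d` spelling) is a perfectly valid PyTorch module, and its weight shape is exactly the `[64, 1, 3, 3]` the DDP check flags; that suggests the problem is a per-rank parameter layout difference, not group convolution as such. A minimal single-process sanity check:

```python
import torch

# Sanity check: a depthwise conv (groups == in_channels) is a valid layer.
# Its weight shape is [64, 1, 3, 3] -- the same param the DDP check flags --
# and it runs fine outside DDP.
dw_conv = torch.nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, groups=64)
x = torch.randn(2, 64, 8, 8)
y = dw_conv(x)

print(tuple(dw_conv.weight.shape))  # (64, 1, 3, 3)
print(tuple(y.shape))               # (2, 64, 8, 8) -- spatial size preserved by padding=1
```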