training command for single gpu training? #137

Open
dsbyprateekg opened this issue Jan 7, 2025 · 4 comments


dsbyprateekg commented Jan 7, 2025

Hi,

Could you please share the command for custom training on a single GPU? I want to use Colab or Kaggle for training.

I used the following two commands but got the same error:
!torchrun train.py -c /kaggle/working/D-FINE/configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml --use-amp --seed=0 -t /kaggle/working/D-FINE/dfine_s_obj365.pth

!CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c /kaggle/working/D-FINE/configs/dataset/custom_detection.yml --use-amp --seed=0 -t /kaggle/working/D-FINE/dfine_s_obj365.pth

The error:
```
state = torch.load(path, map_location='cpu')
[rank0]: Traceback (most recent call last):
[rank0]: File "/kaggle/working/D-FINE/src/solver/_solver.py", line 181, in load_tuning_state
[rank0]: adjusted_state_dict = self._adjust_head_parameters(module.state_dict(), pretrain_state_dict)
[rank0]: AttributeError: 'NoneType' object has no attribute 'state_dict'

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/kaggle/working/D-FINE/train.py", line 84, in
[rank0]: main(args)
[rank0]: File "/kaggle/working/D-FINE/train.py", line 54, in main
[rank0]: solver.fit()
[rank0]: File "/kaggle/working/D-FINE/src/solver/det_solver.py", line 24, in fit
[rank0]: self.train()
[rank0]: File "/kaggle/working/D-FINE/src/solver/_solver.py", line 81, in train
[rank0]: self._setup()
[rank0]: File "/kaggle/working/D-FINE/src/solver/_solver.py", line 52, in _setup
[rank0]: self.load_tuning_state(self.cfg.tuning)
[rank0]: File "/kaggle/working/D-FINE/src/solver/_solver.py", line 184, in load_tuning_state
[rank0]: stat, infos = self._matched_state(module.state_dict(), pretrain_state_dict)
[rank0]: AttributeError: 'NoneType' object has no attribute 'state_dict'
E0127 11:36:46.344000 134127621440640 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 63) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-27_11:36:46
host : 64c81d2f379b
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 63)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

@FarhAnonymous

This is common in Colab when trying to use multiple GPUs, as Colab typically only provides a single GPU.

!CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 /content/drive/MyDrive/D-Fine/D-FINE-master/train.py -c /content/drive/MyDrive/D-Fine/D-FINE-master/configs/dfine/custom/dfine_hgnetv2_n_custom.yml --use-amp --seed=0

I removed --master_port=7777, as it isn't needed for single-GPU training.

This should work better in Colab since it will use only one GPU. Training may be slower than with multiple GPUs, but it will be more stable.
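
If you want to double-check the GPU setup first, here is a small notebook-cell sketch (plain PyTorch, nothing D-FINE-specific) that confirms only one GPU is visible, which is what --nproc_per_node=1 expects:

```python
# Run in a Colab/Kaggle cell before launching training.
import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # mirrors the flag in the command above

import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())  # should print 1 on a single-GPU runtime
```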


dsbyprateekg commented Feb 6, 2025

> This is common in Colab when trying to use multiple GPUs, as Colab typically only provides a single GPU.
>
> !CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 /content/drive/MyDrive/D-Fine/D-FINE-master/train.py -c /content/drive/MyDrive/D-Fine/D-FINE-master/configs/dfine/custom/dfine_hgnetv2_n_custom.yml --use-amp --seed=0
>
> I removed --master_port=7777, as it isn't needed for single-GPU training.
>
> This should work better in Colab since it will use only one GPU. Training may be slower than with multiple GPUs, but it will be more stable.

I tried this, but I still got an error:
```
state = torch.load(path, map_location='cpu')
[rank0]: Traceback (most recent call last):
[rank0]: File "/kaggle/working/D-FINE/src/solver/_solver.py", line 181, in load_tuning_state
[rank0]: adjusted_state_dict = self._adjust_head_parameters(module.state_dict(), pretrain_state_dict)
[rank0]: AttributeError: 'NoneType' object has no attribute 'state_dict'

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/kaggle/working/D-FINE/train.py", line 84, in
[rank0]: main(args)
[rank0]: File "/kaggle/working/D-FINE/train.py", line 54, in main
[rank0]: solver.fit()
[rank0]: File "/kaggle/working/D-FINE/src/solver/det_solver.py", line 24, in fit
[rank0]: self.train()
[rank0]: File "/kaggle/working/D-FINE/src/solver/_solver.py", line 81, in train
[rank0]: self._setup()
[rank0]: File "/kaggle/working/D-FINE/src/solver/_solver.py", line 52, in _setup
[rank0]: self.load_tuning_state(self.cfg.tuning)
[rank0]: File "/kaggle/working/D-FINE/src/solver/_solver.py", line 184, in load_tuning_state
[rank0]: stat, infos = self._matched_state(module.state_dict(), pretrain_state_dict)
[rank0]: AttributeError: 'NoneType' object has no attribute 'state_dict'
E0206 09:14:44.299000 133192790373504 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 108) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-02-06_09:14:44
host : 8effb7021c67
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 108)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

@FarhAnonymous

This error indicates that model initialization is failing, so the pretrained weights can't be loaded. The key line is 'NoneType' object has no attribute 'state_dict': the model object is None when the solver tries to load the weights into it.

Check that your config file path is correct:

-c /path/to/config.yml
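
A minimal sanity-check sketch (assuming the config is plain YAML; cfg_path below is the path used in this thread, so swap in your own): if the file is missing, or it only describes the dataset and neither defines nor includes a model, the solver ends up with a None model and fails with exactly the 'NoneType' error above.

```python
# Config sanity check (a sketch, not D-FINE's own tooling).
import os
import yaml  # assumes PyYAML is available in the environment

cfg_path = "/kaggle/working/D-FINE/configs/dataset/custom_detection.yml"  # example path from this thread

print("config exists:", os.path.isfile(cfg_path))

with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# If nothing here points at a model definition (directly or via an include),
# the training config is incomplete and the model will be None at load time.
print("top-level keys:", list(cfg.keys()))
```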

@dsbyprateekg

> This error indicates that model initialization is failing, so the pretrained weights can't be loaded. The key line is 'NoneType' object has no attribute 'state_dict': the model object is None when the solver tries to load the weights into it.
>
> Check that your config file path is correct:
>
> -c /path/to/config.yml

I have put my config here: /kaggle/working/D-FINE/configs/dataset/custom_detection.yml

and used the command !CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 train.py -c /kaggle/working/D-FINE/configs/dataset/custom_detection.yml --use-amp --seed=0 -t /kaggle/working/D-FINE/dfine_s_obj2coco.pth

But I am still getting the same error.
