training command for single gpu training? #137
Comments
This is a common issue in Colab when trying to use multiple GPUs, since Colab typically only provides a single GPU. Try:
!CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 /content/drive/MyDrive/D-Fine/D-FINE-master/train.py -c /content/drive/MyDrive/D-Fine/D-FINE-master/configs/dfine/custom/dfine_hgnetv2_n_custom.yml --use-amp --seed=0
I removed --master_port=7777 because it isn't needed for single-GPU training. This should work better in the Colab environment since it uses only one GPU. Training may be slower than with multiple GPUs, but it will be more stable on Colab.
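As a quick sanity check before launching, you can confirm that the Colab/Kaggle runtime really exposes only one GPU. This is a minimal sketch using standard PyTorch calls; the printed device name depends on whatever runtime you were assigned:

```python
import os
import torch

# With CUDA_VISIBLE_DEVICES=0 the process only sees the first physical GPU,
# so torch should report exactly one visible device here.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```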
I tried, but still got a new error:
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
This error indicates that model initialization is failing because the pretrained weights can't be loaded properly. The key issue is 'NoneType' object has no attribute 'state_dict': the model object is None at the point where the code tries to load weights into it. Check that your config file path is correct: -c /path/to/config.yml
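As a rough way to check this, you can dump the top-level keys of the file you pass to -c and make sure it looks like a full training config rather than just a dataset definition. This is only a sketch assuming the config is plain YAML (D-FINE configs also pull sections in via includes, so a missing key here is a hint, not proof):

```python
import yaml  # requires pyyaml

# Example path from the commands in this issue; substitute whatever you pass to -c.
cfg_path = "/kaggle/working/D-FINE/configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml"

with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# A training config should eventually resolve to a model definition; if the
# solver never gets a model, you end up with the 'NoneType' state_dict error.
print(sorted(cfg.keys()))
```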
I have put my config here: /kaggle/working/D-FINE/configs/dataset/custom_detection.yml and used the command above, but I'm getting the same issue.
Hi,
Can you please share the command for custom training on a single GPU?
I want to use Colab or Kaggle for training.
I used the following two commands but got the same error:
!torchrun train.py -c /kaggle/working/D-FINE/configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml --use-amp --seed=0 -t /kaggle/working/D-FINE/dfine_s_obj365.pth
!CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c /kaggle/working/D-FINE/configs/dataset/custom_detection.yml --use-amp --seed=0 -t /kaggle/working/D-FINE/dfine_s_obj365.pth
And got this error:
```
state = torch.load(path, map_location='cpu')
[rank0]: Traceback (most recent call last):
[rank0]:   File "/kaggle/working/D-FINE/src/solver/_solver.py", line 181, in load_tuning_state
[rank0]:     adjusted_state_dict = self._adjust_head_parameters(module.state_dict(), pretrain_state_dict)
[rank0]: AttributeError: 'NoneType' object has no attribute 'state_dict'

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/kaggle/working/D-FINE/train.py", line 84, in <module>
[rank0]:     main(args)
[rank0]:   File "/kaggle/working/D-FINE/train.py", line 54, in main
[rank0]:     solver.fit()
[rank0]:   File "/kaggle/working/D-FINE/src/solver/det_solver.py", line 24, in fit
[rank0]:     self.train()
[rank0]:   File "/kaggle/working/D-FINE/src/solver/_solver.py", line 81, in train
[rank0]:     self._setup()
[rank0]:   File "/kaggle/working/D-FINE/src/solver/_solver.py", line 52, in _setup
[rank0]:     self.load_tuning_state(self.cfg.tuning)
[rank0]:   File "/kaggle/working/D-FINE/src/solver/_solver.py", line 184, in load_tuning_state
[rank0]:     stat, infos = self._matched_state(module.state_dict(), pretrain_state_dict)
[rank0]: AttributeError: 'NoneType' object has no attribute 'state_dict'
E0127 11:36:46.344000 134127621440640 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 63) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2025-01-27_11:36:46
  host      : 64c81d2f379b
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 63)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```