training command for single gpu training? #137
Comments
This is a common issue in Colab when trying to use multiple GPUs, since Colab typically only provides a single GPU. Try:
!CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 /content/drive/MyDrive/D-Fine/D-FINE-master/train.py -c /content/drive/MyDrive/D-Fine/D-FINE-master/configs/dfine/custom/dfine_hgnetv2_n_custom.yml --use-amp --seed=0
I removed --master_port=7777 because it isn't needed for single-GPU training. This should work better in the Colab environment since it uses only one GPU. Training may be slower than with multiple GPUs, but it will be more stable on Colab.
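As a quick sanity check before launching, you can confirm that the Colab/Kaggle runtime really exposes only one GPU. This is a minimal sketch using standard PyTorch calls; the printed device name depends on whatever runtime you were assigned:

```python
import os
import torch

# With CUDA_VISIBLE_DEVICES=0 the process only sees the first physical GPU,
# so torch should report exactly one visible device here.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```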
I tried, but still got a new error:
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
This error indicates that model initialization is failing because the pretrained weights can't be loaded properly. The key issue is 'NoneType' object has no attribute 'state_dict': the model object is None at the point where the code tries to load weights into it. Check that your config file path is correct: -c /path/to/config.yml
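As a rough way to check this, you can dump the top-level keys of the file you pass to -c and make sure it looks like a full training config rather than just a dataset definition. This is only a sketch assuming the config is plain YAML (D-FINE configs also pull sections in via includes, so a missing key here is a hint, not proof):

```python
import yaml  # requires pyyaml

# Example path from the commands in this issue; substitute whatever you pass to -c.
cfg_path = "/kaggle/working/D-FINE/configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml"

with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# A training config should eventually resolve to a model definition; if the
# solver never gets a model, you end up with the 'NoneType' state_dict error.
print(sorted(cfg.keys()))
```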
I have put my config here: /kaggle/working/D-FINE/configs/dataset/custom_detection.yml and used the command above, but I'm getting the same issue.
Hi,
Can you please share the command for custom training on a single GPU?
I want to use Colab or Kaggle for training.
I used the following two commands but got the same error:
!torchrun train.py -c /kaggle/working/D-FINE/configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml --use-amp --seed=0 -t /kaggle/working/D-FINE/dfine_s_obj365.pth
!CUDA_VISIBLE_DEVICES=0 torchrun --master_port=7777 --nproc_per_node=1 train.py -c /kaggle/working/D-FINE/configs/dataset/custom_detection.yml --use-amp --seed=0 -t /kaggle/working/D-FINE/dfine_s_obj365.pth
And got this error:
```
state = torch.load(path, map_location='cpu')
[rank0]: Traceback (most recent call last):
[rank0]:   File "/kaggle/working/D-FINE/src/solver/_solver.py", line 181, in load_tuning_state
[rank0]:     adjusted_state_dict = self._adjust_head_parameters(module.state_dict(), pretrain_state_dict)
[rank0]: AttributeError: 'NoneType' object has no attribute 'state_dict'

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/kaggle/working/D-FINE/train.py", line 84, in <module>
[rank0]:     main(args)
[rank0]:   File "/kaggle/working/D-FINE/train.py", line 54, in main
[rank0]:     solver.fit()
[rank0]:   File "/kaggle/working/D-FINE/src/solver/det_solver.py", line 24, in fit
[rank0]:     self.train()
[rank0]:   File "/kaggle/working/D-FINE/src/solver/_solver.py", line 81, in train
[rank0]:     self._setup()
[rank0]:   File "/kaggle/working/D-FINE/src/solver/_solver.py", line 52, in _setup
[rank0]:     self.load_tuning_state(self.cfg.tuning)
[rank0]:   File "/kaggle/working/D-FINE/src/solver/_solver.py", line 184, in load_tuning_state
[rank0]:     stat, infos = self._matched_state(module.state_dict(), pretrain_state_dict)
[rank0]: AttributeError: 'NoneType' object has no attribute 'state_dict'
E0127 11:36:46.344000 134127621440640 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 63) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2025-01-27_11:36:46
  host      : 64c81d2f379b
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 63)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```