The training model reported an error! #1

Open
Kyle-fang opened this issue Sep 20, 2023 · 3 comments

Comments

@Kyle-fang

python main_self_supervised.py --config configs\stl10_self_supervised.yaml

E:\Anaconda\envs\mv_mr\lib\site-packages\torchvision\models\_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
E:\Anaconda\envs\mv_mr\lib\site-packages\torchvision\models\_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=None.
warnings.warn(msg)
E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\connectors\accelerator_connector.py:898: UserWarning: You are running on single node with no parallelization, so distributed has no effect.
rank_zero_warn("You are running on single node with no parallelization, so distributed has no effect.")
E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\connectors\accelerator_connector.py:658: UserWarning: You passed Trainer(accelerator='cpu', precision=16) but native AMP is not supported on CPU. Using precision='bf16' instead.
rank_zero_warn(
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:291: LightningDeprecationWarning: Base Callback.on_train_batch_end hook signature has changed in v1.5. The dataloader_idx argument will be removed in v1.7.
rank_zero_deprecation(
Files already downloaded and verified
Files already downloaded and verified
Missing logger folder: lightning_logs\2023-09-19-15-52-00_stl10_self_supervised_scale_z
Files already downloaded and verified
Files already downloaded and verified

  | Name             | Type                | Params
-----------------------------------------------------
0 | _encoder         | ResnetMultiProj     | 174 M
1 | _loss_dc         | DistanceCorrelation | 0
2 | _scatnet         | Scattering2D        | 0
3 | _hog             | HOGLayer            | 0
4 | _identity        | Identity            | 0
5 | online_finetuner | Linear              | 20.5 K
-----------------------------------------------------
174 M Trainable params
0 Non-trainable params
174 M Total params
698.194 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]Files already downloaded and verified
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
File "F:\Fangweijie\mv-mr-main\main_self_supervised.py", line 92, in
main(args)
File "F:\Fangweijie\mv-mr-main\main_self_supervised.py", line 74, in main
trainer.fit(module, ckpt_path=path_checkpoint)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1199, in _run
self._dispatch()
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1289, in run_stage
return self._run_train()
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1375, in _run_sanity_check
self._evaluation_loop.run()
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\base.py", line 145, in run
self.advance(*args, **kwargs)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 110, in advance
dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\base.py", line 145, in run
self.advance(*args, **kwargs)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py", line 122, in advance
output = self._evaluation_step(batch, batch_idx, dataloader_idx)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py", line 217, in _evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 239, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 219, in validation_step
return self.model.validation_step(*args, **kwargs)
File "F:\Fangweijie\mv-mr-main\src\model\self_supervised_module.py", line 277, in validation_step
return self.step(batch, batch_idx, stage='val')
File "F:\Fangweijie\mv-mr-main\src\model\self_supervised_module.py", line 260, in step
loss_dc = self._step_dc(im_orig, z_scaled, representation)
File "F:\Fangweijie\mv-mr-main\src\model\self_supervised_module.py", line 194, in _step_dc
im_orig_hog = self._hog(im_orig_r)
File "E:\Anaconda\envs\mv_mr\lib\site-packages\torch\nn\modules\module.py", line 1130, in call_impl
return forward_call(*input, **kwargs)
File "F:\Fangweijie\mv-mr-main\src\model\hog.py", line 39, in forward
out.scatter_(1, phase_int.floor().long() % self.nbins, norm)
RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype
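For context, the error appears to come from the bf16 AMP fallback on CPU: the HOG layer's gradient magnitudes arrive as bfloat16 while the histogram buffer it scatters into is still float32, and scatter_ requires both tensors to share a dtype. A minimal sketch (not the repository's code; the shapes and names below are made up for illustration) that reproduces the same RuntimeError and shows one possible cast workaround:

import torch

# Minimal sketch of the failing pattern: the destination buffer defaults to float32,
# but the source tensor is bfloat16, as it would be under bf16 AMP on CPU.
nbins = 24
out = torch.zeros(4, nbins, 16)                    # float32 destination buffer
phase_int = torch.rand(4, 1, 16) * nbins           # hypothetical bin indices (float)
norm = torch.rand(4, 1, 16, dtype=torch.bfloat16)  # magnitudes as produced under bf16 AMP

try:
    out.scatter_(1, phase_int.floor().long() % nbins, norm)
except RuntimeError as e:
    print(e)  # scatter(): Expected self.dtype to be equal to src.dtype

# One possible workaround: cast the source to the destination dtype before scattering.
out.scatter_(1, phase_int.floor().long() % nbins, norm.to(out.dtype))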

@Kyle-fang
Author

Here is my configuration file:

batch_size: 64 # batch size for training
epochs: 100 # number of epochs to train for
warmup_epochs: 10 # number of warmup epochs (without learning rate decay)
log_every: 100 # frequency of logging (steps)
eval_every: 1 # frequency of evaluating on val set (epochs)
n_workers: 8 # number of workers for dataloader
fp16: True # whether to use fp16 precision
accumulate_grad_batches: 1 # number of accumulation steps
normalize_z: False # whether to normalize z in encoder

fine_tune_from: # path to pre-trained model to fine-tune from

std_margin: 1 # margin for the std values in the loss

optimizer: adam # optimizer to use. Options: adam, adamw
wd: 1e-6 # weight decay
lr: 1e-4 # learning rate
scheduler: warmup_cosine # type of scheduler to use. Options: cosine, multistep, warmup_cosine

dataset: # dataset parameters
  name: stl10 # dataset name
  size: 96 # image size
  n_classes: 10 # number of classes
  path: # path to dataset (leave empty)
  aug_policy: custom # augmentation policy. Choices: autoaugment, randaugment, custom

scatnet: # scatnet parameters
  J: 2 # number of scales
  shape: # shape of the input image
    - 96 # height
    - 96 # width
  L: 8 # number of rotations (filters per scale)

encoder_type: resnet # choices: resnet, deit
encoder: # encoder parameters
  out_dim: 8192-8192-8192 # number of neurons in the projection layer
  small_kernel: True # whether to use small kernels. Small kernels are used for STL10 dataset

hog: # histogram of oriented gradients parameters
  nbins: 24 # number of bins
  pool: 8 # pooling size

comment: stl10_self_supervised_scale_z
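For reference, a minimal loader sketch (the file path and key handling here are assumptions, not the repository's actual code) showing how this YAML would map onto a Lightning precision value; with fp16: True on a CPU-only machine, Lightning falls back to bf16 AMP, which is what triggers the scatter error above:

import yaml

# Hypothetical config loader; the repository's own loader may differ.
with open("configs/stl10_self_supervised.yaml") as f:
    config = yaml.safe_load(f)

# fp16: True asks for 16-bit precision; on CPU, Lightning swaps it for bf16 AMP.
precision = 16 if config.get("fp16", False) else 32
print(config["dataset"]["name"], config["batch_size"], precision)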

@Kyle-fang
Author

The dataset has also been set up!

[Attached screenshot: "无标题" (Untitled)]

@vkinakh
Owner

vkinakh commented Sep 20, 2023

You are using 16-bit precision while training on CPU; I do not think that is possible. Try changing the precision to 32.
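A hedged sketch of what that change looks like at the Trainer level (in this repository the value presumably comes from the fp16 flag in the YAML config, so setting fp16: False there should have the same effect):

from pytorch_lightning import Trainer

# Full precision on CPU avoids the bf16 AMP fallback that caused the scatter() dtype error.
trainer = Trainer(
    accelerator="cpu",
    precision=32,
    max_epochs=100,
)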
