
Re-training SSD model on Windows #664

Closed
gapro20022 opened this issue Jul 30, 2020 · 9 comments

gapro20022 commented Jul 30, 2020

I followed the tutorial and downloaded the model .pth and requirements.txt. However, the command prompt returns errors when I try to train the model on the dataset I picked with the downloader. Can you help me with this?

2020-07-30 23:30:44 - Using CUDA...
2020-07-30 23:30:44 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/XBIN', dataset_type='open_images', datasets=['data'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2020-07-30 23:30:44 - Prepare training datasets.
2020-07-30 23:30:44 - loading annotations from: data/sub-train-annotations-bbox.csv
2020-07-30 23:30:44 - annotations loaded from: data/sub-train-annotations-bbox.csv
num images: 238
2020-07-30 23:30:44 - Dataset Summary:Number of Images: 238
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 732
Box: 62
Drink: 270
Drinking straw: 3
Plastic bag: 17
Tin can: 24
2020-07-30 23:30:44 - Stored labels into file models/XBIN\labels.txt.
2020-07-30 23:30:44 - Train dataset size: 238
2020-07-30 23:30:44 - Prepare Validation datasets.
2020-07-30 23:30:44 - loading annotations from: data/sub-test-annotations-bbox.csv
2020-07-30 23:30:44 - annotations loaded from: data/sub-test-annotations-bbox.csv
num images: 1589
2020-07-30 23:30:46 - Dataset Summary:Number of Images: 1589
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 957
Box: 252
Drink: 1408
Drinking straw: 17
Plastic bag: 19
Tin can: 183
2020-07-30 23:30:46 - Validation dataset size: 1589
2020-07-30 23:30:46 - Build network.
2020-07-30 23:30:46 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2020-07-30 23:30:46 - Took 0.07 seconds to load the model.
2020-07-30 23:30:48 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2020-07-30 23:30:48 - Uses CosineAnnealingLR scheduler.
2020-07-30 23:30:48 - Start training from epoch 0.
C:\Users------\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\optim\lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
File "train_ssd.py", line 343, in <module>
device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 113, in train
for i, data in enumerate(loader):
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
return _MultiProcessingDataLoaderIter(self)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
w.start()
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>'

C:\Users-----\Desktop\jetson-inference-master\python\training\detection\ssd>2020-07-30 23:30:50 - Using CUDA...
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
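Both tracebacks trace back to one cause: on Windows, DataLoader workers start with the "spawn" method, so every transform object must survive a pickle round-trip, and the lambda created inside TrainAugmentation's __init__ cannot. A minimal sketch of the failure mode (class names here are illustrative, not the actual pytorch-ssd code):

```python
import pickle

class LambdaAug:
    """Mimics the failing pattern: a lambda created inside __init__."""
    def __init__(self, std):
        # A lambda is a local object; spawn-based workers can't unpickle it.
        self.transform = lambda img: img / std

class CallableAug:
    """A top-level callable class is picklable, so spawn workers can load it."""
    def __init__(self, std):
        self.std = std
    def __call__(self, img):
        return img / self.std

# The lambda-based version fails exactly like the traceback above:
try:
    pickle.dumps(LambdaAug(128.0))
    lambda_picklable = True
except (AttributeError, pickle.PicklingError):
    lambda_picklable = False

# The class-based version survives the round-trip:
restored = pickle.loads(pickle.dumps(CallableAug(128.0)))
```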

@dusty-nv
Owner

This hasn't been tested or supported on Windows. Have you tried training it on your Jetson?

@gapro20022
Author

I'm not with my Jetson right now, so I don't know. I thought it'd be better to train on my computer because it trains quicker. I'll try training it on my Jetson when I have it. In the meantime, if you could figure out the problem, that'd be really nice!

Thank you!

@dusty-nv
Owner

I train this on my Linux laptop as well (Ubuntu 16.04/18.04) without issue - it seems to be a Windows-related error.

See this related post - qfgaohao/pytorch-ssd#71 (comment)

@gapro20022
Author

Adding --num-workers=0 made it work, although it still shows some warnings. But hey, it's actually training!

C:\Users-----\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\optim\lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
C:\Users-----\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))

@AiueoABC

I'm not sure this is helpful, but in my case (I also use Windows 10), multiprocessing raised the same error when I used this pytorch-ssd.
To fix it, I changed "import pickle" to "import dill as pickle" in "C:\Python36\lib\multiprocessing\reduction.py".
Before running, I had to install dill with "pip install dill==0.3.0".

@steel540


It works! My setup: torch 1.7.0, CUDA 10.1 on Windows 10 with a 1070 laptop. Thank you!

@dasmehdix


Both solutions worked for my environment (Windows 10)! Thanks @AiueoABC @gapro20022

@kueblert

kueblert commented Sep 7, 2021

Alternatively, in case you don't want the dill dependency and still want to profit from multi-process data loading, replacing the lambda function in TrainAugmentation with the following worked for me:

class ScaleByStd:
    def __init__(self, std):
        self.std = std

    def __call__(self, image, boxes=None, labels=None):
        return image / self.std, boxes, labels
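To wire such a class in, the lambda would be swapped for a ScaleByStd instance inside the transform pipeline. A toy sketch, where Compose is a simplified stand-in for pytorch-ssd's transform container (not the real implementation):

```python
import pickle

class Compose:
    # Simplified stand-in for pytorch-ssd's transform container, which
    # threads (image, boxes, labels) through each transform in turn.
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image, boxes=None, labels=None):
        for t in self.transforms:
            image, boxes, labels = t(image, boxes, labels)
        return image, boxes, labels

class ScaleByStd:
    def __init__(self, std):
        self.std = std

    def __call__(self, image, boxes=None, labels=None):
        return image / self.std, boxes, labels

augment = Compose([ScaleByStd(128.0)])
image, boxes, labels = augment(256.0, boxes=[0.1], labels=[3])

# Unlike a lambda, the whole pipeline now survives pickling, which is
# what spawn-based DataLoader workers need:
clone = pickle.loads(pickle.dumps(augment))
```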

@PrayogaBoedihartoyo

In some cases you can try replacing the default workers=4 with --num-workers=0. You may still see warnings, but it keeps running.
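The reason --num-workers=0 sidesteps the crash: with zero workers the DataLoader iterates in the main process, so transforms are never pickled at all. A toy model of that behavior (this is not the real DataLoader, just the pickling step it implies):

```python
import pickle

def toy_loader(samples, transform, num_workers):
    # With workers > 0, the transform must cross a process boundary via
    # pickle (the step that fails under Windows "spawn"); with
    # num_workers=0 it simply runs in the main process.
    if num_workers > 0:
        transform = pickle.loads(pickle.dumps(transform))
    return [transform(s) for s in samples]

scale = lambda x: x / 128.0  # the kind of transform that breaks spawn workers

# In-process loading never touches pickle, so even a lambda is fine:
in_process = toy_loader([256.0, 128.0], scale, num_workers=0)

# Multi-process loading would have to pickle the lambda and fails:
try:
    toy_loader([256.0], scale, num_workers=2)
    crossed = True
except (pickle.PicklingError, AttributeError):
    crossed = False
```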
