
Re-training SSD model on Windows #664

Closed
gapro20022 opened this issue Jul 30, 2020 · 9 comments

gapro20022 commented Jul 30, 2020

I followed the tutorial and downloaded the model .pth and requirements.txt. However, the command prompt returns errors when I try to train the model on the dataset I picked with the downloader. Can you help me with this?

2020-07-30 23:30:44 - Using CUDA...
2020-07-30 23:30:44 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/XBIN', dataset_type='open_images', datasets=['data'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2020-07-30 23:30:44 - Prepare training datasets.
2020-07-30 23:30:44 - loading annotations from: data/sub-train-annotations-bbox.csv
2020-07-30 23:30:44 - annotations loaded from: data/sub-train-annotations-bbox.csv
num images: 238
2020-07-30 23:30:44 - Dataset Summary:Number of Images: 238
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 732
Box: 62
Drink: 270
Drinking straw: 3
Plastic bag: 17
Tin can: 24
2020-07-30 23:30:44 - Stored labels into file models/XBIN\labels.txt.
2020-07-30 23:30:44 - Train dataset size: 238
2020-07-30 23:30:44 - Prepare Validation datasets.
2020-07-30 23:30:44 - loading annotations from: data/sub-test-annotations-bbox.csv
2020-07-30 23:30:44 - annotations loaded from: data/sub-test-annotations-bbox.csv
num images: 1589
2020-07-30 23:30:46 - Dataset Summary:Number of Images: 1589
Minimum Number of Images for a Class: -1
Label Distribution:
Bottle: 957
Box: 252
Drink: 1408
Drinking straw: 17
Plastic bag: 19
Tin can: 183
2020-07-30 23:30:46 - Validation dataset size: 1589
2020-07-30 23:30:46 - Build network.
2020-07-30 23:30:46 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2020-07-30 23:30:46 - Took 0.07 seconds to load the model.
2020-07-30 23:30:48 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2020-07-30 23:30:48 - Uses CosineAnnealingLR scheduler.
2020-07-30 23:30:48 - Start training from epoch 0.
C:\Users------\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\optim\lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
File "train_ssd.py", line 343, in <module>
device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 113, in train
for i, data in enumerate(loader):
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
return _MultiProcessingDataLoaderIter(self)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
w.start()
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>'

C:\Users-----\Desktop\jetson-inference-master\python\training\detection\ssd>2020-07-30 23:30:50 - Using CUDA...
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users------\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
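Both tracebacks trace back to one cause: on Windows, DataLoader workers start with the "spawn" method, so every transform object must survive a pickle round-trip, and the lambda created inside TrainAugmentation's __init__ cannot. A minimal sketch of the failure mode (class names here are illustrative, not the actual pytorch-ssd code):

```python
import pickle

class LambdaAug:
    """Mimics the failing pattern: a lambda created inside __init__."""
    def __init__(self, std):
        # A lambda is a local object; spawn-based workers can't unpickle it.
        self.transform = lambda img: img / std

class CallableAug:
    """A top-level callable class is picklable, so spawn workers can load it."""
    def __init__(self, std):
        self.std = std
    def __call__(self, img):
        return img / self.std

# The lambda-based version fails exactly like the traceback above:
try:
    pickle.dumps(LambdaAug(128.0))
    lambda_picklable = True
except (AttributeError, pickle.PicklingError):
    lambda_picklable = False

# The class-based version survives the round-trip:
restored = pickle.loads(pickle.dumps(CallableAug(128.0)))
```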

@dusty-nv
Owner

This hasn't been tested or supported on Windows. Have you tried training it on your Jetson?

@gapro20022
Author

I'm not with my Jetson right now, so I don't know. I thought it'd be better to train on my computer because it trains quicker. I'll try training it on my Jetson when I have it. In the meantime, if you could figure out the problem, that'd be really nice!

Thank you!

@dusty-nv
Owner

I train this on my Linux laptop as well (Ubuntu 16.04/18.04) without issue - it seems to be a Windows-related error.

See this related post - qfgaohao/pytorch-ssd#71 (comment)

@gapro20022
Author

Adding --num-workers=0 made it work, although it still shows some warnings. But hey, it's actually training!

C:\Users-----\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\optim\lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
C:\Users-----\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))

@AiueoABC

I'm not sure this is helpful, but in my case (I also use Windows 10), multiprocessing raised the same error when I used this pytorch-ssd.
To fix it, I changed "import pickle" to "import dill as pickle" in "C:\Python36\lib\multiprocessing\reduction.py".
Before running, I had to install dill with "pip install dill==0.3.0".

@steel540


It works! My setup: torch 1.7.0, CUDA 10.1 on Windows 10 with a 1070 laptop. Thank you!

@dasmehdix


Both solutions worked for my environment (Windows 10)! Thanks @AiueoABC @gapro20022

@kueblert

kueblert commented Sep 7, 2021

Alternatively, in case you don't want the dill dependency and still want to profit from multi-process data loading, replacing the lambda function in TrainAugmentation with the following worked for me:

class ScaleByStd:
    def __init__(self, std):
        self.std = std

    def __call__(self, image, boxes=None, labels=None):
        return image / self.std, boxes, labels
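To wire such a class in, the lambda would be swapped for a ScaleByStd instance inside the transform pipeline. A toy sketch, where Compose is a simplified stand-in for pytorch-ssd's transform container (not the real implementation):

```python
import pickle

class Compose:
    # Simplified stand-in for pytorch-ssd's transform container, which
    # threads (image, boxes, labels) through each transform in turn.
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image, boxes=None, labels=None):
        for t in self.transforms:
            image, boxes, labels = t(image, boxes, labels)
        return image, boxes, labels

class ScaleByStd:
    def __init__(self, std):
        self.std = std

    def __call__(self, image, boxes=None, labels=None):
        return image / self.std, boxes, labels

augment = Compose([ScaleByStd(128.0)])
image, boxes, labels = augment(256.0, boxes=[0.1], labels=[3])

# Unlike a lambda, the whole pipeline now survives pickling, which is
# what spawn-based DataLoader workers need:
clone = pickle.loads(pickle.dumps(augment))
```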

@PrayogaBoedihartoyo

In some cases you can try replacing the default workers=4 with --num-workers=0. You may still see warnings, but it keeps running.
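The reason --num-workers=0 sidesteps the crash: with zero workers the DataLoader iterates in the main process, so transforms are never pickled at all. A toy model of that behavior (this is not the real DataLoader, just the pickling step it implies):

```python
import pickle

def toy_loader(samples, transform, num_workers):
    # With workers > 0, the transform must cross a process boundary via
    # pickle (the step that fails under Windows "spawn"); with
    # num_workers=0 it simply runs in the main process.
    if num_workers > 0:
        transform = pickle.loads(pickle.dumps(transform))
    return [transform(s) for s in samples]

scale = lambda x: x / 128.0  # the kind of transform that breaks spawn workers

# In-process loading never touches pickle, so even a lambda is fine:
in_process = toy_loader([256.0, 128.0], scale, num_workers=0)

# Multi-process loading would have to pickle the lambda and fails:
try:
    toy_loader([256.0], scale, num_workers=2)
    crossed = True
except (pickle.PicklingError, AttributeError):
    crossed = False
```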
