Improvement of DDP is needed! #463
This continues #177, #264 and #401, linked for reference.
I will retrain to see how much difference it makes. Setup: batch size 64, V100, yolov5s; N means NoSyncBN. Train time is averaged over 3 epochs.
Let's wait on @NanoCode012's tests to decide on this. A cleaner codebase is nice, but faster training is nicer.
I looked at this, and it seems like it would add some extra complication to train.py, though perhaps I'm not understanding how it would be implemented. We should keep in mind that we also want the code to be as minimal and as readable as possible. Sometimes it's easier to add features than to refactor and consolidate the code, but this second part is important.
Yes, this would be ideal. test and detect only use device 0 now. A more ambitious option might even be to merge these two files into a single eval.py, which would be responsible for all inference. test.py multi-GPU is commented out, as it produced errors when running FP16 inference (I had to pick one, so I picked FP16, since it helps single-GPU users, who are probably the dominant use case).
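For context, FP16 inference in PyTorch amounts to casting both the model and its inputs to half precision. A minimal sketch, assuming a CUDA device (the model here is an illustrative stand-in, not the repo's):

```python
import torch
import torch.nn as nn

# Illustrative model; any nn.Module behaves the same way.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()).cuda().eval().half()

img = torch.rand(1, 3, 320, 320).cuda().half()  # input dtype must match the model
with torch.no_grad():
    out = model(img)
print(out.dtype)  # torch.float16
```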
I can work on this myself later, once any more major changes are applied to train.py. We should also try to simplify datasets.py if possible, as mentioned before.
Yes.
Hello, my training is complete. Whew! I'm not sure what to make of it. My observations (batch size = 64, yolov5s, N means NoSyncBN, trained on V100s):
Edit: I am not sure how to visually show whether […]
Edit 2: […]
If we were to change to logging, it would make sense to change it everywhere to stay consistent. For readability, I guess it will change from rank-guarded print calls to plain logger calls. I saw a PR on making this repo a package for pip. It moves the Python files into a separate folder.
@NanoCode012 yes, I just saw the pip PR as well. It has a lot of changes that surprised me; I thought pip would be simpler. I don't think we want to move everything into another folder, but I'll chat with them to figure out what to do. That PR would come later though, about a week after v2.0 is released, so hopefully it won't affect any of this current work. From your perspective, does mp.spawn simplify the code a lot, a little, or not at all? Could you point me to the spawn branch?
BTW, the updated plots are awesome! Though like you said, it's very hard to draw a conclusion. I think we can ignore DP for now, as it's clearly not the way to train multi-GPU, so if we just focus on DDP 2, 4, 8, it seems like launch has a slight edge. Do you know why this might be? Everything is a compromise, so we basically need to decide whether mp.spawn helps in qualitative ways, like reducing/simplifying the code and commands, sufficiently to justify a bit slower training. Either that, or try to determine where the slowdown occurs.
```python
# before
if local_rank in [-1, 0]:
    print(something)

# after (with some setup at the beginning of main())
logger.info(something)
```

Using logging is actually simpler and more powerful.
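A minimal sketch of what that one-time setup in main() could look like (illustrative names, not the branch's actual code): only the main process logs at INFO, so every later logger.info() call is rank-aware for free.

```python
import logging

logger = logging.getLogger(__name__)

def set_logging(local_rank=-1):
    # Only the main process (rank -1 single-GPU, rank 0 in DDP) logs at INFO;
    # other ranks are raised to WARNING so messages aren't printed once per GPU.
    logging.basicConfig(
        format="%(message)s",
        level=logging.INFO if local_rank in [-1, 0] else logging.WARNING,
    )

# at the beginning of main():
set_logging(local_rank=-1)
logger.info("this prints once, regardless of how many processes are running")
```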
https://github.com/MagicFrogSJTU/yolov5/tree/feature/mp_spawn
Spawn only simplifies the command used to run it (no need for `python -m torch.distributed.launch`; a plain `python train.py` works).
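For reference, the spawn approach replaces the external launcher with an in-process call along these lines. This is a sketch under assumed names, not the branch's exact code:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(local_rank, world_size):
    # mp.spawn passes the process index as the first argument.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # A plain `python train.py` now works; no torch.distributed.launch needed.
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```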
You might have been confused when @MagicFrogSJTU and I were talking about improving readability.
I am not sure where it occurs. I will try to find out.
In my opinion, […]
Edit: If we made DDP the default, we wouldn't even need to pass in a flag, so the command would be the same as before: if it detects more than one device, it goes into DDP mode, so no tutorial would be needed. It would just work. I haven't implemented this yet.
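The detection itself could be as small as this hypothetical sketch (reusing the train() signature from the spawn sketch above):

```python
import torch
import torch.multiprocessing as mp

def main():
    n_gpus = torch.cuda.device_count()
    if n_gpus > 1:
        # Same command as before: more than one visible device switches to DDP.
        mp.spawn(train, args=(n_gpus,), nprocs=n_gpus)  # train() as sketched above
    else:
        train(0, 1)  # single-GPU / CPU path, unchanged
```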
@NanoCode012 ok, thanks for the explanation. Let's do this:
[…]
Okay!
Would this include @MagicFrogSJTU’s attempt at multi-GPU detect/test? Btw, should we try to improve the readability, @MagicFrogSJTU? When I was implementing the spawn, I realized that adding more variables, like a “distributed” bool, to “opt” would require changes in create_dataloader, which could affect test.py unless we add more arguments to the create_dataloader function. (It could get messy.)
Sure. I’ll try to get it done by today or tomorrow. Do I need to add mAP results to the tutorial as well, or only speed?
I can re-run for 5l. It would take some time, though. How many epochs would you like me to run before I average (3 as usual)?
Just some advice, @glenn-jocher: […]
Edit 0: […]
I really liked PyTorch Lightning when I saw it. It encapsulates all the pieces away, leaving a simple .fit(), but this would be a big change.
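As a rough illustration of the shape such an API could take (purely hypothetical, not Lightning's actual internals):

```python
import torch

class Trainer:
    """Hypothetical Trainer that hides the device/epoch plumbing behind .fit()."""

    def __init__(self, epochs=3, device="cuda"):
        self.epochs = epochs
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")

    def fit(self, model, loader, optimizer, loss_fn):
        model = model.to(self.device).train()
        for epoch in range(self.epochs):
            for imgs, targets in loader:
                imgs, targets = imgs.to(self.device), targets.to(self.device)
                optimizer.zero_grad()
                loss = loss_fn(model(imgs), targets)
                loss.backward()
                optimizer.step()
        return model

# usage: Trainer(epochs=3).fit(model, train_loader, optimizer, loss_fn)
```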
@NanoCode012 don't worry about mAP. I think from the PR we have an understanding of how the mAP responds as the GPU count increases; also, unfortunately, 3 epochs is too noisy to compare with, other than as a sanity check. Yes, 3 epochs for 5l should be more than enough.

I don't have any updates to test.py and detect.py, just some things I need to change in train.py. I think multi-GPU with DP is an easy change in test.py (the code is already there, but commented out in favor of FP16). detect.py is more complicated, as most inference is fixed to batch size 1 and requires more structural changes for batched inference, which would need dataloader changes as well. This is the reason I was thinking to merge detect and test eventually, as test.py already has batched inference in place.

@MagicFrogSJTU that is a good idea. I see your example code uses logging also. train.py could really use some refactoring/simplifying. It's been packed with so many features that it's grown a bit unwieldy: too many inlined additions and not enough subtractions. I want to remove the evolve code from it and put it in its own evolve.py file that would call train.py, and I'd also discussed before moving the hyps out into their own coco_hyps.yaml. A Trainer() class would be really nice too.

EDIT: Ok, can I push one update to train.py then? It's a quick bug fix for a cloud bucket upload problem I just noticed, plus a few brief simplifications, but nothing major.
Yes, please. I am working on test.py.
Hi @glenn-jocher, I have added the tutorial in #475. I could not add the "Reproduce" section with all the badges; could you add that? Please feel free to modify it or tell me what needs to be changed (shorten it, make it collapsible, etc.). @MagicFrogSJTU, could you please check whether I made any mistakes? Thanks!
To get item 3 done, I still have to resolve another issue first: #485.
I’m considering whether to attempt implementing the Trainer class or logging. I’m not sure how challenging it would be to make the Trainer class... I would like to try, though. For logging, we don’t have the “go” from Glenn...
@NanoCode012 I should have v2.0 released soon, probably later today. After that, please git pull or re-clone, and then you should be good to start experimenting away :) One suggestion: PRs are easiest to integrate the shorter they are; one-liners etc. are the fastest. Perhaps if you split the changes into bite-size chunks, they would be much easier to analyze and integrate. So, for example, once v2.0 is out you could do just a 'logging' PR with nothing else?
I saw one issue, #501, on multi-node training. I could modify the code, but I do not have the machines to test it. @MagicFrogSJTU, do you have multiple machines to test the code on?
I have machines, but I don't have time to do this... I will check it later.
I will write the code then, and I'll tell you when it's done. It's just a minor change.
@MagicFrogSJTU, I'm done. It's here: https://github.com/NanoCode012/yolov5/tree/muti_node

Command:

```bash
# node 1
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 0,1

# node 2
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 2,3
```

Add […]
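For clarity, the ranks work out as follows: torch.distributed.launch sets RANK and WORLD_SIZE in each process's environment (the local rank arrives as the --local_rank argument by default), and this small sketch just reads them back.

```python
import os

# With --nnodes=2 --nproc_per_node=2:
#   world_size = nnodes * nproc_per_node = 4
#   node 0 runs global ranks 0 and 1; node 1 runs global ranks 2 and 3
world_size = int(os.environ.get("WORLD_SIZE", 1))
global_rank = int(os.environ.get("RANK", 0))
print(f"process {global_rank} of {world_size}")
```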
@NanoCode012 The code is correct. The program runs as expected.
Using 1080Tis, an epoch costs nearly 30 minutes.
@MagicFrogSJTU @NanoCode012 closing this issue and removing the TODO tag, as basic DDP ops seem to be operating well now. Thank you for your contributions, gentlemen!
@NanoCode012 @glenn-jocher The original issue is too long. Please, let's continue it here.
🚀 Feature