Parallelized Queue implementation #301
The primary question is about features. The parallel programming part is not easy, but also not too hard. Basic assumptions: let's call the volume capacity V (how many volumes can be stored at the same time in memory).
1. Most basic approach: take V (e.g. 4) volumes and P (e.g. 5) patches, make (v, p) pairs, and randomize these. The loader logic would be: load a volume when there are fewer than V volumes in memory, and discard a volume when it is no longer needed. This is somewhat close to the current implementation, except for the part of waiting for the queue to be filled. Or should the last two batches be merged, split in half, making two shorter batches?
2. A volume can be loaded multiple times, but can also be discarded. This allows more randomization.
3. Some other approach.
My proposal:
So a command stream would be something like this: whenever a new sample is asked for, the patch for the next P #N command is returned (L and D are only for memory management). As far as I see, this is a flexible enough structure for implementing patient-based balancing, class-based balancing, or whatever. The specific implementations basically just yield their command stream, and this base class can handle the parallel loading. The question, though: there is a limit on the number of volumes stored in parallel in memory (V). So how do you feel about this command stream design, and what do you think about a queue for patches? Remark: just like the current queue, it should have no workers in the dataloader; the queue will implement its own workers. |
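A minimal sketch of what such a command stream generator could look like. The command names L/P/D follow the notation above; everything else, including the function and parameter names, is illustrative rather than the actual code from the PR:

```python
import random

def basic_command_stream(n_subjects, patches_per_subject, max_in_memory):
    """Yield ('L', i), ('P', i) and ('D', i) commands for subject index i."""
    pending = list(range(n_subjects))
    random.shuffle(pending)
    loaded = []
    patches_left = {}
    while pending or loaded:
        # Load a new subject whenever there is room in memory.
        if pending and len(loaded) < max_in_memory:
            i = pending.pop()
            loaded.append(i)
            patches_left[i] = patches_per_subject
            yield ('L', i)      # load volume i into memory
            continue
        # Otherwise emit a patch command for a random loaded subject.
        i = random.choice(loaded)
        patches_left[i] -= 1
        yield ('P', i)          # return one patch from volume i
        if patches_left[i] == 0:
            loaded.remove(i)
            yield ('D', i)      # discard volume i to free memory
```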
Hi, so more or less I have a working implementation, but there are a few issues to sort out. But anyway, let's move to the next point.
I wrote two command stream generators, a basic optimizer, etc.; all of them are simple functions. Finally, it makes life much easier if I can use the ... So currently these are the functions and classes:
command stream tools:
and the queue:
So one possible usage:
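The original snippet is not reproduced here; the following is a hedged sketch of what the usage could look like. The class name `ParallelQueue` and its arguments (`max_no_subjects_in_mem`, `patch_per_subject`, `seed`) are taken from this discussion, and the torchio names reflect the current API rather than the exact code in the PR:

```python
import torch
import torchio as tio

# A toy dataset with a single synthetic subject, just to make the sketch self-contained.
subjects = [tio.Subject(t1=tio.ScalarImage(tensor=torch.rand(1, 64, 64, 64)))]
dataset = tio.SubjectsDataset(subjects)
sampler = tio.data.UniformSampler(patch_size=32)

queue = ParallelQueue(                 # class name assumed from this thread
    dataset,
    sampler,
    max_no_subjects_in_mem=4,          # V: volumes kept in memory at the same time
    patch_per_subject=32,              # P: patches extracted from each volume
    seed=42,
)

# The queue runs its own subprocesses, so the outer DataLoader needs no workers.
loader = torch.utils.data.DataLoader(queue, batch_size=4, num_workers=0)
for batch in loader:
    pass  # training step
```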
(Seed: for the subprocesses I need to set a random seed. PyTorch dataloaders do exactly the same. But it could be that we make a subclass and use something like this:
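A hedged sketch of that subclass idea, assuming a hypothetical per-worker hook `_init_worker` analogous to a DataLoader's `worker_init_fn`; both the base class and the hook name are assumptions, not the PR's API:

```python
import random
import numpy as np
import torch

class SeededParallelQueue(ParallelQueue):      # base class name assumed
    """Queue subclass that gives every loader subprocess a deterministic seed."""

    def __init__(self, *args, seed=0, **kwargs):
        super().__init__(*args, **kwargs)
        self.base_seed = seed

    def _init_worker(self, worker_id):
        # Hypothetical hook called once inside each subprocess.
        seed = self.base_seed + worker_id
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
```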
The question is about structure: subclass or separate command stream? Should we assume other queue implementations, or just put defaults into this class? Performance-wise it looks quite good: if you have enough patches per volume, then basically the network never waits. :) How do you see it? |
Hi, I am not a programming expert, but here are some comments.
I would go with the last batch being shorter; no need to make it smarter, because in practice the total number of volumes >> number of volumes per batch.
Isn't it too dangerous? Imagine you are in a case with very few augmentations (so that the queue is filled very quickly) and a large model to compute on the GPU ... there is a risk that the queue will continuously grow with time ... no?
Sorry, but I have difficulty following you here and understanding the parameters of the two implementations. Maybe you can detail a little bit the two command stream arguments so that they perform the same random mixing: with 32 patches per subject I want a queue size of 32*4 to be sure to have 4 different subjects. Which parameter controls this in your example?
I do not understand what max_num_in_memory is controlling. I hope @fepegar will merge it; anyway, I will be pleased to test it. |
The comparison is a bit misleading. So let's assume I have 4 subjects, 4 patches per subject, and a patch queue of 1. OK, then what is a patch queue?
I was imprecise, but it will not grow forever. In fact, most of the time it uses less memory than the classic queue. The issue comes when you reach the end of the sampler, e.g. you took 4 patches and now you need a new volume. It makes no sense to go above number_of_subjects_in_memory * number_of_patches_per_subject. So unlike the traditional queue, where I have no idea about performance optimization, here the message is clear:
However,
Unfortunately, no. Both of them release items as soon as possible, one by one; the data loader creates batches from them. Why is that important? As I said, you need some time to load the new volumes, so you pump up the number of patches per subject and the queue size. The batch size might be e.g. 2, but the samples per subject 64. In this case, the last 32 batches would get data only from one subject. However, if you merge the last two blocks, then you still get variety in the last block. So very briefly: all implementations work in the background and yield items as soon as possible.
For the 4 subjects, you need to set max_no_subjects_in_mem to 4, and patch_per_subject needs to be set to 32.
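For illustration, those numbers imply an upper bound on buffered patches comparable to the classic queue length of 32*4 mentioned earlier (parameter names as used in this thread):

```python
max_no_subjects_in_mem = 4      # at most 4 different subjects loaded at once
patch_per_subject = 32          # 32 patches extracted from each subject
# Rough upper bound on patches buffered at any time:
max_buffered_patches = max_no_subjects_in_mem * patch_per_subject  # 128
```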
Yes, it is quite similar, but not exactly the same, especially when the different numbers are not multiples of each other. The conceptual difference, briefly, is this: the old queuing is split into two parts, command generation and command processing. The processing is very generic; the generation is the way to implement strategies, etc.
|
Thanks for the details, I think I now have a better understanding. One last clarification about the patch queue: |
That is a good point. First, in the current implementation there is no way to set the number of processes; it will be equal to the number of cores at most, or the number of subjects in memory + 1, whichever is smaller. (It uses a process pool, so the exact number is a bit complicated, but it tries to use all cores.) But the dataset is shared, so the number of processes here does not affect the memory usage. The number of worker processes is a different story, see below. (It might be that there is an issue somewhere here, I haven't found any, but parallel programming with process pools is always tricky, so I don't promise anything. :) ) As for performance: I don't use the dataloader anymore in the code, which has two benefits: it simplifies the code, but you can also set num_workers in the (main) dataloader to nonzero. If it is larger than 1, it will consume more memory. Most likely the optimal size is 1, so not using the main thread but having 1 subprocess, and inside this worker the queue will generate N new subprocesses for its own job. If you use more than 1, you will double, triple, etc., the memory use, but you will not gain speed, because the subprocesses will have fewer cores, so they will be proportionally slower (not to mention disk IO). But unlike the Queue implementation, it will not crash, it will just be slow. And because randomization is part of the queueing, the shuffle option makes no sense in the dataloader. Beware that for trying the code, you need to pip-install loguru and unsync (added to the requirements). |
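A short sketch of the loader setup described above; `queue` stands for the hypothetical parallel queue instance from the earlier sketch:

```python
import torch

loader = torch.utils.data.DataLoader(
    queue,              # the parallel queue instance (it spawns its own subprocesses)
    batch_size=2,
    num_workers=1,      # 0 also works; values > 1 multiply memory use without adding speed
    shuffle=False,      # randomization already happens inside the command stream
)
```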
Hi, thanks for the clarification. About num_workers=1 in the main process: I wonder if it will not limit the number of visible CPUs for this subprocess to 1, but you would have noticed it, so I guess not ... |
Hi, number of visible CPUs: I use the pure Python multiprocessing tools, so PyTorch doesn't limit (and cannot limit) the number of cores/processes. Also, I got pretty decent results even with a low number of subjects (e.g. 3 subjects on a 16-core machine). I am not sure about the optimal number of parallel subjects; maybe disk IO, or memory transfer speed, or whatever will limit the performance, and maybe on a 16-core machine the optimal number is like 8 or 12 or whatever. I guess this also depends on the specs, like storage, CPU, memory frequency, memory lanes, etc. Try it, I think it is easy to set up if you had the classic queue before, and let me know how it works. I will clean up the logging when there is a consensus that the algorithm works and the details are clear enough. |
Hi @dvolgyes, thanks for providing the new queue, it looks great. In blue is the torchio.Queue with queue length 80, 16 patches/volume and num_workers=6. So there is an improvement, but I have other tests that do not show it (I am currently double-checking and will report later). I have some memory issue that is growing during the iteration:
Unfortunately this does not work with num_workers>0 (I get a small number that is not correct compared to the info from htop). Last point: I was impressed by the logging you make; usually with PyTorch and multiple num_workers I do not get a proper meaningful error, but here with your stuff I get a very nice report (with the variable content!!! amazing!). Also, there is an improvement, but I am quite far from a 100% working GPU; I guess this is a limitation of my settings, where the time preparing the data is >> the computation time on the GPU, so I need more CPU (and CPU memory) to go faster ... that is why I am currently testing on another computer that has more CPUs. |
There are a few questions, especially about the last graph: in my experience, the patch extraction is around a factor of 50-100 faster than loading the volumes, assuming you have the same volume and patch sizes as me, and a similar machine. :)
Basically this is the issue: loading a volume takes T seconds. If everything is parallel, it is still T seconds. Anyway, I would recommend experimenting with larger queues; it should fit into memory (if not, reduce the ...). The parallel queue can do the same, but it needs to be defined what you mean by "subjects in memory". (Speaking of memory: with parallel processes, actually with many of them, I am not quite sure how to measure it.) Side remark: unfortunately, nibabel is not very intelligent about caching, etc., so if you use ... You can also experiment with parallel prefetching from the data loader in your trainer. So, could you please perform a few more tests? |
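As an aside on the nibabel caching remark, one possible workaround (a sketch, not part of the PR) is to keep a small LRU cache of fully loaded arrays so repeated loads of the same file hit memory instead of disk:

```python
import functools
import nibabel as nib
import numpy as np

@functools.lru_cache(maxsize=8)      # roughly the number of "subjects in memory"
def load_volume(path):
    # np.asanyarray pulls the data out of nibabel's lazy array proxy,
    # so subsequent requests for the same path are served from the cache.
    return np.asanyarray(nib.load(path).dataobj)
```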
Hi, I added a new implementation, let's call it double buffering. Also, unfortunately, I need an extra shared counter for some accounting ...
Hello, the first problem is that I have a kind of CPU memory accumulation that grows a lot (during iteration) when I use your queue (it seems worse with the new implementation, which is logical since you double the queue ...). I am not sure if it is caused by your code, or if it is an error in my code that gets worse when I use the ParallelQueue than when I use the original one. So I ended up training again with the torchio.Queue. It seems I also see the CPU max memory growing during iteration, but by a smaller amount, so I can manage it and I do not get memory kills on the cluster ... A second point which is quite annoying with your ParallelQueue is when a dataloader gets an error ... Many thanks |
Hi, memory: since it is also present (somewhat) in the classic queue, I would guess it is somewhere in your code ...
ParallelQueue v2: I had a different issue, namely the fact that it uses a lot of locks and communication queues ... Exception handling: I see, this is a good point. The design of loguru was that the main process could handle dead processes, e.g. if it is a worker process, it could be restarted, etc. But I see how this leads to issues in machine learning. The logging was never meant to be permanent, it is just for debugging, but in the next iteration I will try to make the exception propagate further in order to stop the whole training. Side note: in clusters it might be quite annoying to monitor events, e.g. a crashed training. If you are allowed to use internet access (not always obvious), you could utilize this: basically, you can write a short wrapper notifying yourself about major events, e.g. training start (assuming you have a batch system where it takes a long time), or adding an extra call after your training, so you would know it ended. |
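A hypothetical sketch of such a notification wrapper, assuming a generic webhook endpoint; `WEBHOOK_URL` and `train()` are placeholders, not part of any existing API:

```python
import requests

WEBHOOK_URL = "https://example.invalid/notify"   # placeholder endpoint

def notify(message):
    try:
        requests.post(WEBHOOK_URL, json={"text": message}, timeout=5)
    except requests.RequestException:
        pass  # monitoring must never kill the job itself

notify("training started")
try:
    train()                                      # placeholder for your training entry point
    notify("training finished")
except Exception as exc:
    notify(f"training crashed: {exc!r}")
    raise
```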
I need a few days' break (I have some other pressing issues), but I will be back soon with a new variant.
There are two directions I am considering:
Side remark: my data (.nii files) is somehow very slow to read. Does anybody have their own measurements of the time for loading subjects vs. the time for preprocessing/augmentation? |
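One rough way to measure this is sketched below; the torchio transform names reflect the current API, and the path is a placeholder:

```python
import time
import nibabel as nib
import numpy as np
import torch
import torchio as tio

path = "subject_t1.nii"                          # placeholder path

t0 = time.perf_counter()
volume = np.asanyarray(nib.load(path).dataobj)   # force the data into memory
t_load = time.perf_counter() - t0

# Build the subject from the already loaded array so augmentation time
# does not silently include loading time.
image = tio.ScalarImage(tensor=torch.from_numpy(volume[np.newaxis].astype(np.float32)))
subject = tio.Subject(t1=image)
transform = tio.Compose([tio.RandomAffine(), tio.RandomNoise()])

t0 = time.perf_counter()
transform(subject)
t_aug = time.perf_counter() - t0

print(f"load: {t_load:.3f} s   augmentation: {t_aug:.3f} s")
```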
Yes, of course, for the different logging steps, but the logging related to the error is very good and convenient for understanding what is going on, so if it could be kept ... and @fepegar, if there could be a way to generalize it to all of torchio ... |
I can imagine that with huge dimensions it can be hard ... To solve it I started trying to split the process into two parts. It seems there are a lot of Python libraries to automatically handle the caching, but I do not have much experience. No worries about the time (I also have a lot of less interesting work to catch up on ...) |
Hi both. I just want to say that I really want to look at this issue and corresponding PR, but I want to properly dedicate time to it and these last weeks have been very difficult (MICCAI, flying, self-isolating, moving, PhD...). I will go through this as soon as I can. |
Hi, well, on the other hand, the data loader cannot use GPUs; that is a limitation of PyTorch + multiprocessing. We could experiment with returning full volumes, e.g. 4-6, with a regular PyTorch dataloader, n workers, etc., and having a GPU-based patch extraction tool ...
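A minimal sketch of that GPU-side patch extraction idea: the dataloader returns whole volumes and random patches are cropped on the device. This is purely illustrative, not an existing torchio API:

```python
import torch

def random_patches(volumes, patch_size, n_patches):
    """volumes: (B, C, D, H, W) tensor already on the GPU."""
    b, c, d, h, w = volumes.shape
    patches = []
    for _ in range(n_patches):
        i = torch.randint(0, b, (1,)).item()
        z = torch.randint(0, d - patch_size + 1, (1,)).item()
        y = torch.randint(0, h - patch_size + 1, (1,)).item()
        x = torch.randint(0, w - patch_size + 1, (1,)).item()
        patches.append(volumes[i, :, z:z + patch_size, y:y + patch_size, x:x + patch_size])
    return torch.stack(patches)  # (n_patches, C, patch_size, patch_size, patch_size)

# usage: volumes = batch['t1'].to('cuda'); patches = random_patches(volumes, 64, 32)
```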
|
So it seems that you're not yet convinced by your implementation. If you'd still like to have this in the package, we could add a ... I keep thinking about how to optimize transforms, loading, etc. There are some issues with multiprocessing in the new SimpleITK (SimpleITK/SimpleITK#1239), but maybe if they're fixed, we'll get better times for ... I believe kornia does have some 3D support now. |
I like the idea of GPU transforms: I do see it as useful for whole-brain approaches, where you cannot gain speed from the queue. As you said, for patch-based approaches the PyTorch multiprocessing is intrinsically CPU. Intensity transforms should already be GPU compatible, since they rely only on torch. The difficulty comes with spatial transforms. For resampling and affine there are the torch utilities grid_sample and affine_grid. I tried once to play with them, and I struggled to get the same convention as nibabel resampling (from the same affine), but it is doable. I did not make any comparison, but anyway it will be application dependent, so I do think it is an important transformation to keep for sure! From a theoretical point of view it does allow creating a real variety of geometries (which you cannot get with only affine); for this I do consider it important. Then you have motion, which relies on FFT; here again there are GPU versions, but it is non-negligible work ... |
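A rough sketch of the grid_sample/affine_grid approach mentioned above; note that theta acts in normalized [-1, 1] coordinates, which is not the nibabel world-space affine convention, and reconciling the two is exactly the hard part left out here:

```python
import torch
import torch.nn.functional as F

volume = torch.rand(1, 1, 96, 96, 96)            # (N, C, D, H, W); move to .cuda() for GPU

# (N, 3, 4) affine in normalized coordinates: identity plus a small translation.
theta = torch.tensor([[[1.0, 0.0, 0.0, 0.1],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.0]]])

grid = F.affine_grid(theta, list(volume.shape), align_corners=False)
resampled = F.grid_sample(volume, grid, mode='bilinear', align_corners=False)
```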
Hi, |
Hello, |
Loguru: that is simple: it is not my project. It is very confusing if a project uses multiple logging mechanisms, or if only one subproject suddenly starts emitting log messages. Logging is a high-level decision, as are the log levels, the amount of information you need, etc., so I will not introduce a new dependency and a new logging scheme into the final product. :) |
Closing for now unless there's more activity. Thanks everyone! |
🚀 Feature
A processing queue implementation which has a limited capacity,
but immediately emits new data samples, and doesn't wait until the queue is full.
Motivation
Speed.
Pitch
Randomize in advance, but do not start the pre-loading until there is no empty spot in the queue.
Alternatives
Existing queue.