[PyTorch] Increase of training time by increasing epochs #2007
Replies: 92 comments
-
This is a commit hash for the examples repository, not for Gramine itself. What is the commit/version of Gramine itself that you used?
Hm, but does the time also increase in the native runs (without Gramine)?
-
Is there a command to check the Gramine version? However, I did install it from the official packages.
-
Thanks for the info. Yes, apparently you're using Gramine v1.4.
There is no command (yeah, I know, I know). But if you enable debug logs (`loader.log_level = "debug"` in the manifest), the version is printed at startup.
Fair enough. Could we ask you to run the experiment for a bit longer, e.g., for 1000 epochs? I wonder if this pattern continues. To date, I do not know of any particular issue in Gramine that could lead to this behavior. Gramine has some perf bottlenecks, but this looks more like a resource leak?
-
Yes, I will run the experiment for 1000 epochs. Moreover, I have some txt files from the federated learning experiments, where it is possible to see that the first round of the federation takes more or less 3 minutes, while the 100th round takes more than 30 minutes... These files are not well formatted, but they are quite intuitive. If you want, I can also post them. (The federated experiments were conducted with OpenFL https://github.com/securefederatedai/openfl, which works with Gramine.)
Edit: here is another run of 200 epochs without Gramine. @dimakuv as you can see, the slowdown is really, really low; however, to better understand whether the problem is the machine and not Gramine, I am running the same experiment for 1000 epochs as you said. We will see if there is an increasing pattern. I will update you.
-
Why does it look more like a resource leak? From a quick glance, I'd suspect issues related to the memory usage increasing along with the epochs.
-
Hello everyone. I have just completed the 3 runs for 1000 epochs. Besides time, I have also collected the current and peak memory for each epoch. Below you can also find the memory plots. Now, as you can see, there is a large increase in time using Gramine, but no increase in memory (I know, I still need to measure the current and peak memory for normal training too), so in my opinion the problem is not memory.
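For reference, a minimal sketch of how such per-epoch numbers can be collected — I'm assuming Python's `tracemalloc` here (its `get_traced_memory()` returns exactly a (current, peak) pair), and `train_one_epoch` is just a placeholder for the real training step from the pastebin script:

```python
import time
import tracemalloc

def train_one_epoch():
    # Placeholder for the real ResNet18/MNIST training step.
    _ = [i * i for i in range(100_000)]

tracemalloc.start()
total_start = time.time()

for epoch in range(3):  # 1000 in the real experiment
    epoch_start = time.time()
    train_one_epoch()
    current, peak = tracemalloc.get_traced_memory()  # bytes currently held / peak since start()
    print(f"epoch {epoch}: et={time.time() - epoch_start:.2f}s, "
          f"tt={time.time() - total_start:.2f}s, "
          f"current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB")
```

Note that `tracemalloc` only sees allocations made through the Python allocator; tensors allocated by PyTorch's C++ backend are not counted, so flat numbers here do not fully rule out native-memory growth.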
-
Well, looks like the memory stays flat. Generally, the plots are very cool. But they also show that it's not a memory/resource leak.
So to me, this sounds like a cold-boot problem: Gramine is for some reason not behaving that well during startup, but then it arrives at a constant rate (which seems to be ~13% slower than native for the gramine-sgx run).
-
Sure, I am running the experiment again, this time also measuring the memory used for normal training.
In this case, we are talking about seconds: the epoch time goes from 60 to 70 seconds over 1000 epochs. But when the problem is bigger, as I said in the first post, the slowdown is too heavy: from 3 minutes for the first round to 30 minutes after 100 rounds... You can understand that it is not possible to work with such a slowdown.
-
Ok, indeed, the problem is rooted somewhere in Gramine's behavior. @CasellaJr How invested are you in this problem? Could you run a performance analysis? There are ways to profile Gramine-SGX, but this requires non-trivial engineering skills (build Gramine in debug mode, run with profiling enabled, analyze the results).
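For context (a sketch based on my reading of Gramine's profiling documentation, not something spelled out in this thread): a Gramine built in `debug`/`debugoptimized` mode can emit `perf`-compatible profiling data via manifest options along these lines — exact option names and defaults should be double-checked against the docs of the Gramine version in use:

```toml
# Hypothetical excerpt of pytorch.manifest.template; requires a non-release Gramine build.
sgx.profile.enable = "main"       # profile the main application process
sgx.profile.mode = "ocall_outer"  # attribute samples to host-level OCALLs
sgx.profile.with_stack = true     # also collect call stacks (larger output, more detail)
```

The resulting perf-data file can then be inspected with `perf report` to see where the time goes inside and outside the enclave.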
-
Yes @dimakuv, this morning I already started this experiment.
-
Nice experiments! One ask for the future: could you pin the Y axis when rendering the charts? Right now they all have different ranges and scales on Y, which is a bit misleading ;) (It doesn't matter that much in this particular case, but it makes it a bit harder to compare them visually.)
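A small illustration of what "pinning the Y axis" means in matplotlib — the per-epoch times below are made up; only the shared `set_ylim` call matters:

```python
import matplotlib.pyplot as plt

# Made-up per-epoch times (seconds) for the three setups.
runs = {
    "native": [60, 61, 60, 62, 61],
    "gramine-direct": [65, 66, 67, 68, 69],
    "gramine-sgx": [70, 72, 74, 76, 78],
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, (name, times) in zip(axes, runs.items()):
    ax.plot(range(1, len(times) + 1), times)
    ax.set_title(name)
    ax.set_xlabel("epoch")
    ax.set_ylim(0, 100)  # same Y range on every chart so slopes are visually comparable
axes[0].set_ylabel("epoch time (s)")
plt.tight_layout()
plt.show()
```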
-
Also, one thing that seems suspicious to me: why is the running time so noisy without Gramine, but then very stable with it, both direct and SGX? It shouldn't look like this IMO.
-
I suggest you enable the pre-heat optimization in the manifest for your experiments.
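If I read this suggestion right, the option being referred to is (to the best of my knowledge) `sgx.preheat_enclave`, which pre-faults enclave pages at startup so the first accesses during training do not pay the page-fault cost. A sketch of the manifest addition:

```toml
# Add to pytorch.manifest.template (option name per my understanding of the Gramine manifest syntax).
sgx.preheat_enclave = true
```

After editing the template, the manifest needs to be regenerated and re-signed (e.g. by rebuilding the example).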
-
I have also noticed this strange behaviour. I do not know what really happens inside Gramine, so I do not have any clue about why there is less noise compared to normal training.
I will also try this option tomorrow, when the 1000-epoch experiment using the patched libgomp has finished.
-
If Gramine with libgomp enabled shows slower performance than normal Gramine, then I think the best option is to go back to the previous setting with Debian 11 and normal Gramine. It would be very good if, in that case, I obtain better results with Gramine 1.5. But how can I use Gramine 1.5 if the steps described in the guide are these:
[apt installation commands from the guide omitted]
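For completeness, the quoted steps are presumably the standard apt-based installation from the Gramine docs; the sketch below is reproduced from memory, so the exact keyring path and repository line should be double-checked against the current documentation:

```sh
# Add the Gramine package repository and install the released package.
sudo curl -fsSLo /usr/share/keyrings/gramine-keyring.gpg https://packages.gramineproject.io/gramine-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/gramine-keyring.gpg] https://packages.gramineproject.io/ $(lsb_release -sc) main" \
  | sudo tee /etc/apt/sources.list.d/gramine.list
sudo apt-get update && sudo apt-get install -y gramine
```

Since these steps track the package repository, they install whatever the latest released Gramine is, which is what the reply below refers to.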
-
Ah, so you were already using the latest release.
Currently you can't, because there is no 1.5 yet :) But after it is released, you'll just perform the same steps, and they will install the latest Gramine.
-
Yes, you are right... this thread is too long ahah
While the Dockerfile for Gramine with libgomp enabled is this:
[Dockerfile contents omitted]
-
@dimakuv Hello dimakuv, how are you?
-
@CasellaJr Nothing special was changed in the last week: https://github.com/gramineproject/gramine/pulse#merged-pull-requests. I don't see anything that could affect performance.
-
If my paper is accepted, for sure you will be in the acknowledgements 🤣 ❤️
-
Hello @dimakuv
Do you think that this warning about raw syscall instructions could be the cause of my problem, i.e., the increase of training time?
-
@CasellaJr Yes, definitely. This is a perf problem. If raw syscall instructions are frequent, then it may lead to a large perf degradation. To fix this warning, you need to use the patched libgomp. But we've discussed this extensively, and you had the surprising result of worse performance with the patched libgomp.
-
Ah ok, so this warning refers to the patched libgomp, ok.
-
Hello everyone.
-
How many NUMA nodes does your machine have?
Yes, the CPU topology can affect the performance, e.g., if you have several NUMA domains, Gramine may spread enclave threads and enclave memory across them, which will lead to higher memory access latencies and overall worse performance. You probably want to restrict Gramine to run on only one NUMA domain, e.g., via numactl. Further, for such benchmark experiments, it's recommended to limit the CPU cores on which Linux will schedule the enclave threads by using core pinning (e.g., taskset).
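A hedged example of what this could look like on the command line; the node and core numbers are made up, so the actual topology should be checked first:

```sh
# Inspect the topology: how many NUMA nodes exist and which CPUs belong to each.
numactl --hardware
lscpu | grep -i numa

# Keep Gramine's enclave threads and memory on a single NUMA node (node 0 here) ...
numactl --cpunodebind=0 --membind=0 gramine-sgx ./pytorch mnist.py

# ... and/or pin the process to a fixed set of cores so the scheduler cannot migrate its threads.
taskset -c 0-15 gramine-sgx ./pytorch mnist.py
```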
-
@CasellaJr Please see the reply from Kailun above, I have nothing to add to that reply.
-
Thank you guys, I will try!
-
Finally, our paper on Confidential Federated Learning has been accepted to the Deep Learning Security and Privacy Workshop 2024, held in conjunction with the IEEE Symposium on Security and Privacy. Thank you very much for all the effort you spent helping me overcome those heavy slowdowns. I will now try to work with Confidential Federated Learning on Intel TDX.
-
@CasellaJr: Is there anything actionable here? I.e. something we should change/fix in Gramine? Or is this just a thread for perf discussions and notes? If so, then I'll convert this into a GitHub Discussion, so it doesn't linger on our issue list.
-
Description of the problem
I have run several Federated Learning experiments using the OpenFL framework developed by Intel, which is compatible with Gramine and SGX. My federation was made of 3 collaborators (3 different SGX machines) and one aggregator (another SGX machine). I have these 4 machines: 4x bare-metal 8380 ICX systems, Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz. During training I noticed that the training time was increasing after each round. I thought that the problem was with OpenFL; however, I profiled it and did not find anything in the framework that could cause the slowdown. For this reason, I started simpler experiments; in particular, I ran typical centralized deep learning experiments using MNIST as the dataset and ResNet18 as the neural network. I ran 3 types of experiments:
python3 mnist.py
gramine-direct ./pytorch mnist.py
gramine-sgx ./pytorch mnist.py
I have followed the steps described in this PyTorch Gramine guide to run my Python script.
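For readers who don't have the guide open: the flow boils down to generating and signing the manifest, then launching the script under Gramine. A rough sketch — the PyTorch example wraps these steps in a Makefile, so the commands below are indicative rather than exact:

```sh
# Generate an enclave signing key once (if not already present).
gramine-sgx-gen-private-key

# Build the example: this expands pytorch.manifest.template and signs the SGX manifest
# (it runs gramine-manifest and gramine-sgx-sign under the hood).
make SGX=1

# After that, the three commands listed above run the native, gramine-direct, and gramine-sgx experiments.
```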
Below you can find the charts showing how training time grows "linearly".
Typical training time: [chart]
Non-SGX Gramine: [chart]
SGX Gramine: [chart]
Here you can find my Python script: pastebin
Steps to reproduce
Download the Python script and follow the steps described in this PyTorch Gramine guide. For each training epoch, the script prints the metrics (accuracies and losses), the time for that epoch ("et"), and the overall time ("tt").
Expected results
I expect that training time does not increase epoch by epoch.
Actual results
Time increases linearly.
Gramine commit hash
3be77927bbac64c2a4412f7e49dd5e0a59692b5b