[PyTorch] Increase of training time by increasing epochs #2007

CasellaJr · 2023-03-27T13:30:13Z

CasellaJr
Mar 27, 2023

Description of the problem

I have ran several Federated Learning experiments using the OpenFL framework developed by Intel, that is compatible with Gramine and SGX. My federation was made of 3 collaborators (3 different SGX machines) and one aggregator (another SGX machine). I have these 4 machines: 4x Baremetal 8380 ICX systems, Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz. During training I noted that training time was increasing after each round of training. I thought that the problem was with OpenFL; however, I profiled it and I did not found anything in the framework that can cause the slowdown. For this reason, I started simpler experiments; in particular, I ran typical centralized deep learning experiments using MNIST as dataset and Resnet18 as neural network. I ran 3 types of experiments:

Typical training with python3 mnist.py
Non-SGX Gramine with gramine-direct ./pytorch mnist.py
SGX Gramine with gramine-sgx ./pytorch mnist.py

I have followed the steps described in this PyTorch Gramine guide to run my Python script.

Below you can find the charts showing how training time grows "linearly".

Typical training time:

Non-SGX Gramine

SGX Gramine

Here you can find my Python script: pastebin

Steps to reproduce

Download the Python script and follow the steps described in this PyTorch Gramine guide. For each epoch of training will be printed metrics (accuracies and losses), time for each epoch ("et") and overall time ("tt").

Expected results

I expect that training time does not increase epoch by epoch.

Actual results

Time increases linearly.

Gramine commit hash

3be77927bbac64c2a4412f7e49dd5e0a59692b5b

dimakuv · 2023-03-28T08:00:33Z

dimakuv
Mar 28, 2023

Gramine commit hash

3be77927bbac64c2a4412f7e49dd5e0a59692b5b

This is a commit hash for the examples repo: gramineproject/examples@3be7792

What is the commit/version of Gramine itself that you used? Was it v1.4, or latest master, or something else?

Expected results

I expect that training time does not increase epoch by epoch.

Hm, but the in native runs (python3 mnist.py) the training time also seems to increase epoch by epoch? At a slow rate, but still it increases, looking at your first diagram. So maybe Gramine-SGX just amplifies this pattern?

0 replies

CasellaJr · 2023-03-28T08:48:59Z

CasellaJr
Mar 28, 2023
Author

Is there a command to check Gramine version? However, I did sudo apt-get install gramine, and I have gramine is already the newest version (1.4).. So, Gramine version is 1.4.
I see that there is a very slow rate increase in python3 mnist.py, however it is less than a second after 200 epochs, that can be justified also by measurement error in my opinion.

0 replies

dimakuv · 2023-03-28T08:58:34Z

dimakuv
Mar 28, 2023

Thanks for the info. Yes, apparently you're using Gramine v1.4.

Is there a command to check Gramine version?

There is no command (yeah, I know, I know). But if you enable debug logs (loader.log_level = "debug"), then Gramine prints its version in one of the log lines.

I see that there is a very slow rate increase in python3 mnist.py, however it is less than a second after 200 epochs, that can be justified also by measurement error in my opinion.

Fair enough. Could we ask you to run the experiment for a bit longer, e.g., for 1000 epochs? I wonder if this pattern continues.

To date, I do not know any particular issue with Gramine that could lead to this behavior. Gramine has some perf bottlenecks, but this looks more like a resource leak?

0 replies

CasellaJr · 2023-03-28T09:05:25Z

CasellaJr
Mar 28, 2023
Author

Yes, I will run the experiment for 1000 epochs.
In the meantime, I am running the same experiment for 200 epochs again in order to check if it was a measurement error (or something similar) or if the pattern depends on something else. I will edit this answer with the new plot as soon as training finishes.
Then I will run for 1000 epochs.
Finally, I will try on another group of SGX machines, if the problem persists.

Moreover, I have some txt files containing the federated learning experiments, where it is possible to see that first round of the federation takes more or less 3 minutes, while 100th round takes more than 30 minutes... These files are not well formatted, but are quite intuitive. If you want I can also post them. (Federated experiments have been conducted with OpenFL https://github.com/securefederatedai/openfl, that works with Gramine).

Edit: another run 200 epochs without gramine (another run of 200 epochs without gramine) @dimakuv as you can see slowdown is really really low; however, to better understand if the problem is the machine and not Gramine, I am running the same experiment for 1000 epochs as you said. We will see if there is an increasing pattern. I will update you.

0 replies

kailun-qin · 2023-03-28T12:00:44Z

kailun-qin
Mar 28, 2023
Maintainer

Gramine has some perf bottlenecks, but this looks more like a resource leak?

Why looks more like a resource leak? From a quick glance, I'd suspect issues related with the increasing memory usage along with the epoch increase.

0 replies

CasellaJr · 2023-03-31T11:36:10Z

CasellaJr
Mar 31, 2023
Author

Hello everyone. I have just completed the 3 runs for 1000 epochs. Besides time, I have also collected current and peak memory for each epoch using tracemalloc. I followed this guide to trace memory: tracemalloc, but I do not know if it is the best method to measure it on Gramine and SGX. If it is not correct, please suggest me the correct way 😄
Now, below I attach the results for 1000 epochs. In my opinion there is no need for comments 😃

Normal training time:

Non-SGX Gramine:

SGX Gramine:

Below you can find also the memory.
Current memory non-SGX:

Current memory SGX:

Peak memory non-SGX:

Now, as you can see there is a lot of increase in time using Gramine, but no increase in memory (I need to calculate current and peak memory also for normal training, I know), so in my opinion problem is not memory.
The idea is to test these scenarios using something CPU-heavy process, like the calculation of Pi, or something like stress-ng. If you have suggestions are welcome.

0 replies

dimakuv · 2023-03-31T12:01:21Z

dimakuv
Mar 31, 2023

Now, as you can see there is a lot of increase in time using Gramine, but no increase in memory (I need to calculate current and peak memory also for normal training, I know), so in my opinion problem is not memory.

Well, looks like tracemalloc shows incorrect data for Gramine. Or maybe it shows incorrect data for PyTorch in general? Please check what is shown for normal training...

Generally, the plots are very cool. But they also show that it's not a memory/resource leak:

gramine-direct stabilizes after ~700 epochs
gramine-sgx stabilizes after ~500 epochs

So to me, this sounds like the problem of cold boot: Gramine is for some reason not behaving that well during startup, but then it arrives at a constant rate (which seems to be ~13% slower than native for gramine-direct and ~30% slower than native for gramine-sgx). Not too bad.

0 replies

CasellaJr · 2023-03-31T17:57:40Z

CasellaJr
Mar 31, 2023
Author

Well, looks like tracemalloc shows incorrect data for Gramine. Or maybe it shows incorrect data for PyTorch in general? Please check what is shown for normal training...

Sure, I am running again the experiment considering also the memory used for normal training.

So to me, this sounds like the problem of cold boot: Gramine is for some reason not behaving that well during startup, but then it arrives at a constant rate (which seems to be ~13% slower than native for gramine-direct and ~30% slower than native for gramine-sgx). Not too bad.

In this case, we are talking about seconds, from 60 to 70 seconds after 1000 epochs. But, when the problem is bigger, as I said in the first post, the slowdown is too heavy. From 3 minutes for the first round to 30 minutes after 100 rounds... You can understand that it is not possible to work with this slowdown.

0 replies

CasellaJr · 2023-04-02T09:18:06Z

CasellaJr
Apr 2, 2023
Author

Here are the current and peak memory for normal training:

0 replies

dimakuv · 2023-04-03T10:36:05Z

dimakuv
Apr 3, 2023

Ok, indeed, the problem is rooted somewhere in Gramine behavior.

@CasellaJr How invested are you in this problem? Could you run a performance analysis? There are ways to profile Gramine-SGX, but this requires non-trivial engineering skills (build Gramine in debugoptimized mode, run it with special manifest parameters, analyze the resulting .perf data). You can check here: https://gramine.readthedocs.io/en/stable/performance.html#sgx-profiling

0 replies

CasellaJr · 2023-04-03T10:41:50Z

CasellaJr
Apr 3, 2023
Author

Yes @dimakuv , this morning I have already started this experiment: sudo perf record --call-graph dwarf -F 50 -e cpu-clock gramine-direct ./pytorch mnist.py. I will update you when it will finish.

0 replies

mkow · 2023-04-03T16:00:48Z

mkow
Apr 3, 2023
Maintainer

Nice experiments! One ask for the future: could you pin the Y axis when rendering the charts? Right now all of them have different range and scale on Y which is a bit misleading ;) (doesn't matter that much in this particular case, but it makes it a bit harder to compare them visually)

0 replies

mkow · 2023-04-03T16:03:04Z

mkow
Apr 3, 2023
Maintainer

Also, one thing which seems suspicious to me: why the running time is so noisy without Gramine but then gets very stable with it, both direct and SGX? It shouldn't look like this IMO.

0 replies

monavij · 2023-04-03T17:02:08Z

monavij
Apr 3, 2023
Collaborator

I suggest you enable pre-heat optimization in manifest for your experiments - sgx.preheat_enclave = true – pre-fault all enclave pages during enclave initialization. My guess is once you bring all pages in in you start stabilizing. You can look at other optimizations as well at https://gramine.readthedocs.io/en/stable/performance.html

0 replies

CasellaJr · 2023-04-03T18:51:08Z

CasellaJr
Apr 3, 2023
Author

Also, one thing which seems suspicious to me: why the running time is so noisy without Gramine but then gets very stable with it, both direct and SGX? It shouldn't look like this IMO.

I have also noted this strange behaviour. I do not know what really happens inside Gramine, so I do not have any clue about this less quantity of noise wrt normal training.

I suggest you enable pre-heat optimization in manifest for your experiments - sgx.preheat_enclave = true – pre-fault all enclave pages during enclave initialization. My guess is once you bring all pages in in you start stabilizing.

I will try also this option tomorrow, when the experiment for 1000 epoch using perf will finish. Moreover, I will upload here the perf.data because I am not sure on how to read the results.

0 replies

CasellaJr · 2023-05-16T13:10:09Z

CasellaJr
May 16, 2023
Author

If Gramine with libgomp enabled shows slower performance than normal Gramine, then I think that the best option is to go back to the previous setting with Debian11 and normal Gramine. It will be very good If in this case, with Gramine 1.5 I obtain better results. But, how can I use gramine 1.5 if the steps described in the guide are these:

sudo curl -fsSLo /usr/share/keyrings/gramine-keyring.gpg https://packages.gramineproject.io/gramine-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/gramine-keyring.gpg] https://packages.gramineproject.io/ $(lsb_release -sc) main" \
| sudo tee /etc/apt/sources.list.d/gramine.list
sudo apt-get update
sudo apt-get install gramine

0 replies

dimakuv · 2023-05-16T13:19:43Z

dimakuv
May 16, 2023

Ah, so you were already using the latest master branch of Gramine in your experiments? Sorry, this thread is so long that I forget what you've done.

But, how can I use gramine 1.5 if the steps described in the guide are these:

Currently you can't because there is no 1.5 yet :) But after it is released, you'll just perform the same steps, and they will install the latest Gramine.

0 replies

CasellaJr · 2023-05-16T13:45:05Z

CasellaJr
May 16, 2023
Author

Yes, you are right... this thread is too long ahah
So, just to recap:
When I used normal Gramine, this is my Dockerfile:

ARG BASE_IMAGE=python:3.8
FROM ${BASE_IMAGE}

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

RUN --mount=type=cache,target=/root/.cache/ \
    pip install --upgrade pip && \
    git clone --single-branch --branch develop https://github.com/intel/openfl.git && \
    pip install ./openfl && \
    pip install -U py-grpc-profile && \
    rm /usr/local/lib/python3.8/site-packages/openfl-gramine/openfl.manifest.template && \
    rm /usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py && \
    rm /usr/local/lib/python3.8/site-packages/openfl/component/aggregator/aggregator.py && \
    rm /usr/local/lib/python3.8/site-packages/openfl/databases/tensor_db.py && \
    rm /usr/local/lib/python3.8/site-packages/openfl/protocols/utils.py && \
    rm /usr/local/lib/python3.8/site-packages/openfl/federated/task/runner_pt.py

COPY openfl.manifest.template /usr/local/lib/python3.8/site-packages/openfl-gramine/
COPY collaborator.cazzi /usr/local/lib/python3.8/site-packages/openfl/component/collaborator/
COPY aggregator.cazzi /usr/local/lib/python3.8/site-packages/openfl/component/aggregator/
COPY tensor_db.cazzi /usr/local/lib/python3.8/site-packages/openfl/databases/
COPY utils.cazzi /usr/local/lib/python3.8/site-packages/openfl/protocols/
COPY runner_pt.cazzi /usr/local/lib/python3.8/site-packages/openfl/federated/task

RUN mv /usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.cazzi /usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py
RUN mv /usr/local/lib/python3.8/site-packages/openfl/component/aggregator/aggregator.cazzi /usr/local/lib/python3.8/site-packages/openfl/component/aggregator/aggregator.py
RUN mv /usr/local/lib/python3.8/site-packages/openfl/databases/tensor_db.cazzi /usr/local/lib/python3.8/site-packages/openfl/databases/tensor_db.py
RUN mv /usr/local/lib/python3.8/site-packages/openfl/protocols/utils.cazzi /usr/local/lib/python3.8/site-packages/openfl/protocols/utils.py
RUN mv /usr/local/lib/python3.8/site-packages/openfl/federated/task/runner_pt.cazzi /usr/local/lib/python3.8/site-packages/openfl/federated/task/runner_pt.py

# install gramine
RUN curl -fsSLo /usr/share/keyrings/gramine-keyring.gpg https://packages.gramineproject.io/gramine-keyring.gpg && \
    echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/gramine-keyring.gpg] https://packages.gramineproject.io/ stable main' | \
    tee /etc/apt/sources.list.d/gramine.list
RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
    apt-get update && \
    apt-get install -y --no-install-recommends \
    gramine libprotobuf-c-dev \
    && rm -rf /var/lib/apt/lists/*
    # there is an issue for libprotobuf-c in gramine repo, install from apt for now

# graminelibos is under this dir
ENV PYTHONPATH=/usr/local/lib/python3.8/site-packages/:/usr/lib/python3/dist-packages/:

#set environment variables (threads)
ENV OMP_NUM_THREADS=40
ENV OPENBLAS_NUM_THREADS=40
ENV MKL_NUM_THREADS=40
ENV VECLIB_MAXIMUM_THREADS=40
ENV NUMEXPR_NUM_THREADS=40

# install linux headers
# WORKDIR /tmp/
# RUN wget -c https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.11/amd64/linux-headers-5.11.0-051100_5.11.0-051100.202102142330_all.deb
# RUN dpkg -i *.deb
# RUN mv /usr/src/linux-headers-5.11.0-051100/ /usr/src/linux-headers-5.11.0-051100rc5-generic/
# WORKDIR /

# ENV LC_ALL=C.UTF-8
# ENV LANG=C.UTF-8

While the Dockerfile for Gramine with libgomp enabled is this:

FROM ubuntu:22.04

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata

RUN apt-get install -y build-essential \
    autoconf bison gawk nasm ninja-build pkg-config python3 python3-click \
    python3-jinja2 python3-pip python3-pyelftools wget git curl build-essential python-is-python3

# Install OpenFL
RUN pip install --upgrade pip && \
    git clone --single-branch --branch develop https://github.com/intel/openfl.git && \
    pip install ./openfl && \
    pip install -U py-grpc-profile
RUN rm /usr/local/lib/python3.10/dist-packages/openfl-gramine/openfl.manifest.template && \
    rm /usr/local/lib/python3.10/dist-packages/openfl/component/collaborator/collaborator.py && \
    rm /usr/local/lib/python3.10/dist-packages/openfl/component/aggregator/aggregator.py && \
    rm /usr/local/lib/python3.10/dist-packages/openfl/databases/tensor_db.py && \
    rm /usr/local/lib/python3.10/dist-packages/openfl/protocols/utils.py && \
    rm /usr/local/lib/python3.10/dist-packages/openfl/federated/task/runner_pt.py

COPY openfl.manifest.template /usr/local/lib/python3.10/dist-packages/openfl-gramine/
COPY collaborator.cazzi /usr/local/lib/python3.10/dist-packages/openfl/component/collaborator/
COPY aggregator.cazzi /usr/local/lib/python3.10/dist-packages/openfl/component/aggregator/
COPY tensor_db.cazzi /usr/local/lib/python3.10/dist-packages/openfl/databases/
COPY utils.cazzi /usr/local/lib/python3.10/dist-packages/openfl/protocols/
COPY runner_pt.cazzi /usr/local/lib/python3.10/dist-packages/openfl/federated/task

RUN mv /usr/local/lib/python3.10/dist-packages/openfl/component/collaborator/collaborator.cazzi /usr/local/lib/python3.10/dist-packages/openfl/component/collaborator/collaborator.py
RUN mv /usr/local/lib/python3.10/dist-packages/openfl/component/aggregator/aggregator.cazzi /usr/local/lib/python3.10/dist-packages/openfl/component/aggregator/aggregator.py
RUN mv /usr/local/lib/python3.10/dist-packages/openfl/databases/tensor_db.cazzi /usr/local/lib/python3.10/dist-packages/openfl/databases/tensor_db.py
RUN mv /usr/local/lib/python3.10/dist-packages/openfl/protocols/utils.cazzi /usr/local/lib/python3.10/dist-packages/openfl/protocols/utils.py
RUN mv /usr/local/lib/python3.10/dist-packages/openfl/federated/task/runner_pt.cazzi /usr/local/lib/python3.10/dist-packages/openfl/federated/task/runner_pt.py

#RUN rm /usr/local/lib/python3.10/dist-packages/openfl-gramine/openfl.manifest.template
#COPY openfl.manifest.template /usr/local/lib/python3.10/dist-packages/openfl-gramine/

###### Compile gramine

RUN python3 -m pip install 'meson>=0.56' 'tomli>=1.1.0' 'tomli-w>=0.4.0'
RUN apt-get install -y libunwind8 musl-tools python3-pytest
RUN apt-get install -y libgmp-dev libmpfr-dev libmpc-dev libisl-dev
RUN apt-get install -y libprotobuf-c-dev protobuf-c-compiler \
    protobuf-compiler python3-cryptography  python3-protobuf


# ubuntu 22.04
# Add sgx repo and download libraries for gramine
RUN echo 'deb [signed-by=/etc/apt/keyrings/intel-sgx-keyring.asc arch=amd64] https://download.01.org/intel-sgx/sgx_repo/ubuntu jammy main' | tee /etc/apt/sources.list.d/intel-sgx.list
RUN wget https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key
RUN cat intel-sgx-deb.key | tee /etc/apt/keyrings/intel-sgx-keyring.asc > /dev/null

RUN apt-get update
RUN apt-get install -y libsgx-epid libsgx-quote-ex libsgx-dcap-ql
RUN apt-get install -y libsgx-urts-dbgsym libsgx-enclave-common-dbgsym libsgx-dcap-ql-dbgsym libsgx-dcap-default-qpl-dbgsym
RUN apt-get install -y libsgx-dcap-default-qpl


#ubuntu 22.04
RUN wget https://download.01.org/intel-sgx/latest/linux-latest/distro/ubuntu22.04-server/sgx_linux_x64_sdk_2.19.100.3.bin

RUN chmod +x sgx_linux_x64_sdk_2.19.100.3.bin
RUN echo 'yes' | ./sgx_linux_x64_sdk_2.19.100.3.bin
RUN source sgxsdk/environment

RUN apt-get install -y libsgx-enclave-common-dev libsgx-dcap-ql-dev libsgx-dcap-default-qpl-dev

RUN curl -fsSL https://download.01.org/intel-sgx/sgx_repo/ubuntu/intel-sgx-deb.key | apt-key add -
RUN apt-get update
RUN apt-get install -y libsgx-dcap-quote-verify-dev


# Compiule gramine
RUN git clone https://github.com/gramineproject/gramine.git

WORKDIR gramine
#ubuntu22.04
RUN meson setup build/ --buildtype=release -Ddirect=enabled -Dsgx=enabled -Dlibgomp=enabled -Dsgx_driver=upstream

RUN ninja -C build/
RUN ninja -C build/ install
RUN gramine-sgx-gen-private-key

WORKDIR /
#originale
RUN rm -rf /var/lib/apt/lists/*

# graminelibos is under this dir
ENV PYTHONPATH=/usr/local/lib/python3.10/dist-packages:/usr/lib/python3/dist-packages

#set environment variables (threads)
ENV OMP_NUM_THREADS=40
ENV OPENBLAS_NUM_THREADS=40
ENV MKL_NUM_THREADS=40
ENV VECLIB_MAXIMUM_THREADS=40
ENV NUMEXPR_NUM_THREADS=40

# ENV LC_ALL=C.UTF-8
# ENV LANG=C.UTF-8

0 replies

CasellaJr · 2023-06-23T09:00:28Z

CasellaJr
Jun 23, 2023
Author

@dimakuv Hello dimakuv, how are you?
I am finishing my experiments, but from about 1 week I have again problems of slowdown... Now is not so much, because for experiments of 5 hours, it tooks around 7-8 hours, because there is one client (each time is a different one) that has slowdown in training, so all the other clients have to wait for him. During this last week, has changed something in Gramine? Maybe some update is causing this problem.
I am on Debian 3.8.16-bullseye

0 replies

dimakuv · 2023-06-23T10:56:45Z

dimakuv
Jun 23, 2023

@CasellaJr Nothing special was changed in the last week: https://github.com/gramineproject/gramine/pulse#merged-pull-requests. I don't see anything that could affect performance.

0 replies

CasellaJr · 2023-06-23T13:08:17Z

CasellaJr
Jun 23, 2023
Author

If my paper will be accepted, for sure you will be in the acknowledgements 🤣 ❤️

0 replies

CasellaJr · 2023-06-26T09:30:10Z

CasellaJr
Jun 26, 2023
Author

Hello @dimakuv
When I start my experiments, I have these lines:

INFO     🧿 Starting a Collaborator Service.                                                                                       collaborator.py:59
           INFO     Building 🡆 Object PyTorchCIFAR10InMemory from src.ptcifar10_inmemory Module.                                                     plan.py:171
Emulating a raw syscall instruction. This degrades performance, consider patching your application to use Gramine syscall API.

Do you think that this warning represents my problem? I mean the increase of training time

0 replies

dimakuv · 2023-06-26T09:45:39Z

dimakuv
Jun 26, 2023

Emulating a raw syscall instruction. This degrades performance, consider patching your application to use Gramine syscall API.

@CasellaJr Yes, definitely. This is a perf problem. If raw syscall instructions are frequent, then it may lead to a large perf degradation. To fix this warning, you need to use the patched libgomp.so.

But we've discussed this extensively, and you had the surprising result of having worse performance with patched libgomp.so than without it. So maybe it's the best that could be achieved in your case... (Though I would spend time debugging and analyzing how it is possible that the supposedly fast patched libgomp.so makes your case slower.)

0 replies

CasellaJr · 2023-06-26T09:56:42Z

CasellaJr
Jun 26, 2023
Author

Ah ok, so this warning is referred to patched libgomp, ok.
And yes, you are right, I had worse results in that case.
If sometimes you want to test my same examples, I can send you the code.
Thank you

0 replies

CasellaJr · 2023-08-15T11:06:37Z

CasellaJr
Aug 15, 2023
Author

Hello everyone.
@dimakuv I have finished my research project, and now I want to improve my work. All the experiments of this thread have been executed on dual socket machines. Do you think that the slowdown can be caused by a "process switch" among the CPUs? In this case, it can be useful a process pinning. Is it possible to implement process pinning with Gramine?

0 replies

kailun-qin · 2023-08-16T06:48:23Z

kailun-qin
Aug 16, 2023
Maintainer

All the experiments of this thread have been executed on dual socket machines.

How many NUMA nodes does your machine have?

Do you think that the slowdown can be caused by a "process switch" among the CPUs?

Yes, the CPU topology can affect the performance, e.g., if you have several NUMA domains, Gramine may spread enclave threads and enclave memory across them, which will lead to higher memory access latencies and overall worse performance. You may probably want to restrict Gramine to run to only one NUMA domain via e.g., numactl --cpunodebind=0 --membind=0 (for further details, pls see: https://gramine.readthedocs.io/en/latest/performance.html#choice-of-sgx-machine).

Further, for such benchmark experiments, it's recommanded to limit the CPU cores where Linux will schedule enclave threads to by using core pinning (taskset -c ... gramine-sgx) or even isolating cores via isolcpus or disabling interrupts on cores via nohz_full -- these can help reduce performance variations.

0 replies

dimakuv · 2023-08-21T05:48:48Z

dimakuv
Aug 21, 2023

@CasellaJr Please see the reply from Kailun above, I have nothing to add to that reply.

0 replies

CasellaJr · 2023-08-21T14:35:55Z

CasellaJr
Aug 21, 2023
Author

Thank you guys, I will try!

0 replies

CasellaJr · 2024-03-13T09:40:40Z

CasellaJr
Mar 13, 2024
Author

Finally, our paper on Confidential Federated Learning has been accepted to the Deep Learning Security and Privacy Workshop 2024 in conjunction with IEEE Symposium on Security and Privacy. Thank you very much for all the effort you spent in helping me to overcome those heavy slowdowns.
You can find the latest version of the paper attached.
DLSP___CONFIDENTIAL_FL.pdf

I will now try to work with Confidential Federated Learning on Intel TDX.

0 replies

mkow · 2024-05-05T14:49:52Z

mkow
May 5, 2024
Maintainer

@CasellaJr: Is there anything actionable here? I.e. something we should change/fix in Gramine? Or is this just a thread for perf discussions and notes? If so, then I'll convert this into a GitHub Discussion, so it doesn't linger on our issue list.

0 replies

[PyTorch] Increase of training time by increasing epochs #2007

Description of the problem

Steps to reproduce

Expected results

Actual results

Gramine commit hash

Replies: 92 comments

CasellaJr Mar 28, 2023 Author

CasellaJr Mar 28, 2023 Author

kailun-qin Mar 28, 2023 Maintainer

CasellaJr Mar 31, 2023 Author

CasellaJr Mar 31, 2023 Author

CasellaJr Apr 2, 2023 Author

CasellaJr Apr 3, 2023 Author

mkow Apr 3, 2023 Maintainer

mkow Apr 3, 2023 Maintainer

monavij Apr 3, 2023 Collaborator

CasellaJr Apr 3, 2023 Author

CasellaJr May 16, 2023 Author

CasellaJr May 16, 2023 Author

CasellaJr Jun 23, 2023 Author

CasellaJr Jun 23, 2023 Author

CasellaJr Jun 26, 2023 Author

CasellaJr Jun 26, 2023 Author

CasellaJr Aug 15, 2023 Author

kailun-qin Aug 16, 2023 Maintainer

CasellaJr Aug 21, 2023 Author

CasellaJr Mar 13, 2024 Author

mkow May 5, 2024 Maintainer

CasellaJr
Mar 28, 2023
Author

CasellaJr
Mar 28, 2023
Author

kailun-qin
Mar 28, 2023
Maintainer

CasellaJr
Mar 31, 2023
Author

CasellaJr
Mar 31, 2023
Author

CasellaJr
Apr 2, 2023
Author

CasellaJr
Apr 3, 2023
Author

mkow
Apr 3, 2023
Maintainer

mkow
Apr 3, 2023
Maintainer

monavij
Apr 3, 2023
Collaborator

CasellaJr
Apr 3, 2023
Author

CasellaJr
May 16, 2023
Author

CasellaJr
May 16, 2023
Author

CasellaJr
Jun 23, 2023
Author

CasellaJr
Jun 23, 2023
Author

CasellaJr
Jun 26, 2023
Author

CasellaJr
Jun 26, 2023
Author

CasellaJr
Aug 15, 2023
Author

kailun-qin
Aug 16, 2023
Maintainer

CasellaJr
Aug 21, 2023
Author

CasellaJr
Mar 13, 2024
Author

mkow
May 5, 2024
Maintainer