This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Workloads (Redis, Curl, R) failing with Out of memory PAL error after new manifest syntax to define lists of SGX trusted files. #2680

Closed
jinengandhi-intel opened this issue Sep 6, 2021 · 7 comments · Fixed by gramineproject/gramine#26

Comments

@jinengandhi-intel
Contributor

Description of the problem

On some systems (so far RHEL and CentOS servers) we are seeing a regression with the workloads named in the title. We do not see the issue on Ubuntu clients or servers. The regression was introduced by the recent commit ddc01ba, "Define SGX allowed/trusted/protected files as TOML arrays".

We tried raising loader.pal_internal_mem_size to as much as 16G, but the tests still fail.

Trace logs for the failing workloads are attached here:
R_example_trace_log_RHEL.txt
redis_trace_log_RHEL.txt
Curl_trace_log_RHEL.txt

Steps to reproduce

Take an SGX-enabled system (RHEL or CentOS), build Graphene, and run any of the above workloads.

Expected results

Workloads should PASS.

Actual results

Workloads fail with an out-of-memory PAL error (see the attached trace logs).

@dimakuv

dimakuv commented Sep 6, 2021

@jinengandhi-intel Could you attach the final redis-server.manifest.sgx manifest file? I wonder what is so special about RHEL/CentOS.

I assume that the difference is in how this final manifest file is generated on Ubuntu vs on RHEL/CentOS.

@dimakuv

dimakuv commented Sep 6, 2021

@jinengandhi-intel Also, could you attach the redis-server.manifest.sgx file from Ubuntu (where it doesn't fail)?

I want to look at two versions of this redis-server.manifest.sgx file: one from Ubuntu, one from RHEL. Comparing them side by side, we may spot the difference, which should point to the root cause of this failure.

@jinengandhi-intel
Contributor Author

The manifest.sgx files for Ubuntu are attached here. For RHEL, I am awaiting the manifest files from my colleague.
R.manifest.sgx.txt
curl.manifest.sgx.txt

@aniket-intelx

Please find the manifest.sgx files for RHEL attached.
curl.manifest.sgx.txt
R.manifest.sgx.txt
redis-server.manifest.sgx.txt

@dimakuv

dimakuv commented Sep 7, 2021

RHEL manifest files are ~10MB in size... This feels like way too much for the initial 64MB pre-allocated by Graphene.

@aniket-intelx @jinengandhi-intel Can any of you run the failing workload (e.g., redis-server) on RHEL under GDB and find the exact place where the out of PAL memory error happens? I would assume that it happens here: https://github.com/oscarlab/graphene/blob/33a68bc4302891e8c591570942bc39394acefd23/Pal/src/host/Linux-SGX/db_main.c#L683

@anjalirai-intel

@dimakuv
Before the mentioned commit, the RHEL manifest file for curl was ~16MB in size, and it did work.

@aniket-intelx will be sharing the rest of the details soon

dimakuv self-assigned this Sep 8, 2021
dimakuv added this to the release v1.2 milestone Sep 8, 2021
@dimakuv

dimakuv commented Sep 8, 2021

We debugged and the problem is in Graphene's pre-allocated internal PAL memory pool of 64MB.

We fail on toml_parse():

toml_table_t* manifest_root = toml_parse(manifest_addr, errbuf, sizeof(errbuf));

But we read loader.pal_internal_mem_size (which increases the internal PAL memory) only after parsing:

ret = toml_sizestring_in(g_pal_state.manifest_root, "loader.pal_internal_mem_size",

So we get a chicken-and-egg problem.

Easy solution: if we detect that the manifest size is greater than some threshold (I recommend 1MB), then we immediately increase the internal PAL memory by an additional 64MB, before the manifest is parsed.
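
The threshold heuristic described above could look roughly like the following. This is only a minimal sketch, not Graphene's actual code: the function name `pool_size_before_parse` and the three constants are hypothetical, and the values (64MB default pool, 1MB threshold, 64MB extra) are simply the numbers mentioned in this thread. The merged fix may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants; the values come from the discussion in this issue. */
#define DEFAULT_POOL_SIZE        (64ULL * 1024 * 1024) /* initial pre-allocated PAL pool */
#define LARGE_MANIFEST_THRESHOLD (1ULL * 1024 * 1024)  /* suggested 1MB cutoff */
#define EXTRA_POOL_SIZE          (64ULL * 1024 * 1024) /* added for large manifests */

/*
 * Hypothetical helper: decide how much internal memory should be available
 * BEFORE the manifest is parsed. It depends only on the raw manifest size,
 * so it does not need any value read from the (not yet parsed) manifest.
 */
uint64_t pool_size_before_parse(uint64_t manifest_size) {
    uint64_t size = DEFAULT_POOL_SIZE;

    /* Large manifests get the extra memory up front. */
    if (manifest_size > LARGE_MANIFEST_THRESHOLD)
        size += EXTRA_POOL_SIZE;

    /* Sanity check: the pool is never smaller than the default. */
    assert(size >= DEFAULT_POOL_SIZE);
    return size;
}
```

With these assumed values, a ~10MB manifest like the RHEL ones would get a 128MB pool before parsing, while a small manifest keeps the default 64MB. Any later loader.pal_internal_mem_size setting would still be applied after parsing, as it is today.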
