Spark gramine k8s cluster frequently fails to execute pyspark task fork #2010

zzylydx · 2024-09-27T01:50:18Z

zzylydx
Sep 27, 2024

I am now running spark tasks on the bigdl gramine k8s cluster. Task configuration: driver 2-core 8GB epc, 15 executors 2-core 8GB epc. Cluster configuration: 4 virtual machines with 128GB epc. Tasks performed: pyspark. Since spark executes pyspark, it needs to fork the child process to execute the python worker, and gramine fork will create an enclave instance with the same size as the parent process, so when the task is executed, fork failure errors appear at intervals, and the process cannot be created. In the end, spark tries 4 times. The mission then failed. However, the current task configuration has not fully loaded the load (according to my current understanding, 384GB epc is consumed), so I don’t know how to tune it. Please give me some advice. Will the child process epc not be restored immediately after the grine fork is completed? At the same time, I still have a question. I currently see that the spark configuration provided by Intel is spark.python.worker.reuse=false. My current understanding is that if it is reused, there is no need to fork repeatedly. If it is not executed, it will be faster. Please help solve it. Let’s look at these questions.

monavij · 2024-09-27T17:04:21Z

monavij
Sep 27, 2024
Collaborator

Can you try running your workload with EDMM enabled in Gramine manifest? At least your forks will be much faster as it won't try and allocate all EPC memory at startup

11 replies

mkow Oct 3, 2024
Maintainer

Quoting my message to you from a few days ago:

please don't paste code as screenshots

I'm sorry, but if you don't want to take your time to write your bugreports well and ensure they are high-quality, then don't expect us spend time analyzing them. I won't be typing all the log errors from the log manually into code search...

Anyways, this is a really old code, we don't provide any support for it. You should never use it in production, because there are known security bugs in it.

zzylydx Oct 3, 2024
Author

Environmental Information

Gramine version: v1.3.1
Spark version: 3.1.3 (Intel made a modified version for Gramine based on this official version.)

Gramine manifest:

libos.entrypoint = "{{ execdir }}/bash"
loader.entrypoint = "file:{{ gramine.libos }}"
loader.pal_internal_mem_size = "512M"
loader.log_level = "{{ log_level }}"
loader.insecure__use_host_env = true
loader.env.LD_PRELOAD  = ""

sys.enable_extra_runtime_domain_names_conf = true
sys.insecure__allow_eventfd = true

#loader.insecure__use_cmdline_argv = true
loader.insecure_disable_aslr = true
#loader.argv_src_file = "file:/ppml/trusted-big-data-ml/secured_argvs"
loader.argv_src_file = "file:/ppml/secured_argvs"

sgx.remote_attestation = "dcap"
sgx.ra_client_spid = ""
sgx.allow_file_creation = true
sgx.debug = false
sgx.nonpie_binary = true
sgx.enclave_size = "8G"
sgx.thread_num = 1024
sgx.file_check_policy = "allow_all_but_log"
sgx.static_address = 1
sgx.isvprodid = 1
sgx.isvsvn = 3
sys.stack.size = "256M"

loader.env.LD_LIBRARY_PATH = "/lib:{{ arch_libdir }}:/usr{{ arch_libdir }}:/usr/lib/python3.9/lib:/usr/lib:{{ jdk_home }}:{{ jdk_home }}/lib/amd64/jli:/ppml/lib"
loader.env.PATH = "{{ execdir }}:/usr/sbin:/usr/bin:/:/sbin:/bin:{{ jdk_home }}/bin"
#loader.env.PYTHONHOME = "/usr/lib/python3.9"
loader.env.PYTHONPATH = "/usr/lib/python3.9:/usr/lib/python3.9/lib-dynload:/usr/local/lib/python3.9/dist-packages:/usr/lib/python3/dist-packages:/ppml/bigdl-ppml/src"
loader.env.JAVA_HOME = "{{ jdk_home }}"
loader.env.JAVA_OPTS = "'-Djava.library.path={{ jdk_home }}/lib -Dsun.boot.library.path={{ jdk_home }}/lib'"
loader.env.SPARK_USER = "{{ spark_user }}"
loader.env.SPARK_SCALA_VERSION = "2.12"
loader.env.SPARK_HOME = "/opt/spark"
loader.env.SPARK_CONF_DIR = "/opt/spark/conf"
loader.env.SPARK_JARS_DIR = "/opt/spark/jars"
loader.env.PYSPARK_PYTHON = "/usr/bin/python3.9"

# Python's NumPy spawns as many threads as there are CPU cores, and each thread
# consumes a chunk of memory, so on large machines 1G enclave size may be not enough.
# We limit the number of spawned threads via OMP_NUM_THREADS env variable.
loader.env.OMP_NUM_THREADS = "4"


fs.mounts = [
  { path = "{{ arch_libdir }}", uri = "file:{{ arch_libdir }}" },
  { path = "/usr{{ arch_libdir }}", uri = "file:/usr{{ arch_libdir }}" },
  { path = "{{ execdir }}", uri = "file:{{ execdir }}" },
  { path = "/usr/lib", uri = "file:/usr/lib" },
  { path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
  { path = "/usr/local", uri = "file:/usr/local" },
  { path = "/etc", uri = "file:/etc" },
  { path = "/usr/local/etc", uri = "file:/etc" },
  { path = "/opt", uri = "file:/opt" },
  { path = "/bin", uri = "file:/bin" },
  { path = "/tmp", uri = "file:/tmp" },
  { path = "/usr/lib/python3.9", uri = "file:/usr/lib/python3.9" },
  { path = "/usr/lib/python3/dist-packages", uri = "file:/usr/lib/python3/dist-packages" },
  { path = "/root/.kube/", uri = "file:/root/.kube/" },
  { path = "/root/.keras", uri = "file:/root/.keras" },
  { path = "/root/.m2", uri = "file:/root/.m2" },
  { path = "/root/.zinc", uri = "file:/root/.zinc" },
  { path = "/root/.cache", uri = "file:/root/.cache" },
  { path = "/usr/lib/gcc", uri = "file:/usr/lib/gcc" },
  { path = "/ppml", uri = "file:/ppml" },
  { path = "/root/.jupyter", uri = "file:/root/.jupyter" },
  { type = "encrypted", path = "/ppml/encrypted-fs", uri = "file:/ppml/encrypted-fs", key_name = "_sgx_mrsigner" },
  { type = "encrypted", path = "/ppml/encrypted-fsd", uri = "file:/ppml/encrypted-fsd", key_name = "sgx_data_key" },
  { type = "encrypted", path = "/ppml/data/keys/", uri = "file:/ppml/data/keys/", key_name = "_sgx_mrsigner" },
  { path = "/opt/spark/conf", uri = "file:/opt/spark/conf-copy" },
  { path = "/opt/spark/logs-conf", uri = "file:/opt/spark/logs-conf" },
  { path = "/opt/spark/pod-template", uri = "file:/opt/spark/pod-template-copy" },
  { path = "/opt/spark/work-dir", uri = "file:/opt/spark/work-dir" },
  { path = "/app/log", uri = "file:/app/log" },
]
#  { path = "{{ gramine.runtimedir() }}/etc/localtime", uri = "file:/etc" },

sgx.trusted_files = [
  "file:{{ gramine.libos }}",
  "file:{{ gramine.runtimedir() }}/",
  "file:{{ arch_libdir }}/",
  "file:/usr/{{ arch_libdir }}/",
  "file:{{ execdir }}/",
  #"file:/ppml/trusted-big-data-ml/secured_argvs",
  "file:/ppml/secured_argvs",
  "file:/ppml/scripts/ailand-kms/",
]

sgx.allowed_files = [
  "file:scripts/",
  "file:/etc",
  "file:/tmp",
  "file:{{ jdk_home }}",
  "file:/ppml",
  "file:{{ python_home }}",
  "file:/usr/lib/python3",
  "file:/usr/local/lib/python3.9/dist-packages",
  "file:/root/.keras",
  "file:/root/.m2",
  "file:/root/.zinc",
  "file:/root/.cache",
  "file:/usr/lib/gcc",
  "file:/root/.kube/config",
  "file:/etc/localtime",
  "file:/opt/spark",
  "file:/usr/bin",
  "file:/app/log",
]

Problem Description

Hello Gramine team, I'm very sorry for my previous behavior.

I'm working with Gramine on a project where we're running spark standalone mode(on k8s) task. When executing a pyspark task, spark will fork some python subprocess to execute the python code, and spark will communicate with the subprocess via socket. But we faced with some problems in the production environment.

First, due to the limitation of Gramine version, we cannot enable EDMM configuration, so all fork operations will create a child process with the same EPC size as the parent process. In this case, if the spark task resource allocation is unreasonable, it is easy to cause EPC OOM then the fork fails. We have limited the usage of system EPC resources by adjusting the configuration of spark to prevent EPC overflow.

Second, after we solve problem 1, during the experiments, we found that fork failure still occurred, and it happens occasionally. The error reported at the spark level is that the socket connection timed out. The most difficult problem for us, is that the fork process may get stuck.

Spark error message:

2024-10-02 09:47:37,144 WARN  org.apache.spark.scheduler.TaskSetManager                     - Lost task 0.0 in stage 4.0 (TID 4) (10.244.1.221 executor 3): org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.newPythonProcess(PythonWorkerFactory.scala:179)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:240)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1442)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
        at java.net.ServerSocket.implAccept(ServerSocket.java:545)
        at java.net.ServerSocket.accept(ServerSocket.java:513)
        at org.apache.spark.api.python.PythonWorkerFactory.newPythonProcess(PythonWorkerFactory.scala:169)
        ... 15 more

We also changed to loader.log_level = "all", and attach the resulting logs.

Fork successful log:

...
debug: memory entry [0x3e007668]: 0x1c4bae000-0x1c4bb0000
debug: memory entry [0x3e0075d0]: 0x1c4bb0000-0x1c4bb2000
debug: restored memory from checkpoint
debug: receiving 360 PAL handles
debug: restoring checkpoint at 0x3e000000 rebased from 0x3e000000
debug: Successfully retrieved special key "_sgx_mrsigner"
[P3:T105:java] trace: ---- return from futex(...) = -110
[P3:T105:java] trace: ---- futex(0x1b811d428, FUTEX_PRIVATE|FUTEX_WAKE, 1, 0, 0x1b811d474, 1) ...
[P3:T105:java] trace: ---- return from futex(...) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5cc0) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5c80) = 0x0
[P3:T105:java] trace: ---- futex(0x1b811d478, FUTEX_PRIVATE|FUTEX_WAIT_BITSET, 0, 0x1c08e5c70, 0, -1) ...
[P3:T105:java] trace: ---- return from futex(...) = -110
[P3:T105:java] trace: ---- futex(0x1b811d428, FUTEX_PRIVATE|FUTEX_WAKE, 1, 0, 0x1b811d474, 1) ...
[P3:T105:java] trace: ---- return from futex(...) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5cc0) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5c80) = 0x0
[P3:T105:java] trace: ---- futex(0x1b811d478, FUTEX_PRIVATE|FUTEX_WAIT_BITSET, 0, 0x1c08e5c70, 0, -1) ...
debug: Creating pipe: pipe.srv:c89efc8b4ea0b35d6b545ec5507273f4232907ff0f76d2e21810436c00af522b
[P4:T127:java] debug: successfully restored checkpoint at 0x3e000000 - 0x3e128a20
[P4:T127:java] debug: Creating pipe: pipe.srv:dcebcca739a5b75edb02fac64675e3018f6b5017eba31953f43617c21cbea21a
[P4:T127:java] debug: Creating pipe: pipe.srv:4
[P4:T127:java] debug: Creating pipe: pipe.srv:beaf795dd51b83f11709c536192c8cee637688034da3d4a152abf7759f374749
[P4:libos] debug: IPC worker started
[P4:T127:java] debug: ipc_get_id_owner: sending a request: 0
[P4:T127:java] debug: Sending ipc message to 1
[P4:T127:java] debug: Waiting for a response to 1
[P1:libos] debug: IPC worker: received IPC message from 4: code=6 size=21 seq=1
[P1:libos] debug: ipc_get_id_owner_callback: find_id_owner(0): 0
[P1:libos] debug: Sending ipc message to 4
[P4:libos] debug: IPC worker: received IPC message from 1: code=0 size=21 seq=1
[P4:libos] debug: Got an IPC response from 1, seq: 1
[P4:T127:java] debug: Waiting finished: 0
[P4:T127:java] debug: ipc_get_id_owner: got a response: 0
[P3:T105:java] trace: ---- return from futex(...) = -110
[P3:T105:java] trace: ---- futex(0x1b811d428, FUTEX_PRIVATE|FUTEX_WAKE, 1, 0, 0x1b811d474, 1) ...
[P3:T105:java] trace: ---- return from futex(...) = 0x0
[P4:T127:java] debug: LibOS initialized
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c08e5c80) = 0x0
[P3:T105:java] trace: ---- futex(0x1b811d478, FUTEX_PRIVATE|FUTEX_WAIT_BITSET, 0, 0x1c08e5c70, 0, -1) ...
[P4:T127:java] trace: ---- close(341) = 0x0
[P3:T120:java] trace: ---- return from vfork(...) = 0x7f
[P4:T127:java] trace: ---- close(342) = 0x0
[P3:T120:java] trace: ---- close(349) = 0x0
[P4:T127:java] trace: ---- close(344) = 0x0
[P3:T120:java] trace: ---- read(348, 0x1a7bfc994, 0x4) ...
[P4:T127:java] trace: ---- close(346) = 0x0
[P4:T127:java] trace: ---- close(347) = 0x0
[P4:T127:java] trace: ---- close(348) = 0x0
[P4:T127:java] trace: ---- dup2(340, 0) = 0x0
[P4:T127:java] trace: ---- close(340) = 0x0
[P4:T127:java] trace: ---- dup2(343, 1) = 0x1
[P4:T127:java] trace: ---- close(343) = 0x0
[P4:T127:java] trace: ---- dup2(345, 2) = 0x2
[P4:T127:java] trace: ---- close(345) = 0x0
[P4:T127:java] trace: ---- dup2(349, 3) = 0x3
[P4:T127:java] trace: ---- close(349) = 0x0
[P4:T127:java] trace: ---- close(4) = 0x0
[P4:T127:java] trace: ---- close(5) = 0x0
[P4:T127:java] trace: ---- openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|0x90800, 0000) = 0x4
[P4:T127:java] trace: ---- newfstatat(4, "", 0x1a7bfc810, 4096) = 0x0
[P4:T127:java] trace: ---- getdents64(4, 0x447e6660, 0x8000) = 0x2aa0
[P4:T127:java] trace: ---- close(339) = 0x0
[P4:T127:java] trace: ---- close(338) = 0x0
[P4:T127:java] trace: ---- close(337) = 0x0
[P4:T127:java] trace: ---- close(336) = 0x0
[P4:T127:java] trace: ---- close(335) = 0x0
[P4:T127:java] trace: ---- close(334) = 0x0
[P4:T127:java] trace: ---- close(333) = 0x0
[P4:T127:java] trace: ---- close(332) = 0x0
[P4:T127:java] trace: ---- close(331) = 0x0
[P4:T127:java] trace: ---- close(330) = 0x0
[P4:T127:java] trace: ---- close(329) = 0x0
[P4:T127:java] trace: ---- close(328) = 0x0
[P4:T127:java] trace: ---- close(327) = 0x0
[P4:T127:java] trace: ---- close(326) = 0x0
[P4:T127:java] trace: ---- close(325) = 0x0
[P4:T127:java] trace: ---- close(324) = 0x0
[P4:T127:java] trace: ---- close(323) = 0x0
[P4:T127:java] trace: ---- close(322) = 0x0
[P4:T127:java] trace: ---- close(321) = 0x0
[P4:T127:java] trace: ---- close(320) = 0x0
[P4:T127:java] trace: ---- close(319) = 0x0
[P4:T127:java] trace: ---- close(318) = 0x0
[P4:T127:java] trace: ---- close(317) = 0x0
[P4:T127:java] trace: ---- close(316) = 0x0
[P4:T127:java] trace: ---- close(315) = 0x0
[P4:T127:java] trace: ---- close(314) = 0x0
[P4:T127:java] trace: ---- close(313) = 0x0
[P4:T127:java] trace: ---- close(312) = 0x0
[P4:T127:java] trace: ---- close(311) = 0x0
[P4:T127:java] trace: ---- close(310) = 0x0
[P4:T127:java] trace: ---- close(309) = 0x0
[P4:T127:java] trace: ---- close(308) = 0x0
[P4:T127:java] trace: ---- close(307) = 0x0
[P4:T127:java] trace: ---- close(306) = 0x0
[P4:T127:java] trace: ---- close(305) = 0x0
[P4:T127:java] trace: ---- close(304) = 0x0
[P4:T127:java] trace: ---- close(303) = 0x0
[P4:T127:java] trace: ---- close(302) = 0x0
[P4:T127:java] trace: ---- close(301) = 0x0
[P4:T127:java] trace: ---- close(300) = 0x0
[P4:T127:java] trace: ---- close(299) = 0x0
[P4:T127:java] trace: ---- close(298) = 0x0
[P4:T127:java] trace: ---- close(297) = 0x0
[P4:T127:java] trace: ---- close(296) = 0x0
[P4:T127:java] trace: ---- close(295) = 0x0
[P4:T127:java] trace: ---- close(294) = 0x0
[P4:T127:java] trace: ---- close(293) = 0x0
[P4:T127:java] trace: ---- close(292) = 0x0
[P4:T127:java] trace: ---- close(291) = 0x0
[P4:T127:java] trace: ---- close(290) = 0x0
[P4:T127:java] trace: ---- close(289) = 0x0
...
[P4:T127:java] trace: ---- close(24) = 0x0
[P4:T127:java] trace: ---- close(23) = 0x0
[P4:T127:java] trace: ---- close(22) = 0x0
[P4:T127:java] trace: ---- close(21) = 0x0
[P4:T127:java] trace: ---- close(20) = 0x0
[P4:T127:java] trace: ---- close(19) = 0x0
[P4:T127:java] trace: ---- close(18) = 0x0
[P4:T127:java] trace: ---- close(17) = 0x0
[P4:T127:java] trace: ---- close(16) = 0x0
[P4:T127:java] trace: ---- close(15) = 0x0
[P4:T127:java] trace: ---- close(14) = 0x0
[P4:T127:java] trace: ---- close(13) = 0x0
[P4:T127:java] trace: ---- close(12) = 0x0
[P4:T127:java] trace: ---- close(11) = 0x0
[P4:T127:java] trace: ---- close(10) = 0x0
[P4:T127:java] trace: ---- close(9) = 0x0
[P4:T127:java] trace: ---- close(8) = 0x0
[P4:T127:java] trace: ---- close(7) = 0x0
[P4:T127:java] trace: ---- close(6) = 0x0
[P4:T127:java] trace: ---- getdents64(4, 0x447e6660, 0x8000) = 0x0
[P4:T127:java] trace: ---- close(4) = 0x0
[P4:T127:java] trace: ---- fcntl(3, F_SETFD, 0x1) = 0x0
[P4:T127:java] trace: ---- execve("/usr/bin/python3.9", [/usr/bin/python3.9,-m,pyspark.worker,], [PATH=/usr/bin:/usr/sbin:/usr/bin:/:/sbin:/bin:/opt/jdk8/bin,MY_SPARK_UI_ZZY_15_PORT=tcp://10.111.188.36:4440,]) ...
[P3:T120:java] trace: ---- return from read(...) = 0x0
[P3:T120:java] trace: ---- close(340) = 0x0
[P3:T120:java] trace: ---- close(343) = 0x0
[P3:T120:java] trace: ---- close(345) = 0x0
[P3:T120:java] trace: ---- close(348) = 0x0
[P3:T120:java] trace: ---- close(346) = 0x0
[P3:T120:java] trace: ---- close(347) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfc3f0) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfc490) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfc410) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfc410) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfb930) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfb680) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfb6a0) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfb490) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfb540) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfb680) = 0x0
[P3:T120:java] trace: ---- clock_gettime(1, 0x1a7bfb680) = 0x0
...

Fork failed log:

debug: memory entry [0x3e007838]: 0x1c521f000-0x1c5220000
debug: memory entry [0x3e007668]: 0x1c5255000-0x1c5257000
debug: memory entry [0x3e0075d0]: 0x1c5257000-0x1c5259000
debug: restored memory from checkpoint
debug: receiving 364 PAL handles
debug: restoring checkpoint at 0x3e000000 rebased from 0x3e000000
debug: Successfully retrieved special key "_sgx_mrsigner"
[P3:T105:java] trace: ---- return from futex(...) = -110
[P3:T105:java] trace: ---- futex(0x1b8115828, FUTEX_PRIVATE|FUTEX_WAKE, 1, 0, 0x1b8115874, 1) ...
[P3:T105:java] trace: ---- return from futex(...) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c10e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c10e5cc0) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c10e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c10e5c80) = 0x0
[P3:T105:java] trace: ---- futex(0x1b8115878, FUTEX_PRIVATE|FUTEX_WAIT_BITSET, 0, 0x1c10e5c70, 0, -1) ...
[P3:T105:java] trace: ---- return from futex(...) = -110
[P3:T105:java] trace: ---- futex(0x1b8115828, FUTEX_PRIVATE|FUTEX_WAKE, 1, 0, 0x1b8115874, 1) ...
[P3:T105:java] trace: ---- return from futex(...) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c10e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c10e5cc0) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c10e5d60) = 0x0
[P3:T105:java] trace: ---- clock_gettime(1, 0x1c10e5c80) = 0x0
[P3:T105:java] trace: ---- futex(0x1b8115878, FUTEX_PRIVATE|FUTEX_WAIT_BITSET, 0, 0x1c10e5c70, 0, -1) ...
debug: Creating pipe: pipe.srv:00361776a4f3cb339730d032009d3d37f2c72a688f2759b11c6df12f3928bfa9
[P4:T188:java] debug: successfully restored checkpoint at 0x3e000000 - 0x3e111f78
[P4:T188:java] debug: Creating pipe: pipe.srv:4535af7e4304e9940dbeab50f7c85b256c0aca62d7276c85f03257afdf24796a
[P4:T188:java] debug: Creating pipe: pipe.srv:4
[P4:T188:java] debug: Creating pipe: pipe.srv:5b20aad53b8089e698d0de00cdd1b74c06c9af4a7e807975fc4753536fb7340c
[P4:libos] debug: IPC worker started
[P4:T188:java] debug: ipc_get_id_owner: sending a request: 0
[P4:T188:java] debug: Sending ipc message to 1
[P4:T188:java] debug: Waiting for a response to 1
[P1:libos] debug: IPC worker: received IPC message from 4: code=6 size=21 seq=1
[P1:libos] debug: ipc_get_id_owner_callback: find_id_owner(0): 0
[P1:libos] debug: Sending ipc message to 4
[P4:libos] debug: IPC worker: received IPC message from 1: code=0 size=21 seq=1
[P4:libos] debug: Got an IPC response from 1, seq: 1
[P4:T188:java] debug: Waiting finished: 0
[P4:T188:java] debug: ipc_get_id_owner: got a response: 0
[P4:T188:java] debug: LibOS initialized
[P4:T188:java] trace: ---- close(345) = 0x0
[P4:T188:java] trace: ---- close(346) = 0x0
[P3:T131:java] trace: ---- return from vfork(...) = 0xbc
[P4:T188:java] trace: ---- close(348) = 0x0
[P4:T188:java] trace: ---- close(350) = 0x0
P3:T131:java] trace: ---- close(353) = 0x0
[P4:T188:java] trace: ---- close(351) = 0x0
[P3:T131:java] trace: ---- read(352, 0x1a7bfc1f4, 0x4) ...
[P4:T188:java] trace: ---- close(352) = 0x0
[P4:T188:java] trace: ---- dup2(344, 0) = 0x0
[P4:T188:java] trace: ---- close(344) = 0x0
[P4:T188:java] trace: ---- dup2(347, 1) = 0x1
[P4:T188:java] trace: ---- close(347) = 0x0
[P4:T188:java] trace: ---- dup2(349, 2) = 0x2
[P4:T188:java] trace: ---- close(349) = 0x0
[P4:T188:java] trace: ---- dup2(353, 3) = 0x3
[P4:T188:java] trace: ---- close(353) = 0x0
[P4:T188:java] trace: ---- close(4) = 0x0
[P4:T188:java] trace: ---- close(5) = 0x0
[P4:T188:java] trace: ---- openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|0x90800, 0000) = 0x4
[P4:T188:java] trace: ---- newfstatat(4, "", 0x1a7bfc070, 4096) = 0x0
[P4:T188:java] trace: ---- writev(2, 0x1a7bfbe30, 2) = 0x30
[P4:T188:java] trace: ---- mmap(0, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0x0) ...
[P4:T188:java] trace: ---- return from mmap(...) = 0x1c3a27000
[P4:T188:java] trace: ---- rt_sigprocmask(UNBLOCK, [SIGABRT,], NULL, 0x8) = 0x0
[P4:T188:java] trace: ---- gettid() = 0xbc
[P4:T188:java] trace: ---- getpid() = 0xbc
[P4:T188:java] trace: ---- tgkill(188, 188, [SIGABRT]) = 0x0
[P4:T188:java] debug: killed by signal 6
[P4:T188:java] debug: Sending ipc message to 1
[P4:T188:java] debug: Waiting for a response to 2
[P1:libos] debug: IPC worker: received IPC message from 4: code=17 size=21 seq=2
[P1:libos] debug: clearing POSIX locks for pid 188
[P1:libos] debug: Sending ipc message to 4
[P4:libos] debug: IPC worker: received IPC message from 1: code=0 size=21 seq=2
[P4:libos] debug: Got an IPC response from 1, seq: 2
[P4:T188:java] debug: Waiting finished: 0
[P3:T131:java] trace: ---- return from read(...) = 0x0
[P3:T131:java] trace: ---- close(344) = 0x0
[P3:T131:java] trace: ---- close(347) = 0x0
[P3:T131:java] trace: ---- close(349) = 0x0
...
[P3:T131:java] trace: ---- close(347) = 0x0
[P3:T131:java] trace: ---- close(349) = 0x0
[P3:T131:java] trace: ---- close(352) = 0x0
[P3:T131:java] trace: ---- close(350) = 0x0
[P3:T131:java] trace: ---- close(351) = 0x0
[P4:T188:java] debug: Sending ipc message to 3
[P3:T131:java] trace: ---- clock_gettime(1, 0x1a7bfbc50) = 0x0
[P4:T188:java] debug: sync client shutdown: closing handles
[P4:T188:java] debug: sync client shutdown: waiting for confirmation
[P4:T188:java] debug: sync client shutdown: finished
[P4:T188:java] debug: ipc_release_id_range: sending a request: [188..188]
[P4:T188:java] debug: Sending ipc message to 1
[P3:libos] debug: IPC worker: received IPC message from 4: code=2 size=37 seq=0
[P3:libos] debug: IPC callback from 4: IPC_MSG_CHILDEXIT(3, 188, 0, 134)
[P4:T188:java] debug: ipc_release_id_range: ipc_send_message: 0
[P4:libos] debug: IPC worker: exiting worker thread
[P1:libos] debug: IPC worker: received IPC message from 4: code=4 size=25 seq=0
[P3:T3:java] trace: ---- return from futex(...) = -512
[P1:libos] debug: ipc_release_id_range_callback: release_id_range(188..188)
[P3:libos] debug: Child process (pid: 188) died
[P3:T3:java] trace: ---- futex(0x1c3c21910, FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 97, 0, 0, -1) ...
[P3:T3:java] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P3:T3:java] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P4:T188:java] debug: process 4 exited with status 134
debug: PalProcessExit: Returning exit code 134
[P3:T131:java] trace: ---- clock_gettime(1, 0x1a7bfbcf0) = 0x0
[P3:T131:java] trace: ---- clock_gettime(1, 0x1a7bfbc70) = 0x0

For complete logs, see the compressed package, which contains complete logs(success and failed) of spark executors.

thanks for help :)
gramine-logs.tar.gz

woju Oct 3, 2024
Collaborator

I'm sorry to be blunt, but @mkow point apparently did not get across, so at the risk of repeating ourselves, I'll state it once more: please don't use old insecure versions. The correct solution is to try to reproduce on a latest version, and if the bug persists, then we'll take a second look. I'm not willing to even look at the logs, it's pure waste of our time, not least because even if we found a problem there, we wouldn't release 1.3.2 at this point, we don't have resources to maintain stable branches. Any fix would be included only in latest version (I hope we'll release 1.8 soon), so to get the fix you'd have to update anyway.

It has been deployed to the customer's production environment, and we can no longer switch versions.

Please consider notifying your customer that the deployment is currently vulnerable to all kinds of data leaks and RCE because of this bug (among others): #1796. TL;DR is, after fork, trusted_files are no longer, they behave like allowed_files, so an attacker can substitute any file, looking at your manifest that would be any binary and library in /usr, so RCE is likely. I think given control of the outside OS I would be able to extract your encrypted files from /ppml/ and any other secrets that your customer might care to hold in the enclaves.

This was fixed in 1.6.2 BTW: https://github.com/gramineproject/gramine/releases/tag/v1.6.2

I hope that information will help you persuade whoever is responsible for the decision to update.

zzylydx Oct 5, 2024
Author

Thank you very much for your reply in your busy schedule. We will decide whether to try the latest version of Gramine after evaluation. If there still have the problem, I will continue to feedback. Thank you very much for your support.

zzylydx Oct 11, 2024
Author

Gramine v1.7 with spark Problems

Gramine Version: v1.7

Gramine Manifest:

libos.entrypoint = "{{ execdir }}/bash"
loader.entrypoint = "file:{{ gramine.libos }}"
#loader.pal_internal_mem_size = "512M"
loader.log_level = "{{ log_level }}"
loader.insecure__use_host_env = true
loader.env.LD_PRELOAD  = ""

sys.enable_extra_runtime_domain_names_conf = true
sys.insecure__allow_eventfd = true

#loader.insecure__use_cmdline_argv = true
loader.insecure_disable_aslr = true
#loader.argv_src_file = "file:/ppml/trusted-big-data-ml/secured_argvs"
loader.argv_src_file = "file:/ppml/secured_argvs"

sgx.remote_attestation = "dcap"
sgx.ra_client_spid = ""
sgx.allow_file_creation = true
sgx.debug = false
#sgx.nonpie_binary = true
sgx.enclave_size = "16G"
sgx.max_threads = 1024
sgx.file_check_policy = "allow_all_but_log"
#sgx.static_address = 1
sgx.isvprodid = 1
sgx.isvsvn = 3
sys.stack.size = "64M"

# https://github.com/gramineproject/examples/blob/v1.7/openjdk/java.manifest.template
# https://github.com/gramineproject/gramine/discussions/1704
sgx.use_exinfo = true

loader.env.LD_LIBRARY_PATH = "/lib:{{ arch_libdir }}:/usr{{ arch_libdir }}:/usr/lib/python3.9/lib:/usr/lib:{{ jdk_home }}:{{ jdk_home }}/lib/amd64/jli:/ppml/lib"
loader.env.PATH = "{{ execdir }}:/usr/sbin:/usr/bin:/:/sbin:/bin:{{ jdk_home }}/bin"
#loader.env.PYTHONHOME = "/usr/lib/python3.9"
loader.env.PYTHONPATH = "/usr/lib/python3.9:/usr/lib/python3.9/lib-dynload:/usr/local/lib/python3.9/dist-packages:/usr/lib/python3/dist-packages:/ppml/bigdl-ppml/src"
loader.env.JAVA_HOME = "{{ jdk_home }}"
loader.env.JAVA_OPTS = "'-Djava.library.path={{ jdk_home }}/lib -Dsun.boot.library.path={{ jdk_home }}/lib'"
loader.env.SPARK_USER = "{{ spark_user }}"
loader.env.SPARK_SCALA_VERSION = "2.12"
loader.env.SPARK_HOME = "/opt/spark"
loader.env.SPARK_CONF_DIR = "/opt/spark/conf"
loader.env.SPARK_JARS_DIR = "/opt/spark/jars"
loader.env.PYSPARK_PYTHON = "/usr/bin/python3.9"

# Python's NumPy spawns as many threads as there are CPU cores, and each thread
# consumes a chunk of memory, so on large machines 1G enclave size may be not enough.
# We limit the number of spawned threads via OMP_NUM_THREADS env variable.
loader.env.OMP_NUM_THREADS = "4"


fs.mounts = [
  { path = "{{ arch_libdir }}", uri = "file:{{ arch_libdir }}" },
  { path = "/usr{{ arch_libdir }}", uri = "file:/usr{{ arch_libdir }}" },
  { path = "{{ execdir }}", uri = "file:{{ execdir }}" },
  { path = "/usr/lib", uri = "file:/usr/lib" },
  { path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
  { path = "/usr/local", uri = "file:/usr/local" },
  { path = "/etc", uri = "file:/etc" },
  { path = "/usr/local/etc", uri = "file:/etc" },
  { path = "/opt", uri = "file:/opt" },
  { path = "/bin", uri = "file:/bin" },
  { path = "/tmp", uri = "file:/tmp" },
  { path = "/usr/lib/python3.9", uri = "file:/usr/lib/python3.9" },
  { path = "/usr/lib/python3/dist-packages", uri = "file:/usr/lib/python3/dist-packages" },
  { path = "/root/.kube/", uri = "file:/root/.kube/" },
  { path = "/root/.keras", uri = "file:/root/.keras" },
  { path = "/root/.m2", uri = "file:/root/.m2" },
  { path = "/root/.zinc", uri = "file:/root/.zinc" },
  { path = "/root/.cache", uri = "file:/root/.cache" },
  { path = "/usr/lib/gcc", uri = "file:/usr/lib/gcc" },
  { path = "/ppml", uri = "file:/ppml" },
  { path = "/root/.jupyter", uri = "file:/root/.jupyter" },
  { type = "encrypted", path = "/ppml/encrypted-fs", uri = "file:/ppml/encrypted-fs", key_name = "_sgx_mrsigner" },
  { type = "encrypted", path = "/ppml/encrypted-fsd", uri = "file:/ppml/encrypted-fsd", key_name = "sgx_data_key" },
  { type = "encrypted", path = "/ppml/data/keys/", uri = "file:/ppml/data/keys/", key_name = "_sgx_mrsigner" },
  { path = "/opt/spark/conf", uri = "file:/opt/spark/conf-copy" },
  { path = "/opt/spark/logs-conf", uri = "file:/opt/spark/logs-conf" },
  { path = "/opt/spark/pod-template", uri = "file:/opt/spark/pod-template-copy" },
  { path = "/opt/spark/work-dir", uri = "file:/opt/spark/work-dir" },
  { path = "/app/log", uri = "file:/app/log" },
]
#  { path = "{{ gramine.runtimedir() }}/etc/localtime", uri = "file:/etc" },

sgx.trusted_files = [
  "file:{{ gramine.libos }}",
  "file:{{ gramine.runtimedir() }}/",
  "file:{{ arch_libdir }}/",
  "file:/usr/{{ arch_libdir }}/",
  "file:{{ execdir }}/",
  #"file:/ppml/trusted-big-data-ml/secured_argvs",
  "file:/ppml/secured_argvs",
  "file:/ppml/scripts/ailand-kms/",
]

sgx.allowed_files = [
  "file:scripts/",
  "file:/etc",
  "file:/tmp",
  "file:{{ jdk_home }}",
  "file:/ppml",
  "file:{{ python_home }}",
  "file:/usr/lib/python3",
  "file:/usr/local/lib/python3.9/dist-packages",
  "file:/root/.keras",
  "file:/root/.m2",
  "file:/root/.zinc",
  "file:/root/.cache",
  "file:/usr/lib/gcc",
  "file:/root/.kube/config",
  "file:/etc/localtime",
  "file:/opt/spark",
  "file:/usr/bin",
  "file:/app/log",
]

sys.ioctl_structs.ifreq = [
  { size = 16, direction = "out" }, # ifr_name
  { size = 2, direction = "in" },   # ifr_flags
]

# below IOCTL is for socket ioctl tests (e.g. `sockioctl01`); note that there is no additional
# sanitization of these IOCTLs but this is only for testing anyway
sys.ioctl_structs.ifconf = [
  # When ifc_req is NULL, direction of ifc_len is out. Otherwise, direction is in.
  { size = 4, direction = "inout", name = "ifc_len" },  # ifc_len
  { size = 4, direction = "none" },                     # padding
  { ptr = [ { size = "ifc_len", direction = "in" } ] }, # ifc_req
]

sys.allowed_ioctls = [
  { request_code = 0x8912, struct = "ifconf" }, # SIOCGIFCONF
  { request_code = 0x8913, struct = "ifreq" },  # SIOCGIFFLAGS
]

When we switched the Gramine version to v1.7, many errors occurred when executing spark tasks. We guess that the problems encountered are most likely caused by improper configuration of Gramine, but because we are not familiar with the latest version, we don't know how to configure it. Please give some guidance or test methods based on my scenario and some simple error logs. There is nothing we can do about these problems right now.

First, we encountered an IOCTL error, and the error was solved by referring to the Gramine github issue: Manifest definition to allow SIOCGIFCONF IOCTL #1733 .

java.net.SocketException: Inappropriate ioctl for device (ioctl(SIOCGIFCONF) failed)

We added the following configuration to the manifest:

sys.ioctl_structs.ifreq = [
  { size = 16, direction = "out" }, # ifr_name
  { size = 2, direction = "in" },   # ifr_flags
]

# below IOCTL is for socket ioctl tests (e.g. `sockioctl01`); note that there is no additional
# sanitization of these IOCTLs but this is only for testing anyway
sys.ioctl_structs.ifconf = [
  # When ifc_req is NULL, direction of ifc_len is out. Otherwise, direction is in.
  { size = 4, direction = "inout", name = "ifc_len" },  # ifc_len
  { size = 4, direction = "none" },                     # padding
  { ptr = [ { size = "ifc_len", direction = "in" } ] }, # ifc_req
]

sys.allowed_ioctls = [
  { request_code = 0x8912, struct = "ifconf" }, # SIOCGIFCONF
  { request_code = 0x8913, struct = "ifreq" },  # SIOCGIFFLAGS
]

Second, an Amazon S3 Exception error was reported at the beginning of the task, and we are sure that there are no problems with S3 related services and dependencies. Because if we switch back to the old version of Gramine image, this error will disappear immediately. The only difference between the two images is the version of Gramine. At the same time, there is no abnormality in Gramine's log. The attachment is Gramine's log_level=all logs.
yeqc-pyspark-245ba29275a6a695-driver.log.tar.gz

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/ppml/spark-3.1.3/jars/slf4j-reload4j-1.7.35.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/ppml/spark-3.1.3/jars/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
2024-10-11 14:31:32,011 INFO  org.apache.spark.SecurityManager                              - Changing view acls to: root
2024-10-11 14:31:34,837 INFO  org.apache.spark.SecurityManager                              - Changing modify acls to: root
2024-10-11 14:31:35,545 INFO  org.apache.spark.SecurityManager                              - Changing view acls groups to: 
2024-10-11 14:31:36,005 INFO  org.apache.spark.SecurityManager                              - Changing modify acls groups to: 
2024-10-11 14:31:36,329 INFO  org.apache.spark.SecurityManager                              - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2024-10-11 14:33:41,455 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig                 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-10-11 14:33:51,100 INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl             - Scheduled Metric snapshot period at 10 second(s).
2024-10-11 14:33:51,110 INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl             - s3a-file-system metrics system started
Exception in thread "main" java.nio.file.AccessDeniedException: s3a://zlg-contract-lite/fileupload/spark-upload-961b63e6-d31f-4065-8fef-7b695dd03de4/rdd_test.py: getFileStatus on s3a://zlg-contract-lite/fileupload/spark-upload-961b63e6-d31f-4065-8fef-7b695dd03de4/rdd_test.py: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null:403 Forbidden
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:230)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:151)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2198)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2163)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2102)
	at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1717)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.isFile(S3AFileSystem.java:3006)
	at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:802)
	at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:779)
	at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:138)
	at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:910)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1069)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1078)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1271)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$4(S3AFileSystem.java:1249)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1246)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2183)
	... 17 more
2024-10-11 14:38:55,566 INFO  org.apache.spark.util.ShutdownHookManager                     - Shutdown hook called
2024-10-11 14:38:56,242 INFO  org.apache.spark.util.ShutdownHookManager                     - Deleting directory /tmp/spark-6be3710f-fb9d-490d-90c8-9188684492f8
2024-10-11 14:38:57,779 INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl             - Stopping s3a-file-system metrics system...
2024-10-11 14:38:57,884 INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl             - s3a-file-system metrics system stopped.
2024-10-11 14:38:58,273 INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl             - s3a-file-system metrics system shutdown complete.

Third, When we avoid obtaining remote files through S3 and instead use locally mounted files to execute spark tasks, some network errors will appear in the executors.

    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/ppml/spark-3.1.3/jars/slf4j-reload4j-1.7.35.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/ppml/spark-3.1.3/jars/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
     INFO [2024-10-11 09:02:46,247] ({main} Utils.scala[initDaemon]:2648) - Started daemon with process name: 3@spark-pi-d11852927ad74288-exec-3
     INFO [2024-10-11 09:02:51,291] ({main} Logging.scala[logInfo]:57) - Registering signal handler for TERM
     INFO [2024-10-11 09:02:51,714] ({main} Logging.scala[logInfo]:57) - Registering signal handler for HUP
     INFO [2024-10-11 09:02:51,714] ({main} Logging.scala[logInfo]:57) - Registering signal handler for INT
     INFO [2024-10-11 09:05:07,213] ({main} Logging.scala[logInfo]:57) - Changing view acls to: root
     INFO [2024-10-11 09:05:07,398] ({main} Logging.scala[logInfo]:57) - Changing modify acls to: root
     INFO [2024-10-11 09:05:07,580] ({main} Logging.scala[logInfo]:57) - Changing view acls groups to: 
     INFO [2024-10-11 09:05:07,707] ({main} Logging.scala[logInfo]:57) - Changing modify acls groups to: 
     INFO [2024-10-11 09:05:07,988] ({main} Logging.scala[logInfo]:57) - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
     WARN [2024-10-11 09:06:31,146] ({netty-rpc-connection-0} Logging.scala[logWarning]:69) - NettyRpcEnv.createClient address: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
     INFO [2024-10-11 09:06:31,577] ({netty-rpc-connection-0} TransportClientFactory.java[createClient]:233) - TransportClientFactory.createClient2 remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
     INFO [2024-10-11 09:06:31,668] ({netty-rpc-connection-0} TransportClientFactory.java[createClient]:154) - TransportClientFactory.createClient remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
     INFO [2024-10-11 09:06:38,883] ({netty-rpc-connection-0} TransportClientFactory.java[createClient]:190) - TransportClientFactory.createClient resolvedAddress: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
     INFO [2024-10-11 09:06:39,654] ({netty-rpc-connection-0} TransportClientFactory.java[createClient]:194) - DNS resolution failed for yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078 took 6619 ms
     WARN [2024-10-11 09:06:49,632] ({netty-rpc-connection-0} MacAddressUtil.java[defaultMachineId]:142) - Failed to find a usable hardware address from the network interfaces; using random bytes: 63:76:62:17:d6:9c:03:0a
     WARN [2024-10-11 09:07:19,477] ({netty-rpc-connection-1} Logging.scala[logWarning]:69) - NettyRpcEnv.createClient address: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
     INFO [2024-10-11 09:07:19,477] ({netty-rpc-connection-1} TransportClientFactory.java[createClient]:233) - TransportClientFactory.createClient2 remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
     INFO [2024-10-11 09:07:19,555] ({netty-rpc-connection-1} TransportClientFactory.java[createClient]:154) - TransportClientFactory.createClient remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
     INFO [2024-10-11 09:07:19,664] ({netty-rpc-connection-1} TransportClientFactory.java[createClient]:190) - TransportClientFactory.createClient resolvedAddress: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
     INFO [2024-10-11 09:07:19,706] ({netty-rpc-connection-1} TransportClientFactory.java[createClient]:197) - DNS resolution failed for yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078 took 73 ms
     WARN [2024-10-11 09:07:22,097] ({netty-rpc-connection-2} Logging.scala[logWarning]:69) - NettyRpcEnv.createClient address: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
     INFO [2024-10-11 09:07:22,162] ({netty-rpc-connection-2} TransportClientFactory.java[createClient]:233) - TransportClientFactory.createClient2 remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
     INFO [2024-10-11 09:07:22,178] ({netty-rpc-connection-2} TransportClientFactory.java[createClient]:154) - TransportClientFactory.createClient remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
     INFO [2024-10-11 09:07:22,312] ({netty-rpc-connection-2} TransportClientFactory.java[createClient]:190) - TransportClientFactory.createClient resolvedAddress: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
     INFO [2024-10-11 09:07:22,312] ({netty-rpc-connection-2} TransportClientFactory.java[createClient]:197) - DNS resolution failed for yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078 took 45 ms
    Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
    	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:402)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:391)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
    Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:422)
    	at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
    	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
    	at scala.collection.immutable.Range.foreach(Range.scala:158)
    	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:420)
    	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
    	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at javax.security.auth.Subject.doAs(Subject.java:422)
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    	... 4 more
    Caused by: java.io.IOException: Failed to connect to yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
    	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:294)
    	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:221)
    	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:235)
    	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:205)
    	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
    	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    Caused by: java.net.UnknownHostException: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc
    	at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
    	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    	at java.net.InetAddress.getByName(InetAddress.java:1077)
    	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
    	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
    	at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
    	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
    	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
    	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
    	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
    	at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
    	at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:206)
    	at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:46)
    	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:180)
    	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:166)
    	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
    	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
    	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
    	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
    	at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
    	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
    	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
    	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:984)
    	at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:504)
    	at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:417)
    	at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:474)
    	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    	... 1 more
    ```

    A very frequent error is:

    ```bash
     WARN [2024-10-11 09:06:49,632] ({netty-rpc-connection-0} MacAddressUtil.java[defaultMachineId]:142) - Failed to find a usable hardware address from the network interfaces; using random bytes: 63:76:62:17:d6:9c:03:0a
    ```

Thanks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spark gramine k8s cluster frequently fails to execute pyspark task fork #2010

{{title}}

Replies: 1 comment 11 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Spark gramine k8s cluster frequently fails to execute pyspark task fork #2010

zzylydx Sep 27, 2024

Replies: 1 comment · 11 replies

monavij Sep 27, 2024 Collaborator

mkow Oct 3, 2024 Maintainer

zzylydx Oct 3, 2024 Author

Environmental Information

Problem Description

woju Oct 3, 2024 Collaborator

zzylydx Oct 5, 2024 Author

zzylydx Oct 11, 2024 Author

Gramine v1.7 with spark Problems

zzylydx
Sep 27, 2024

Replies: 1 comment 11 replies

monavij
Sep 27, 2024
Collaborator

mkow Oct 3, 2024
Maintainer

zzylydx Oct 3, 2024
Author

woju Oct 3, 2024
Collaborator

zzylydx Oct 5, 2024
Author

zzylydx Oct 11, 2024
Author