Python program running slower inside Gvisor sandbox with ARM64 #10487

Open
sfc-gh-jyin opened this issue May 29, 2024 · 7 comments
Labels
area: platform Issue related to platforms (kvm, ptrace) status: help wanted Extra attention is needed type: bug Something isn't working

Comments

@sfc-gh-jyin

sfc-gh-jyin commented May 29, 2024

Description

Hello,

We are currently benchmarking the CPU performance of gVisor against plain Docker, and found that the same Python program consistently runs slower in gVisor than on the native kernel, or even in Docker.

Note that we are aware of the overhead introduced by syscall interception, but we are testing pure CPU performance, and our test script does not issue syscalls.

The largest difference we have observed so far is on an AWS c6gd.2xlarge instance. However, when running the same suite on the c7 instance family, gVisor's performance is close to the native kernel. We are wondering what the root cause of this might be, and how we can configure gVisor to perform better.

Test Environment: AWS c6gd.2xlarge instance with the AL2 AMI. Python version: 3.7.16.
Test script (a very simple pi calculation):

import time

def calculate_pi(n):
    # Approximate pi with the Leibniz series: 4 - 4/3 + 4/5 - 4/7 + ...
    pi = 0
    sign = 1
    for i in range(1, n * 2, 2):
        pi += sign * (4 / i)
        sign *= -1
    return pi

if __name__ == "__main__":
    iterations = 100000000
    start = time.time() * 1000
    pi_approx = calculate_pi(iterations)
    print(time.time() * 1000 - start)  # elapsed time in milliseconds

Running on native kernel:

$ python3 /tmp/pitest.py
16087.6728515625

Running with docker container:

$ sudo docker run -v /tmp/pitest.py:/tmp/pitest.py amazonlinux:2 yum install -y python3 && python3 /tmp/pitest.py
16197.6875

Running with runsc:

$ ./bin/runsc --network=none --rootless --platform=systrap run id-10
17876.38525390625

In all three cases, the process consumes nearly 100% of a CPU the whole time. However, when I use the perf tool to check the stats, it shows that the process started by gVisor runs with roughly 10-15% lower instructions per cycle:

  • Native:
    27,517,274,919      cycles                    #    2.482 GHz                      (29.92%)
    88,730,249,747      instructions              #    3.22  insn per cycle         
                                                  #    0.01  stalled cycles per insn  (29.94%)
         11,087.35 msec cpu-clock                 #    0.923 CPUs utilized          
    38,231,913,137      cache-references          # 3448.309 M/sec                    (30.03%)
           258,817      cache-misses              #    0.001 % of all cache refs      (20.02%)
        34,731,416      branch-misses                                                 (20.02%)
     1,036,739,378      stalled-cycles-frontend   #    3.77% frontend cycles idle     (20.02%)
     1,103,210,544      stalled-cycles-backend    #    4.01% backend cycles idle      (20.02%)
                 5      sched:sched_switch        #    0.000 K/sec                  
    11,080,134,270      sched:sched_stat_runtime  #  999.367 M/sec                  
                 1      page-faults               #    0.000 K/sec                  
           281,234      L1-dcache-load-misses                                         (20.02%)
                 0      cpu-migrations            #    0.000 K/sec                  
         11,086.95 msec task-clock                #    0.923 CPUs utilized          
    27,549,011,550      bus-cycles                # 2484.770 M/sec                    (20.00%)
    38,076,864,326      mem_access                # 3434.324 M/sec                    (19.91%)

  • Gvisor:
    28,392,817,179      cycles                    #    2.482 GHz                      (29.98%)
    81,657,764,151      instructions              #    2.88  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (30.07%)
         11,440.44 msec cpu-clock                 #    0.953 CPUs utilized          
    35,172,201,924      cache-references          # 3074.430 M/sec                    (30.15%)
           922,867      cache-misses              #    0.003 % of all cache refs      (20.11%)
        32,162,856      branch-misses                                                 (20.02%)
       670,976,075      stalled-cycles-frontend   #    2.36% frontend cycles idle     (19.93%)
     3,126,416,518      stalled-cycles-backend    #   11.01% backend cycles idle      (19.93%)
                 1      sched:sched_switch        #    0.000 K/sec                  
    11,440,714,362      sched:sched_stat_runtime  # 1000.042 M/sec                  
                 1      page-faults               #    0.000 K/sec                  
           259,948      L1-dcache-load-misses                                         (19.93%)
                 0      cpu-migrations            #    0.000 K/sec                  
         11,440.04 msec task-clock                #    0.952 CPUs utilized          
    28,426,171,668      bus-cycles                # 2484.754 M/sec                    (19.93%)
    35,158,019,280      mem_access                # 3073.190 M/sec                    (19.93%)

We suspect this could be due to memory access latency, as stalled-cycles-backend is significantly higher in the gVisor case than in the others.
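
For context, here is a minimal sketch of how counters like these could be collected. The exact perf invocation is not given in the report, so the event list below is an assumption reconstructed from the counter names printed above (the sched:* tracepoints and the raw bus-cycles/mem_access counters are omitted):

import subprocess

# Hypothetical reconstruction of the measurement: run the benchmark under
# perf stat with an explicit event list and let perf print the summary.
EVENTS = ",".join([
    "cycles", "instructions", "task-clock", "cpu-clock",
    "cache-references", "cache-misses", "branch-misses",
    "stalled-cycles-frontend", "stalled-cycles-backend",
    "page-faults", "cpu-migrations", "L1-dcache-load-misses",
])

# "--" separates perf's own options from the profiled command.
subprocess.run(["perf", "stat", "-e", EVENTS, "--", "python3", "/tmp/pitest.py"],
               check=True)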

Steps to reproduce

  1. Create an AWS instance from the c6 family, e.g. c6gd.2xlarge.
  2. Run the script above in the three environments (native, Docker, runsc).

runsc version

runsc version 0.0.0
spec: 1.1.0-rc.1

docker version (if using docker)

No response

uname

5.10.216-204.855.amzn2.aarch64 #1 SMP Sat May 4 16:53:24 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

No response

@sfc-gh-jyin sfc-gh-jyin added the type: bug Something isn't working label May 29, 2024
@sfc-gh-jyin
Author

sfc-gh-jyin commented May 31, 2024

After some investigation, I found that one potential cause is that with gVisor, the Sentry implements its own runsc-memfd-backed memory and maps the application's virtual address space to memfd offsets with its own VMAs. However, even after the initial page fault, memory accesses (especially memory writes) are slower.

This does not seem to be directly related to gVisor: after the initial page fault following mmap, memory access should behave the same as on the native Linux kernel. The issue appears to be memfd-backed memory itself. For some reason, on the c6gd AWS instance family, memory writes through a memfd mapping consistently tend to be around 5% slower than writes to directly (anonymously) mapped memory.
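
A rough way to check this claim outside gVisor is to time bulk writes into an anonymous private mapping versus a memfd-backed shared mapping. The sketch below is only an illustration of that comparison, not the benchmark used here; it assumes Linux with Python 3.8+ (for os.memfd_create), and the buffer size and access pattern are arbitrary:

import mmap
import os
import time

SIZE = 256 * 1024 * 1024     # 256 MiB working set (arbitrary)
PATTERN = b"\x55" * SIZE     # built once so it is not part of the timing

def timed_write(buf):
    # Touch every page first so the timed pass measures steady-state writes
    # rather than first-touch page faults.
    for off in range(0, SIZE, 4096):
        buf[off] = 1
    start = time.time()
    buf[:] = PATTERN
    return time.time() - start

# Anonymous private mapping, i.e. ordinary process memory.
anon = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
print("anonymous mmap:", timed_write(anon))
anon.close()

# memfd-backed shared mapping, loosely resembling the Sentry's runsc-memfd backing.
fd = os.memfd_create("bench")
os.ftruncate(fd, SIZE)
memfd_map = mmap.mmap(fd, SIZE, flags=mmap.MAP_SHARED)
print("memfd mmap:    ", timed_write(memfd_map))
memfd_map.close()
os.close(fd)

If the roughly 5% gap reproduces with a standalone test like this, that would point at the backing memory type rather than the Sentry itself.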

@avagin
Collaborator

avagin commented May 31, 2024

@sfc-gh-jyin I think I found the real root cause of this issue. gVisor never sets the SSBS bit in pstate. With the following patch, I get the same results inside and outside gVisor:

diff --git a/pkg/sentry/arch/arch_aarch64.go b/pkg/sentry/arch/arch_aarch64.go
index 04262f6c5..e4a1d0187 100644
--- a/pkg/sentry/arch/arch_aarch64.go
+++ b/pkg/sentry/arch/arch_aarch64.go
@@ -257,12 +257,14 @@ func (s *State) FullRestore() bool {
 func New(arch Arch) *Context64 {
        switch arch {
        case ARM64:
-               return &Context64{
+               c:= &Context64{
                        State{
                                fpState: fpu.NewState(),
                        },
-                       []fpu.State(nil),
+                       []fpu.State{nil},
                }
+               c.Regs.Pstate |= linux.PSR_SSBS_BIT
+               return c
        }
        panic(fmt.Sprintf("unknown architecture %v", arch))
 }
diff --git a/pkg/sentry/arch/signal_arm64.go b/pkg/sentry/arch/signal_arm64.go
index 1118d6a7f..959d6068b 100644
--- a/pkg/sentry/arch/signal_arm64.go
+++ b/pkg/sentry/arch/signal_arm64.go
@@ -157,7 +157,7 @@ func (regs *Registers) validRegs() bool {
        }
 
        // Force PSR to a valid 64-bit EL0t
-       regs.Pstate &= linux.PSR_N_BIT | linux.PSR_Z_BIT | linux.PSR_C_BIT | linux.PSR_V_BIT
+       regs.Pstate &= linux.PSR_N_BIT | linux.PSR_Z_BIT | linux.PSR_C_BIT | linux.PSR_V_BIT | linux.PSR_SSBS_BIT
        return false
 }

This isn't a proper fix. We need to figure out when SSBS should be set.

@avagin avagin added status: help wanted Extra attention is needed area: platform Issue related to platforms (kvm, ptrace) labels May 31, 2024
@sfc-gh-jyin
Author

sfc-gh-jyin commented May 31, 2024

Thank you @avagin! I tried your patch and it did help! Can we get this fix merged into main? Also, do you know why this issue did not manifest to a similar degree on c7gd instances?

@sfc-gh-jyin
Author

@avagin I have another question... Based on my understanding, PSR_SSBS_BIT relates to mitigating security vulnerabilities introduced by speculative execution. Can you share some information on why setting this flag improves performance in gVisor?

@jaingaurav

Further, adding to @sfc-gh-jyin's questions, is there a reason that c7 instances would not experience this slowdown? I believe c6 are Graviton2 (Neoverse N1) and c7 are Graviton3 (Neoverse V1).

@avagin
Collaborator

avagin commented Jun 4, 2024

@sfc-gh-jyin it isn't only about gVisor. When you run your test on Linux, this bit is set in pstate, and that is why you see better performance. If you care about security and want to be safe from SSB, you probably want to disable speculative store bypass via PR_SPEC_STORE_BYPASS, which effectively drops PSR_SSBS_BIT from pstate.

More info about the meaning of this bit can be found here: https://developer.arm.com/documentation/ddi0595/2020-12/AArch64-Registers/SSBS--Speculative-Store-Bypass-Safe.

As the last line of my previous comment says, the patch isn't a fix; it is just there to explain what is going on. We need to figure out when we can/need to set this bit. It should not be set by default, to protect against SSB.
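
For reference, a process can inspect its own SSB mitigation state via prctl(2); the sketch below uses the constants from <linux/prctl.h> and is only a diagnostic aid, not something gVisor or this patch provides:

import ctypes

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Constants from <linux/prctl.h>.
PR_GET_SPECULATION_CTRL = 52
PR_SPEC_STORE_BYPASS = 0
PR_SPEC_PRCTL = 1 << 0          # state can be controlled per task
PR_SPEC_ENABLE = 1 << 1         # speculation allowed (SSBS set, faster)
PR_SPEC_DISABLE = 1 << 2        # speculation disabled (mitigated)
PR_SPEC_FORCE_DISABLE = 1 << 3  # mitigated and cannot be re-enabled

state = libc.prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0)
if state < 0:
    raise OSError(ctypes.get_errno(), "prctl(PR_GET_SPECULATION_CTRL) failed")

print("raw state:", state)
print("speculation allowed (fast):", bool(state & PR_SPEC_ENABLE))
print("mitigated:", bool(state & (PR_SPEC_DISABLE | PR_SPEC_FORCE_DISABLE)))

Run natively and inside the sandbox, this shows whether the workload is getting the mitigated or the unmitigated behavior.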

@avagin
Collaborator

avagin commented Jun 4, 2024

@jaingaurav My guess is that they found another way to mitigate SSB in these CPUs.
