Skip to content

Remove usage of reserved registers on arm64#47

Closed
StephenButtolph wants to merge 7 commits intocloudflare:masterfrom
StephenButtolph:fix-arm64-crash
Closed

Remove usage of reserved registers on arm64#47
StephenButtolph wants to merge 7 commits intocloudflare:masterfrom
StephenButtolph:fix-arm64-crash

Conversation

@StephenButtolph
Copy link

Resolves #46.

This removes all usage of R27 and R29 from gfpMul.

This fix introduces 3 additional MOVD instructions (1 for the removal of R29 and 2 for the removal of R27).

I couldn't figure out any less costly way of removing the usage of the registers (as all of the available registers are used).

I definitely don't love the style of the change (especially from the last commit where the additional arguments to mul were added)... So any suggestions there would be especially appreciated.

@StephenButtolph
Copy link
Author

I tried to make the commits individually reviewable/build up to the solution. So imo (fwiw) reviewing commit-by-commit is the easiest way to review this change.

@StephenButtolph
Copy link
Author

Probably should have included benchmarks:

Running locally on my laptop:

goos: darwin
goarch: arm64
pkg: github.com/cloudflare/bn256
cpu: Apple M2 Max
                    │ master.bench │       fix-arm64-crash.bench        │
                    │    sec/op    │   sec/op     vs base               │
G1-12                  72.84µ ± 4%   73.67µ ± 3%       ~ (p=0.912 n=10)
G2-12                  219.8µ ± 3%   224.4µ ± 4%       ~ (p=0.218 n=10)
GT-12                  553.0µ ± 4%   549.5µ ± 5%       ~ (p=0.853 n=10)
Pairing-12             696.9µ ± 0%   699.6µ ± 0%       ~ (p=0.089 n=10)
HashG1Size8bytes-12    25.20µ ± 1%   25.47µ ± 1%  +1.09% (p=0.029 n=10)
HashG1Size1k-12        19.55µ ± 1%   19.66µ ± 1%  +0.56% (p=0.015 n=10)
HashG1Size8k-12        27.90µ ± 1%   28.17µ ± 1%  +0.98% (p=0.000 n=10)
geomean                97.67µ        98.45µ       +0.80%

                    │ master.bench │         fix-arm64-crash.bench         │
                    │     B/op     │     B/op      vs base                 │
G1-12                   160.0 ± 0%     160.0 ± 0%       ~ (p=1.000 n=10) ¹
G2-12                   256.0 ± 0%     256.0 ± 0%       ~ (p=1.000 n=10) ¹
GT-12                   416.0 ± 0%     416.0 ± 0%       ~ (p=1.000 n=10) ¹
Pairing-12            38.25Ki ± 0%   38.25Ki ± 0%       ~ (p=1.000 n=10) ¹
HashG1Size8bytes-12   1.938Ki ± 0%   1.938Ki ± 0%       ~ (p=1.000 n=10) ¹
HashG1Size1k-12       1.907Ki ± 0%   1.907Ki ± 0%       ~ (p=1.000 n=10) ¹
HashG1Size8k-12       1.938Ki ± 0%   1.938Ki ± 0%       ~ (p=1.000 n=10) ¹
geomean               1.234Ki        1.234Ki       +0.00%
¹ all samples are equal

                    │ master.bench │        fix-arm64-crash.bench        │
                    │  allocs/op   │ allocs/op   vs base                 │
G1-12                   2.000 ± 0%   2.000 ± 0%       ~ (p=1.000 n=10) ¹
G2-12                   1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
GT-12                   2.000 ± 0%   2.000 ± 0%       ~ (p=1.000 n=10) ¹
Pairing-12              352.0 ± 0%   352.0 ± 0%       ~ (p=1.000 n=10) ¹
HashG1Size8bytes-12     29.00 ± 0%   29.00 ± 0%       ~ (p=1.000 n=10) ¹
HashG1Size1k-12         28.00 ± 0%   28.00 ± 0%       ~ (p=1.000 n=10) ¹
HashG1Size8k-12         29.00 ± 0%   29.00 ± 0%       ~ (p=1.000 n=10) ¹
geomean                 11.87        11.87       +0.00%
¹ all samples are equal

                    │ master.bench │        fix-arm64-crash.bench        │
                    │     B/s      │     B/s       vs base               │
HashG1Size8bytes-12   312.5Ki ± 0%   302.7Ki ± 3%       ~ (p=0.057 n=10)
HashG1Size1k-12       49.96Mi ± 1%   49.68Mi ± 1%  -0.55% (p=0.016 n=10)
HashG1Size8k-12       280.0Mi ± 1%   277.3Mi ± 1%  -0.97% (p=0.000 n=10)
geomean               16.22Mi        15.97Mi       -1.56%

As expected (tbh probably within random perturbation) the new code is slightly slower.

@StephenButtolph
Copy link
Author

I don't want to be a bother, but just want to make sure you (@armfazh) saw this PR.

If there is anything I can do to help out to make review easier let me know.

Comment on lines +3 to +6
#define RELOAD \
MOVD ·p2+0(SB), R5

#define mul(c0,c1,c2,c3,c4,c5,c6,c7,t0,reset) \
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might make this a bit more clear.

Suggested change
#define RELOAD \
MOVD ·p2+0(SB), R5
#define mul(c0,c1,c2,c3,c4,c5,c6,c7,t0,reset) \
#define ResetR5 \
MOVD ·p2+0(SB), R5
#define mul(c0,c1,c2,c3,c4,c5,c6,c7,maybeR5,maybeResetR5) \

UMULH R2, R6, c6 \
MUL R2, R7, R0 \
ADCS R0, R27 \
UMULH R2, R7, R29 \
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could continue to use R29 if we decided to just save the value onto the stack at the start of gfpMul and then we restore it prior to RET. That would mean that we wouldn't need the t0 and reset modifications to this macro (as we can replace R27 with R29 and only restore R6 in gfpReduce.

This would allocate 8 bytes onto the stack, but would keep the same number of additional instructions (3). The current modification has 3 MOVD instructions, this would have a push+pop+MOVD.

If that seems preferable then I can switch to that approach.

Comment on lines +138 to +144
lock.Lock()
// Make it more likely for goroutines to block on this mutex.
time.Sleep(time.Microsecond)

// If the frame pointer was corrupted, and another goroutine is
// blocked on this mutex, then this will segfault.
lock.Unlock()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this an arm specific test and include some asm code to explicitly return R27 and R29 so that we verify gfpMul doesn't modify them. I felt like that was a bit more work than I wanted to do... But this test is a bit contrived as it is right now (and it only verifies R29 as not being corrupted)

loadBlock(0(R0), R5,R6,R7,R8)

mul(R9,R10,R11,R12,R13,R14,R15,R16)
mul(R9,R10,R11,R12,R13,R14,R15,R16,R17,EMPTY)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: R17 is used later, but at this point it is uninitialized, so we can use it and not care about it's value being modified.

@armfazh
Copy link
Contributor

armfazh commented May 16, 2025

yes, thanks for this. overall it looks ok, but I have to do a deeper review.

@armfazh
Copy link
Contributor

armfazh commented May 19, 2025

Closing since I found a way in #48 to remove reserved registers without using stack based on your changes.
Also I could not reproduce the test, test passes even though the reserved registers are used.

@armfazh armfazh closed this May 19, 2025
@StephenButtolph
Copy link
Author

Also I could not reproduce the test, test passes even though the reserved registers are used.

Interesting... It fails every time on my computer... What is your GOENV? This is mine:

$ go env     
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/stephen/Library/Caches/go-build'
GOENV='/Users/stephen/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/stephen/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/stephen/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.4'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/stephen/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/stephen/go/src/github.com/cloudflare/bn256/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/5l/z00z8zbd0dbd6lf8djdgl_000000gn/T/go-build996554392=/tmp/go-build -gno-record-gcc-switches -fno-common'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ARM64 asm uses reserved registers

2 participants

Comments