runtime: build fails when run via QEMU for linux/amd64 running on linux/arm64 #69255

Open
myitcv opened this issue Sep 4, 2024 · 24 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. GoCommand cmd/go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@myitcv
Member

myitcv commented Sep 4, 2024

Go version

go version go1.23.0 linux/arm64

Output of go env in your module/workspace:

$ go env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/home/myitcv/.cache/go-build'
GOENV='/home/myitcv/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/myitcv/gostuff/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/myitcv/gostuff'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/myitcv/gos'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/home/myitcv/gos/pkg/tool/linux_arm64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='on'
GOTELEMETRYDIR='/home/myitcv/.config/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/myitcv/tmp/dockertests/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build810191502=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Given:

-- Dockerfile --
FROM golang:1.23.0

WORKDIR /app
COPY . ./

RUN go build -o asdf ./blah

-- blah/main.go --
package main

func main() {

}
-- go.mod --
module mod.example

go 1.23.0

Running:

docker buildx build --platform linux/amd64 .

What did you see happen?

[+] Building 0.8s (8/8) FINISHED                                                                                                                                 docker-container:container-builder
 => [internal] load build definition from Dockerfile                                                                                                                                           0.0s
 => => transferring dockerfile: 110B                                                                                                                                                           0.0s
 => [internal] load metadata for docker.io/library/golang:1.23.0                                                                                                                               0.4s
 => [internal] load .dockerignore                                                                                                                                                              0.0s
 => => transferring context: 2B                                                                                                                                                                0.0s
 => [internal] load build context                                                                                                                                                              0.0s
 => => transferring context: 271B                                                                                                                                                              0.0s
 => CACHED [1/4] FROM docker.io/library/golang:1.23.0@sha256:613a108a4a4b1dfb6923305db791a19d088f77632317cfc3446825c54fb862cd                                                                  0.0s
 => => resolve docker.io/library/golang:1.23.0@sha256:613a108a4a4b1dfb6923305db791a19d088f77632317cfc3446825c54fb862cd                                                                         0.0s
 => [2/4] WORKDIR /app                                                                                                                                                                         0.0s
 => [3/4] COPY . ./                                                                                                                                                                            0.0s
 => ERROR [4/4] RUN go build -o asdf ./blah                                                                                                                                                    0.3s
------
 > [4/4] RUN go build -o asdf ./blah:
0.268 runtime: lfstack.push invalid packing: node=0xffffa45142c0 cnt=0x1 packed=0xffffa45142c00001 -> node=0xffffffffa45142c0
0.268 fatal error: lfstack.push
0.270
0.270 runtime stack:
0.270 runtime.throw({0xaf644d?, 0x0?})
0.271   runtime/panic.go:1067 +0x48 fp=0xc000231f08 sp=0xc000231ed8 pc=0x471228
0.271 runtime.(*lfstack).push(0xffffa45040b8?, 0xc0005841c0?)
0.271   runtime/lfstack.go:29 +0x125 fp=0xc000231f48 sp=0xc000231f08 pc=0x40ef65
0.271 runtime.(*spanSetBlockAlloc).free(...)
0.271   runtime/mspanset.go:322
0.271 runtime.(*spanSet).reset(0xfe7680)
0.271   runtime/mspanset.go:264 +0x79 fp=0xc000231f78 sp=0xc000231f48 pc=0x433559
0.271 runtime.finishsweep_m()
0.272   runtime/mgcsweep.go:257 +0x8d fp=0xc000231fb8 sp=0xc000231f78 pc=0x4263ad
0.272 runtime.gcStart.func2()
0.272   runtime/mgc.go:702 +0xf fp=0xc000231fc8 sp=0xc000231fb8 pc=0x46996f
0.272 runtime.systemstack(0x0)
0.272   runtime/asm_amd64.s:514 +0x4a fp=0xc000231fd8 sp=0xc000231fc8 pc=0x4773ca
...

My setup: the host machine is linux/arm64 with QEMU installed, following the approach described at https://docs.docker.com/build/building/multi-platform/#qemu, to build for linux/amd64.

This has definitely worked in the past, which leads me to suggest that something other than Go has changed or been broken here. However, I note the virtually identical call stack reported in #54104, hence raising it here in the first instance.

What did you expect to see?

Successful run of docker build.

@dmitshur
Contributor

dmitshur commented Sep 4, 2024

Do you think this is similar or related to issue #68976? (It wasn't listed in the comment above, but it feels similar from a quick initial look.)

CC @prattmic, @matloob.

@dmitshur dmitshur added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. GoCommand cmd/go compiler/runtime Issues related to the Go compiler and/or runtime. labels Sep 4, 2024
@dmitshur dmitshur added this to the Backlog milestone Sep 4, 2024
@dmitshur dmitshur changed the title cmd/go: build fails when run via QEMU for linux/amd64 running on linux/arm64 cmd/go, runtime: build fails when run via QEMU for linux/amd64 running on linux/arm64 Sep 4, 2024
@myitcv
Member Author

myitcv commented Sep 4, 2024

Do you think this is similar or related to issue #68976?

I don't know, I'm afraid. That said, the stack trace and symptoms seem quite different. I will, however, defer to @prattmic.

@prattmic
Member

prattmic commented Sep 4, 2024

I agree, it looks quite different. #68976 is very specific to pidfd use in os/syscall. This looks like some form of corruption.

Do you know if this build is running a full Linux kernel in a VM, or using QEMU user mode Linux emulation?

@prattmic
Member

prattmic commented Sep 4, 2024

0.268 runtime: lfstack.push invalid packing: node=0xffffa45142c0 cnt=0x1 packed=0xffffa45142c00001 -> node=0xffffffffa45142c0

Notice

node=0xffffa45142c0       # before
node=0xffffffffa45142c0   # after

This seems like a sign extension issue when right shifting the packed value (See https://cs.opensource.google/go/go/+/master:src/runtime/lfstack.go;l=26-30, specifically lfstackUnpack).

I could imagine this being a code generation issue, or an issue in QEMU instruction emulation.
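
The pack/unpack round trip described above can be sketched as a small standalone program. This is a simplified model, not the runtime's code: the names `pack`, `unpackSigned` and `unpackUnsigned` are hypothetical, and the constants (48 address bits, 19 tag bits) are assumptions chosen because they reproduce the packed value in the log. It only illustrates how an arithmetic right shift turns the logged `node` into the sign-extended value, while a logical shift returns it unchanged.

```go
package main

import "fmt"

// Assumed layout for this sketch: the pointer is shifted left by 16 and the
// low 19 bits hold the counter (the pointer's 3 alignment bits are reused).
const (
	addrBits = 48
	tagBits  = 64 - addrBits + 3
)

// pack stores node in the upper bits and cnt in the low tagBits bits.
func pack(node, cnt uint64) uint64 {
	return node<<(64-addrBits) | cnt&(1<<tagBits-1)
}

// unpackSigned uses an arithmetic shift, so bit 63 of the packed value is
// copied into the upper bits of the result.
func unpackSigned(p uint64) uint64 {
	return uint64(int64(p) >> tagBits << 3)
}

// unpackUnsigned uses a logical shift, so the upper bits are zero-filled.
func unpackUnsigned(p uint64) uint64 {
	return p >> tagBits << 3
}

func main() {
	node := uint64(0xffffa45142c0) // address from the crash log
	packed := pack(node, 1)
	fmt.Printf("packed          = %#x\n", packed)
	fmt.Printf("signed unpack   = %#x\n", unpackSigned(packed))
	fmt.Printf("unsigned unpack = %#x\n", unpackUnsigned(packed))
}
```

Under these assumptions the signed unpack yields the "after" value from the log and the unsigned unpack yields the "before" value, which is consistent with the mismatch coming from bit 47 of the input address being set.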

cc @golang/compiler

@prattmic prattmic changed the title cmd/go, runtime: build fails when run via QEMU for linux/amd64 running on linux/arm64 runtime: build fails when run via QEMU for linux/amd64 running on linux/arm64 Sep 4, 2024
@prattmic
Member

prattmic commented Sep 4, 2024

Does the same issue occur on Go 1.22?

@myitcv
Member Author

myitcv commented Sep 4, 2024

Does the same issue occur on Go 1.22?

Yes. Indeed similar looking stacks for 1.21.13, 1.22.6, 1.23.0. Confirmed via:

cat <<'EOD' > template.txtar
-- Dockerfile --
FROM golang:$GOVERSION

WORKDIR /app
COPY . ./

RUN go build -o asdf ./blah

-- blah/main.go --
package main

func main() {

}
-- go.mod --
module mod.example

go $GOVERSION
EOD
for i in 1.23.0 1.22.6 1.21.13
do
        mkdir $i
        pushd $i > /dev/null
        cat ../template.txtar | GOVERSION=$i envsubst | txtar-x
        docker buildx build --platform linux/amd64 . > output 2>&1
        popd > /dev/null
done
cat */output

@myitcv
Member Author

myitcv commented Sep 4, 2024

I'm miles out of my depth here, but in case this is useful:

$ qemu-amd64-static --version
qemu-x86_64 version 9.0.2 (Debian 1:9.0.2+ds-2+b1)
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers

@myitcv
Member Author

myitcv commented Sep 4, 2024

... but just to be super clear, I'm doing this via Docker:

https://docs.docker.com/build/building/multi-platform/#qemu

(so I'm actually unsure whether the host system qemu is used or not)

@prattmic
Member

I will see if I can reproduce when I get a chance.

As a workaround, do you actually need to do linux-amd64 builds via QEMU emulation? Go cross-compiles well on its own, though perhaps you have cgo dependencies that make that difficult?

@mvdan
Member

mvdan commented Sep 11, 2024

We did end up with a two-stage Dockerfile where the builder is on the host platform, cross-compiles to the target platform without cgo, and then the second stage builds an image for the target platform. So while we are not blocked by this bug as there's a workaround, it's probably worth keeping it open for a fix.

@stsquad

stsquad commented Sep 13, 2024

We did some investigation for https://gitlab.com/qemu-project/qemu/-/issues/2560, and we suspect the fault comes down to aarch64 only having 47 or 39 bits of address space while the x86_64 GC assumes 48 bits. Under linux-user emulation we are limited by the host address space. However, I do note 48 was chosen for all arches, so I wonder how this works on native aarch64 builds of Go?

@prattmic
Member

Thanks for taking a look!

cc @mknyszek who can speak more definitively about the address space layout, but I don't think a smaller address space should be a problem. Go is pretty lenient about what it gets from mmap. I don't think we ever demand to be able to get a mapping with the 47th bit set.

If you haven't already seen it, take a look at #69255 (comment). My suspicion is that this is some sort of sign-extension bug given the only difference between the expected and actual output is the value of the upper bits.

@prattmic
Member

That said, on further thought, the input address 0xffffa45142c0 does look pretty weird. That isn't a typical heap address (the other addresses in the stack trace, e.g., sp=0xc000231ed8 do look like typical Go heap addresses), so I wonder how we got this one?

@cherrymui
Member

https://cs.opensource.google/go/go/+/master:src/runtime/malloc.go;l=149-210 this comment is about the heap address layout. We do use smaller address spaces on a few platforms, e.g. ios/arm64 is 40-bit, but the bits are set as constants so it would probably equally apply to native build and QEMU. (We could consider a qemu build tag?)

@prattmic
Member

Yes, we configure a larger heap address layout, but will anything break if the OS simply never returns addresses in the upper range? There isn't a case I can think of, provided our biggest mappings fit in the restricted address space. (Notice that amd64 configures 48-bit address space, even though Linux will only return addresses in the lower 47 bits)

In gVisor, we would restrict the Go runtime to a 39-bit region of address space without problem or modification to the Go runtime.
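
As a rough illustration of the "lower 47 bits" point, the sketch below checks the two addresses quoted earlier in this thread against that range. The helper name `fitsLower47` and the constant `userAddrBits` are hypothetical and exist only for this example; the 47-bit figure is taken from the comment above rather than from any runtime constant.

```go
package main

import "fmt"

// userAddrBits is the number of low address bits that, per the comment
// above, Linux uses for user-space mappings on amd64 (assumption for this
// sketch, not a value read from the runtime).
const userAddrBits = 47

// fitsLower47 reports whether addr has no bits set at or above bit 47.
func fitsLower47(addr uint64) bool {
	return addr>>userAddrBits == 0
}

func main() {
	addrs := []uint64{
		0xffffa45142c0, // node address from the crash log
		0xc000231ed8,   // sp value from the same stack trace
	}
	for _, a := range addrs {
		fmt.Printf("%#x fits in the lower 47 bits: %v\n", a, fitsLower47(a))
	}
}
```

The node address from the log has bit 47 set, whereas the stack pointer from the trace does not.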

@cherrymui
Member

I think nothing would break if the OS never returns high addresses. The heapAddrBits is an upper limit, I think.

@stsquad

stsquad commented Sep 13, 2024

Are there any runes for running the Go test cases? (Nothing jumped out at me.) If we can trigger the failure with a direct test case rather than deep in a Docker image, we can take a look at verifying the instruction behaviour.

@prattmic
Member

prattmic commented Sep 13, 2024

I have not personally reproduced, but in #69255 (comment) it is the compiler itself crashing, so theoretically it should reproduce by:

  1. Download a copy of Go and extract somewhere (which I'll call $EXTRACT_DIR): https://go.dev/dl/
  2. Create folder containing go.mod and main.go:

go.mod:

module example.com/app

go 1.23.1

main.go:

package main
func main() {}
  3. In the directory with go.mod/main.go, run $EXTRACT_DIR/bin/go build.

This will hopefully crash somewhere in the toolchain/compiler.

That said, go build does invoke multiple subprocesses, which I imagine could make debugging annoying. If you want literally just a single binary, you could try building a single test binary:

From outside QEMU (on any type of host), run GOOS=linux GOARCH=amd64 go test -c sort. This will build a sort.test linux-amd64 binary that contains the unit tests for the sort standard library package. I selected that package mostly arbitrarily: it is fairly complex so I hope it will trigger the bug and it has no dependency on external testdata files.

sort.test is a standalone, statically-linked binary, so you can copy it wherever and just run it. I do recommend passing ./sort.test -test.count=10 just to make it run long enough to run the GC.
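
If an even smaller binary than sort.test is wanted, a hand-written program that forces garbage collections might serve the same purpose. The sketch below is an untested assumption, not a confirmed reproducer: `forceGCs` and `sink` are hypothetical names, and the only basis for it is that the crashing stack above runs through runtime.gcStart. It could be cross-compiled with GOOS=linux GOARCH=amd64 go build and run under QEMU.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps allocations reachable so the compiler cannot elide them.
var sink [][]byte

// forceGCs allocates some heap memory and triggers n explicit GC cycles.
// It returns how many GC cycles completed while it ran.
func forceGCs(n int) uint32 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		sink = sink[:0]
		for j := 0; j < 1024; j++ {
			sink = append(sink, make([]byte, 4096))
		}
		runtime.GC() // blocks until the collection has finished
	}
	runtime.ReadMemStats(&after)
	return after.NumGC - before.NumGC
}

func main() {
	fmt.Println("completed GC cycles:", forceGCs(10))
}
```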

@zekth

zekth commented Dec 6, 2024

I stumbled upon this issue and found a solution (at least for my setup).
The host is an arm64 Ubuntu machine.

 docker:
    runs-on: ubuntu-latest-arm64-kong # our private arm64 runner instance
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          install: true

      - name: Build mailbox Container
        uses: docker/build-push-action@v6
        with:
          context: .
          file: cmd/Dockerfile
          push: true
          cache-from: type=gha
          cache-to: type=gha,mode=max
          platforms: linux/amd64,linux/arm64
          tags: foo

ARG BUILDPLATFORM
# this is really important
FROM --platform=$BUILDPLATFORM golang:1.23-bullseye AS build
ARG TARGETARCH

COPY . .

RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build \
    -o /build/my-binary ./cmd/main.go

So in an arm64 environment you want the build stage to pull the arm64 image by specifying --platform in the FROM statement. Without it, the build doesn't seem to work: it segfaults in some libs. I assume it "can" work, but as said, some libs may break.

Then when checking the build progress you'll notice those instructions:

[linux/arm64->amd64 build 7/7] RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build
[linux/arm64 build 7/7] RUN CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build

hope this helps

@mwyvr

mwyvr commented Dec 15, 2024

Possibly related, Go is failing to compile any non-trivial application on a Vultr virtual machine running FreeBSD as a guest on FreeBSD 14.1 and 14.2-RELEASE, tested on 1.21 and latest, 1.23.4.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283314

On real hardware, no issues compiling or running; however when I move the binary to the VM, unpredictable panics happen and eventually a seg fault in the application (mox, a full stack mail server).

I stumbled across an old discussion that raised using GODEBUG=asyncpreemptoff=1 and this does seem to have a positive effect on compilation; I'm running mox compiled with this option and so far so good but it is unclear to me what the overall impact of this is.

@cherrymui
Member

I stumbled across an old discussion that raised using GODEBUG=asyncpreemptoff=1 and this does seem to have a positive effect on compilation

This usually indicates that the virtual machine (or the OS running on it) has some bug in handling asynchronous signals. You could probably test it with a C program that sends itself a lot of asynchronous signals. (See also #46272, and some test programs linked from it.) Are you also running an AMD64 VM instance on an ARM64 machine?
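
A very rough Go analogue of such a signal test is sketched below. It is not the C program mentioned above and has not been validated against any affected VM. The idea is that goroutines running call-free floating-point loops are candidates for asynchronous preemption, so a repeated deterministic computation that suddenly gives a different answer would hint at corrupted register state. The names `spin` and `countMismatches` are hypothetical, and whether the loops are actually preempted by signal on a given system is an assumption.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// spin runs a deterministic floating-point loop with no function calls in
// the loop body (math.Sqrt is normally compiled to a single instruction).
//
//go:noinline
func spin(iters int) float64 {
	x := 1.0
	for i := 0; i < iters; i++ {
		x = math.Sqrt(x*x + 1.0)
	}
	return x
}

// countMismatches repeats spin in several goroutines and counts how often
// the result differs from a reference value computed up front.
func countMismatches(workers, rounds, iters int) int {
	want := spin(iters)
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		bad int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := 0; r < rounds; r++ {
				if got := spin(iters); got != want {
					mu.Lock()
					bad++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return bad
}

func main() {
	fmt.Println("mismatches:", countMismatches(8, 50, 5_000_000))
}
```

On a healthy system the count should be zero. Comparing a run with and without GODEBUG=asyncpreemptoff=1 may help show whether signal delivery is involved.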

@mwyvr

mwyvr commented Dec 17, 2024

The problem VM is an AMD64 VM instance on what appears to be AMD64; the provider is Vultr.com; the actual hw is said to be Xeon CPUs. Reported by the VM:

❯ sysctl hw
hw.machine: amd64
hw.model: Intel Core Processor (Skylake, IBRS)

From #46272 I ran @kostikbel's avx_sig.c code from this comment on the problematic VM; it reliably SIGABRTs on every single run, more or less instantly.

The code runs without apparent issue (10 minutes each before I interrupted) on:

  • on a different VM host provider using kvm/qemu; guest is 14.2-RELEASE (hw.model: Intel Core Processor (Skylake, IBRS))
  • real hardware running 14.2-RELEASE (hw.model: Intel(R) Core(TM) i9-14900K)
  • on a Bhyve VM running 14.2-RELEASE, on ^ real hardware host

I first noted unusual behaviour on FreeBSD 14.1 on the VM in question with random panics that didn't make sense from a Go mail server (SMTP, IMAP etc) that I migrated in November to FreeBSD from Linux on that very same VM instance. There were no panics on Linux.

cc @emaste @kostikbel from the runtime: possible memory corruption on FreeBSD issue.

@prattmic
Member

It sounds like you have more-or-less narrowed this down to a VMM bug on Vultr's side, likely related to save/restore of FPU state. If you have not already, you should definitely take this up with them.


9 participants