possible race conditions on intel MacOS #15730

Open
siddarthkay opened this issue Jul 23, 2024 · 0 comments
Labels
bug Something isn't working core-team E:Desktop Keycard Bug Bug found after initial keycard development wallet-team

Bug Report

Description

status-desktop often crashes during the onboarding stage on Intel macOS.
While trying to figure out why that happens in #15134,
I found that the issue was often fixed by doing nothing.
All signs point to a possible race condition in the thread pool setup, where a Nim service calls business logic in status-go or status-keycard-go.

I also discovered that, for the Nim interop with Go, we free memory by calling a Go function that is invoked through a Nim interface.
Exhibit A:
Let's take a look at how we consume KeycardInitFlow, which lives in vendor/status-keycard-go/shared/main.go

The source of KeycardInitFlow is:

//export KeycardInitFlow
func KeycardInitFlow(storageDir *C.char) *C.char {
	var err error
	globalFlow, err = skg.NewFlow(C.GoString(storageDir))

	return retErr(err)
}

KeycardInitFlow is wrapped by another wrapper in vendor/nim-keycard-go/keycard_go.nim, whose source looks like this:

import ./keycard_go/impl as go_shim

export KeycardSignalCallback

proc keycardInitFlow*(storageDir: string): string =
  var funcOut = go_shim.keycardInitFlow(storageDir.cstring)
  defer: go_shim.free(funcOut)
  return $funcOut

It is also important to look at the source of go_shim, which lives in vendor/nim-keycard-go/keycard_go/impl.nim:

# go functions do not raise nim exceptions and do not interact with the Nim gc
{.push raises: [], gcsafe.}

proc free*(param: pointer) {.importc: "Free".}

proc keycardInitFlow*(storageDir: cstring): cstring {.importc: "KeycardInitFlow".}

Finally, this code is consumed in service.nim like this:

  proc init*(self: Service) =
    if self.doLogging:
      debug "init keycard using ", pairingsJson=status_const.KEYCARDPAIRINGDATAFILE
    let initResp = keycard_go.keycardInitFlow(status_const.KEYCARDPAIRINGDATAFILE)
    if self.doLogging:
      debug "initialization response: ", initResp

service.nim lives in src/app_service/service/keycard/service.nim

I tried to create a minimal reproduction repo, but I was unable to reproduce the crash with it:
https://github.com/siddarthkay/status-desktop-intel-crash-reproducer
However, my attempt did not include a thread pool, and that could be the key to reproducing the race condition.

Another key factor in discovering this race condition was upgrading Go to 1.21.
Go 1.21 brought significant changes to its garbage collector, and the crash we see often points to code related to garbage collection.

Error message:

bad flushGen 0 in prepareForSweep; sweepgen 14
fatal error: bad flushGen

Reference in the Go source:
https://github.com/golang/go/blob/8f5c6904b616fd97dde4a0ba2f5c71114e588afd/src/runtime/mcache.go#L325

// prepareForSweep flushes c if the system has entered a new sweep phase
// since c was populated. This must happen between the sweep phase
// starting and the first allocation from c.
func (c *mcache) prepareForSweep() {
	// Alternatively, instead of making sure we do this on every P
	// between starting the world and allocating on that P, we
	// could leave allocate-black on, allow allocation to continue
	// as usual, use a ragged barrier at the beginning of sweep to
	// ensure all cached spans are swept, and then disable
	// allocate-black. However, with this approach it's difficult
	// to avoid spilling mark bits into the *next* GC cycle.
	sg := mheap_.sweepgen
	flushGen := c.flushGen.Load()
	if flushGen == sg {
		return
	} else if flushGen != sg-2 {
		println("bad flushGen", flushGen, "in prepareForSweep; sweepgen", sg)
		throw("bad flushGen")
	}
	c.releaseAll()
	stackcache_clear(c)
	c.flushGen.Store(mheap_.sweepgen) // Synchronizes with gcStart
}
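
To make the check easier to read against the error message, here is a small standalone model of the three branches in prepareForSweep. The type flushState and the function classifyFlushGen are hypothetical names for illustration and are not part of the Go runtime. In the runtime, sweepgen advances by 2 on every GC cycle, so a healthy mcache is either current (flushGen == sweepgen) or exactly one cycle behind (flushGen == sweepgen-2).

```go
package main

// flushState classifies an mcache's flushGen against the heap's sweepgen,
// mirroring the three branches of prepareForSweep.
type flushState int

const (
	upToDate    flushState = iota // flushGen == sweepgen: nothing to do
	needsFlush                    // flushGen == sweepgen-2: flush for the new cycle
	badFlushGen                   // anything else: the runtime throws "bad flushGen"
)

// classifyFlushGen reports which branch prepareForSweep would take.
func classifyFlushGen(flushGen, sweepgen uint32) flushState {
	switch flushGen {
	case sweepgen:
		return upToDate
	case sweepgen - 2:
		return needsFlush
	default:
		return badFlushGen
	}
}
```

With the values from the crash (flushGen 0, sweepgen 14), classifyFlushGen returns badFlushGen: the mcache is several sweep generations behind rather than at most one.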

At the moment, this issue is mitigated by the sleep introduced in PR #15194.
However, this is not a proper solution, and we may run into race conditions elsewhere in the future.

Steps to reproduce

  • Find and remove the sleep code, then start a fresh app on Intel macOS

Expected behaviour

  • must not crash

Actual behaviour

  • crash
@siddarthkay siddarthkay added the bug Something isn't working label Jul 23, 2024
@jrainville jrainville added this to the 2.31.0 Beta milestone Jul 23, 2024
@jrainville jrainville added E:Desktop Keycard Bug Bug found after initial keycard development backend-team labels Jul 23, 2024
@alaibe alaibe removed this from the 2.31.0 Beta milestone Aug 27, 2024
@iurimatias iurimatias added this to the 2.32.0 Beta milestone Oct 30, 2024
@jrainville jrainville modified the milestones: 2.32.0 Beta, 2.33.0 Beta Dec 4, 2024
@iurimatias iurimatias modified the milestones: 2.33.0 Beta, 2.34.0 Beta Jan 29, 2025