Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Race: node catchpoint mode context R/W #3945

Closed
Eric-Warehime opened this issue May 2, 2022 · 3 comments
Closed

Data Race: node catchpoint mode context R/W #3945

Eric-Warehime opened this issue May 2, 2022 · 3 comments

Comments

@Eric-Warehime
Copy link
Contributor

Eric-Warehime commented May 2, 2022

Subject of the issue

During the process of upgrading golang to 1.17.9 in go-algorand (#3919 ) we're seeing the data race detector reporting potential issues getting/setting the node's context.

You can see the issue https://app.circleci.com/pipelines/github/algorand/go-algorand/6177/workflows/c4decd97-e28b-4651-b897-a01c9bf708b6/jobs/105158/parallel-runs/1 in the CI run.

WARNING: DATA RACE
        Write at 0x00c000225910 by goroutine 60:
          github.com/algorand/go-algorand/node.(*AlgorandFullNode).SetCatchpointCatchupMode.func1()
              github.com/algorand/go-algorand/node/node.go:1232 +0x324
        
        Previous read at 0x00c000225910 by goroutine 31:
          github.com/algorand/go-algorand/node.(*AlgorandFullNode).oldKeyDeletionThread()
              github.com/algorand/go-algorand/node/node.go:1078 +0xd1
          github.com/algorand/go-algorand/node.(*AlgorandFullNode).startMonitoringRoutines·dwrap·4()
              github.com/algorand/go-algorand/node/node.go:414 +0x39
        
        Goroutine 60 (running) created at:
          github.com/algorand/go-algorand/node.(*AlgorandFullNode).SetCatchpointCatchupMode()
              github.com/algorand/go-algorand/node/node.go:1206 +0xf8
          github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).updateNodeCatchupMode()
              github.com/algorand/go-algorand/catchup/catchpointService.go:695 +0x6f
          github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageInactive()
              github.com/algorand/go-algorand/catchup/catchpointService.go:260 +0x347
          github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run()
              github.com/algorand/go-algorand/catchup/catchpointService.go:198 +0x13c
          github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).Start·dwrap·1()
              github.com/algorand/go-algorand/catchup/catchpointService.go:156 +0x39
        
        Goroutine 31 (running) created at:
          github.com/algorand/go-algorand/node.(*AlgorandFullNode).startMonitoringRoutines()
              github.com/algorand/go-algorand/node/node.go:414 +0x178
          github.com/algorand/go-algorand/node.(*AlgorandFullNode).Start()
              github.com/algorand/go-algorand/node/node.go:399 +0x724
          github.com/algorand/go-algorand/daemon/algod.(*Server).Start()
              github.com/algorand/go-algorand/daemon/algod/server.go:200 +0x17a
          main.run()
              github.com/algorand/go-algorand/cmd/algod/main.go:369 +0x2f65
          main.main()
              github.com/algorand/go-algorand/cmd/algod/main.go:62 +0xa9

The write does acquire the node.mu lock, but the read (and potentially a few others as well) do not lock when reading the context.

There does seem to be some diff between race detectors in go 1.16-1.17 so it's possible that the updated race detector is the delta that's causing this now.

Repro

I can repro this using make integration on Mac M1 using go1.17.9 changes mentioned in the PR linked above.
The CI tests show this on amd64 however, so I expect it will show up on all archs.

@Eric-Warehime
Copy link
Contributor Author

Commit which updates this code as well as adds the test which is showing the data race d868c91

@Eric-Warehime
Copy link
Contributor Author

I'm able to repro the failure individually without running all integ tests. It looks like it's not in the expect test, but in the go tests. Relevant log from test output:

    testingLogger.go:38: time="2022-05-03T10:05:11.087892 -0700" level=info msg="done building!\n" file=catchpointCatchup_test.go function=github.com/algorand/go-algorand/test/e2e-go/features/catchup.TestBasicCatchpointCatchup line=171
    testingLogger.go:38: time="2022-05-03T10:05:11.093546 -0700" level=info msg="web proxy listens at 127.0.0.1:58881\n" file=catchpointCatchup_test.go function=github.com/algorand/go-algorand/test/e2e-go/features/catchup.TestBasicCatchpointCatchup line=186
algod(14955) : Telemetry configured from '/Users/eric/.algorand/logging.config'
algod(14955) : No REST API Token found. Generated token: 97e8d759137cb77e0ac0d0a02c087766fe2ae57ee4ef292ef75014b5e4492acf
algod(14955) : No Admin REST API Token found. Generated token: 6b0ca8e21a16ef0dcfe479a74650affeb15e556f62df84bccd4d073fc73a59f8
algod(14955) : Logging to:  /var/folders/c6/rd2c_3c52yz5t3qgwjs5yrz40000gp/T/tmp1533377312/TestBasicCatchpointCatchup/Node/node.log
algod(14955) : Deadlock detection is set to: disabled (Default state is 'disable')
algod(14955) : Initializing the Algorand node... 
algod(14955) : Success!
algod(14955) : ⇨ http server started on 127.0.0.1:58884
algod(14955) : Node running and accepting RPC requests over HTTP on port 127.0.0.1:58884. Press Ctrl-C to exit
    testingLogger.go:38: time="2022-05-03T10:05:11.605754 -0700" level=info msg="Second node catching up to round 1" file=catchpointCatchup_test.go function=github.com/algorand/go-algorand/test/e2e-go/features/catchup.TestBasicCatchpointCatchup line=202
    testingLogger.go:38: time="2022-05-03T10:05:19.663463 -0700" level=info msg=" - done catching up!\n" file=catchpointCatchup_test.go function=github.com/algorand/go-algorand/test/e2e-go/features/catchup.TestBasicCatchpointCatchup line=212
    testingLogger.go:38: time="2022-05-03T10:05:19.666077 -0700" level=info msg="primary node latest catchpoint - 36#XQN437LJ5AT7T74RKB2EPSELEKFKKAN5RSH2BWOVELDEQYGGFGMA!\n" file=catchpointCatchup_test.go function=github.com/algorand/go-algorand/test/e2e-go/features/catchup.TestBasicCatchpointCatchup line=217
    testingLogger.go:38: time="2022-05-03T10:05:19.705230 -0700" level=info msg="Second node catching up to round 36" file=catchpointCatchup_test.go function=github.com/algorand/go-algorand/test/e2e-go/features/catchup.TestBasicCatchpointCatchup line=222
algod(14955) : ==================
algod(14955) : WARNING: DATA RACE
algod(14955) : Write at 0x00c0001d0510 by goroutine 150:
algod(14955) :   github.com/algorand/go-algorand/node.(*AlgorandFullNode).SetCatchpointCatchupMode.func1()
algod(14955) :       github.com/algorand/go-algorand/node/node.go:1232 +0x274
algod(14955) : Previous read at 0x00c0001d0510 by goroutine 159:
algod(14955) :   github.com/algorand/go-algorand/node.(*AlgorandFullNode).oldKeyDeletionThread()
algod(14955) :       github.com/algorand/go-algorand/node/node.go:1078 +0xa4
algod(14955) : Goroutine 150 (running) created at:
algod(14955) :   github.com/algorand/go-algorand/node.(*AlgorandFullNode).SetCatchpointCatchupMode()
algod(14955) :       github.com/algorand/go-algorand/node/node.go:1206 +0x6c
algod(14955) :   github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).updateNodeCatchupMode()
algod(14955) :       github.com/algorand/go-algorand/catchup/catchpointService.go:695 +0x58
algod(14955) :   github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).processStageInactive()
algod(14955) :       github.com/algorand/go-algorand/catchup/catchpointService.go:260 +0x328
algod(14955) :   github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run()
algod(14955) :       github.com/algorand/go-algorand/catchup/catchpointService.go:198 +0xfc
algod(14955) : Goroutine 159 (running) created at:
algod(14955) :   github.com/algorand/go-algorand/node.(*AlgorandFullNode).startMonitoringRoutines()
algod(14955) :       github.com/algorand/go-algorand/node/node.go:414 +0xa0
algod(14955) :   github.com/algorand/go-algorand/node.(*AlgorandFullNode).Start()
algod(14955) :       github.com/algorand/go-algorand/node/node.go:399 +0x630
algod(14955) :   github.com/algorand/go-algorand/daemon/algod.(*Server).Start()
algod(14955) :       github.com/algorand/go-algorand/daemon/algod/server.go:200 +0x174
algod(14955) :   main.run()
algod(14955) :       github.com/algorand/go-algorand/cmd/algod/main.go:369 +0x2d60
algod(14955) :   main.main()
algod(14955) :       github.com/algorand/go-algorand/cmd/algod/main.go:62 +0xb8
algod(14955) : ==================
    testingLogger.go:38: time="2022-05-03T10:05:29.082393 -0700" level=info msg="done catching up!\n" file=catchpointCatchup_test.go function=github.com/algorand/go-algorand/test/e2e-go/features/catchup.TestBasicCatchpointCatchup line=231
algod(14955) : Exiting on terminated
algod(14955) : Found 1 data race(s)
algod(14915) : Exiting on terminated
    fixture.go:119: 
        	Error Trace:	catchpointCatchup_test.go:78
        	           				catchpointCatchup_test.go:235
        	Error:      	Received unexpected error:
        	           	exit status 66
        	Test:       	TestBasicCatchpointCatchup
        	Messages:   	Node at /var/folders/c6/rd2c_3c52yz5t3qgwjs5yrz40000gp/T/tmp1533377312/TestBasicCatchpointCatchup/Node has terminated with error code 66
--- FAIL: TestBasicCatchpointCatchup (240.30s)

@Eric-Warehime
Copy link
Contributor Author

Resolved in later version of #3920

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant