Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: test timeouts / deadlocks on NetBSD after CL 232298 #42515

Closed
bcmills opened this issue Nov 11, 2020 · 26 comments
Closed

runtime: test timeouts / deadlocks on NetBSD after CL 232298 #42515

bcmills opened this issue Nov 11, 2020 · 26 comments
Assignees
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-NetBSD release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Nov 11, 2020

(Issue forked from #42237; see #42237 (comment).)

2020-11-11T06:26:05-f2e186b/netbsd-amd64-9_0
2020-11-10T18:42:47-8f2db14/netbsd-386-9_0
2020-11-10T18:33:37-b2ef159/netbsd-386-9_0
2020-11-10T18:33:37-b2ef159/netbsd-arm-bsiegert
2020-11-10T15:05:17-81322b9/netbsd-arm64-bsiegert
2020-11-09T21:03:36-d495712/netbsd-amd64-9_0
2020-11-09T20:02:56-5e18135/netbsd-386-9_0
2020-11-09T19:46:24-a2d0147/netbsd-386-9_0
2020-11-09T19:00:00-01cdd36/netbsd-386-9_0
2020-11-09T16:09:16-a444458/netbsd-386-9_0
2020-11-09T15:20:26-cb4df98/netbsd-386-9_0
2020-11-09T14:17:30-f858c22/netbsd-386-9_0
2020-11-07T16:59:55-5e371e0/netbsd-amd64-9_0
2020-11-07T16:31:02-2c80de7/netbsd-386-9_0
2020-11-07T03:19:27-33bc8ce/netbsd-386-9_0
2020-11-06T20:49:11-5736eb0/netbsd-amd64-9_0
2020-11-06T19:42:05-362d25f/netbsd-386-9_0
2020-11-06T15:33:23-d21af00/netbsd-386-9_0
2020-11-05T20:35:00-2822bae/netbsd-arm64-bsiegert
2020-11-05T16:46:56-04b5b4f/netbsd-386-9_0
2020-11-05T16:46:56-04b5b4f/netbsd-amd64-9_0
2020-11-05T15:16:57-34c0969/netbsd-386-9_0
2020-11-05T14:54:35-74ec40f/netbsd-386-9_0
2020-11-05T00:21:39-c018eec/netbsd-386-9_0
2020-11-04T21:45:25-fd841f6/netbsd-amd64-9_0
2020-11-04T16:54:48-594b4a3/netbsd-386-9_0
2020-11-04T15:53:19-5f0fca1/netbsd-386-9_0
2020-11-04T06:12:33-8eb846f/netbsd-amd64-9_0
2020-11-03T23:05:51-e1b305a/netbsd-386-9_0
2020-11-03T04:11:02-974def8/netbsd-386-9_0
2020-11-03T00:50:57-ebc1b8e/netbsd-amd64-9_0
2020-11-02T21:08:14-4fcb506/netbsd-386-9_0
2020-11-02T21:00:57-d1efaed/netbsd-arm64-bsiegert
2020-11-02T03:03:16-0387bed/netbsd-amd64-9_0
2020-11-01T13:23:48-0be8280/netbsd-arm64-bsiegert
2020-10-31T08:41:25-f14119b/netbsd-amd64-9_0
2020-10-31T00:35:18-79fb187/netbsd-arm64-bsiegert
2020-10-30T22:21:02-64a9a75/netbsd-arm-bsiegert
2020-10-30T21:14:09-f96b62b/netbsd-386-9_0
2020-10-30T21:14:09-f96b62b/netbsd-amd64-9_0
2020-10-30T21:02:17-420c68d/netbsd-386-9_0
2020-10-30T20:20:58-6abbfc1/netbsd-386-9_0
2020-10-30T18:06:13-6d087c8/netbsd-386-9_0
2020-10-30T18:05:53-36d412f/netbsd-amd64-9_0
2020-10-30T18:01:54-fb184a3/netbsd-amd64-9_0
2020-10-30T17:54:57-e02ab89/netbsd-amd64-9_0
2020-10-30T16:29:11-1af388f/netbsd-amd64-9_0
2020-10-30T16:20:05-2b9b272/netbsd-386-9_0
2020-10-30T16:20:05-2b9b272/netbsd-amd64-9_0
2020-10-30T00:23:50-faa4426/netbsd-386-9_0
2020-10-30T00:13:25-60f42ea/netbsd-386-9_0
2020-10-30T00:03:40-01efc9a/netbsd-386-9_0
2020-10-29T22:45:29-f588974/netbsd-amd64-9_0
2020-10-29T22:44:49-f43e012/netbsd-386-9_0
2020-10-29T21:46:54-fe70a3a/netbsd-386-9_0
2020-10-29T18:26:42-0b798c4/netbsd-amd64-9_0
2020-10-29T15:27:22-68e30af/netbsd-386-9_0
2020-10-29T15:13:09-50af50d/netbsd-amd64-9_0
2020-10-29T15:11:47-ecb79e8/netbsd-386-9_0
2020-10-29T13:53:33-c45d780/netbsd-amd64-9_0
2020-10-29T08:08:26-d9725f5/netbsd-386-9_0
2020-10-29T08:00:50-8b51798/netbsd-386-9_0
2020-10-29T03:23:51-15f01d6/netbsd-386-9_0
2020-10-29T01:50:09-308ec22/netbsd-386-9_0
2020-10-29T00:07:35-c1afbf6/netbsd-386-9_0
2020-10-28T17:54:13-fc116b6/netbsd-amd64-9_0
2020-10-28T17:10:08-642329f/netbsd-386-9_0
2020-10-28T17:08:06-e3c58bb/netbsd-amd64-9_0
2020-10-28T16:17:54-421d4e7/netbsd-386-9_0
2020-10-28T14:25:56-b85c2dd/netbsd-amd64-9_0
2020-10-28T13:25:44-72dec90/netbsd-386-9_0
2020-10-28T05:02:44-150d244/netbsd-386-9_0
2020-10-28T04:20:39-02335cf/netbsd-amd64-9_0
2020-10-28T01:03:23-368c401/netbsd-386-9_0
2020-10-27T23:12:41-5d3666e/netbsd-386-9_0
2020-10-27T22:13:30-091257d/netbsd-arm-bsiegert
2020-10-27T21:29:13-009d714/netbsd-amd64-9_0
2020-10-27T21:28:53-933721b/netbsd-amd64-9_0
2020-10-27T20:22:56-9113d8c/netbsd-amd64-9_0
2020-10-27T20:04:19-de2d1c3/netbsd-amd64-9_0
2020-10-27T20:03:41-5c1122b/netbsd-amd64-9_0
2020-10-27T20:03:12-79a3482/netbsd-386-9_0
2020-10-27T19:52:40-c515852/netbsd-amd64-9_0
2020-10-27T18:38:48-f0c9ae5/netbsd-amd64-9_0
2020-10-27T18:38:48-f0c9ae5/netbsd-arm64-bsiegert
2020-10-27T18:13:59-3f6b1a0/netbsd-amd64-9_0

@bcmills bcmills added OS-NetBSD NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Nov 11, 2020
@bcmills bcmills added this to the Go1.16 milestone Nov 11, 2020
@bcmills
Copy link
Contributor Author

bcmills commented Nov 11, 2020

2020-10-14T08:05:58-fc3a6f4/netbsd-amd64-9_0

has similar symptoms from before CL 232298, but it's the only such occurrence, and the failure rate is markedly higher since then.

@bcmills
Copy link
Contributor Author

bcmills commented Nov 11, 2020

@bsiegert
Copy link
Contributor

/cc @tklauser

This issue together with #42422 means that the NetBSD builders are much more flaky than we would like :(

@prattmic prattmic self-assigned this Nov 30, 2020
@prattmic
Copy link
Member

Still ongoing, as expected:

2020-11-30T17:45:06-c193279/netbsd-amd64-9_0
2020-11-27T20:31:33-91f77ca/netbsd-amd64-9_0
2020-11-27T20:31:33-91f77ca/netbsd-arm-bsiegert
2020-11-26T05:00:21-f0ff6d4/netbsd-amd64-9_0
2020-11-25T19:46:00-4481ad6/netbsd-amd64-9_0
2020-11-25T15:59:35-df68e01/netbsd-arm64-bsiegert
2020-11-25T02:51:30-1308f11/netbsd-386-9_0
2020-11-24T21:47:20-ba2adc2/netbsd-386-9_0
2020-11-24T07:07:06-6965b01/netbsd-386-9_0
2020-11-24T07:07:06-6965b01/netbsd-amd64-9_0
2020-11-24T03:06:15-7dc5d90/netbsd-386-9_0
2020-11-23T19:58:25-48a1a51/netbsd-amd64-9_0
2020-11-21T16:46:05-9ea6364/netbsd-386-9_0
2020-11-21T16:46:05-9ea6364/netbsd-arm64-bsiegert
2020-11-21T03:29:37-f93ef07/netbsd-386-9_0
2020-11-20T17:31:50-c306fd6/netbsd-386-9_0
2020-11-20T04:21:00-66c0264/netbsd-386-9_0
2020-11-20T02:27:53-0dcc7d6/netbsd-amd64-9_0
2020-11-20T00:09:05-7eed73f/netbsd-386-9_0
2020-11-19T20:37:38-59f5fda/netbsd-386-9_0
2020-11-19T20:37:38-59f5fda/netbsd-amd64-9_0
2020-11-19T19:30:38-e73697b/plan9-arm
2020-11-19T04:02:42-ff2824d/netbsd-386-9_0
2020-11-19T02:17:10-0bb6115/netbsd-386-9_0
2020-11-19T01:27:31-96b943a/netbsd-amd64-9_0
2020-11-18T22:12:57-5b0ec1a/netbsd-386-9_0
2020-11-18T20:12:03-b4f3d52/netbsd-386-9_0
2020-11-18T20:12:03-b4f3d52/netbsd-arm64-bsiegert
2020-11-18T19:09:36-ae76f6e/netbsd-amd64-9_0
2020-11-18T14:45:14-b194b51/netbsd-386-9_0
2020-11-18T12:04:06-399b5d1/netbsd-386-9_0
2020-11-18T04:43:50-bcfaeca/netbsd-386-9_0
2020-11-18T04:43:50-bcfaeca/netbsd-amd64-9_0
2020-11-17T18:28:55-01df2fe/netbsd-386-9_0
2020-11-17T15:10:45-0968d2d/netbsd-amd64-9_0
2020-11-17T13:20:20-0ae3b7c/netbsd-386-9_0
2020-11-16T22:24:14-869e295/netbsd-386-9_0
2020-11-16T18:39:54-38367d0/netbsd-386-9_0
2020-11-13T23:26:54-4f63e0a/netbsd-arm-bsiegert
2020-11-13T22:01:37-86954d5/netbsd-386-9_0
2020-11-12T22:50:40-30ba798/netbsd-386-9_0
2020-11-12T21:21:29-f016172/netbsd-amd64-9_0
2020-11-12T21:21:18-60b1253/netbsd-386-9_0
2020-11-12T14:17:24-e75aef8/netbsd-386-9_0
2020-11-12T14:17:24-e75aef8/netbsd-amd64-9_0
2020-11-12T13:40:55-4bc5f6f/netbsd-amd64-9_0
2020-11-11T16:22:54-2843754/netbsd-386-9_0
2020-11-11T16:22:54-2843754/netbsd-amd64-9_0
2020-11-11T06:26:05-f2e186b/netbsd-386-9_0

@prattmic
Copy link
Member

prattmic commented Dec 4, 2020

This seems to be reproducible on gomote (~20% of the time) with:

$ gomote push ${MOTE?}
$ gomote run ${MOTE?} ./go/src/make.bash
$ gomote run ${MOTE?} ./go/bin/go test -count=1 -short -timeout=2m -run=TestNoShrinkStackWhileParking -trace ./trace.out -o ./runtime.test runtime

@prattmic
Copy link
Member

prattmic commented Dec 4, 2020

This issue is reminding me a bit of hard-to-reproduce issues we've been tracking with worldsema for a while. That is, the test is getting work done, and then just ... stops:

Screenshot 2020-12-04 at 17 43 13

Two minutes later forcegchelper runs and lo and behold, the test starts making progress again (G30):

Screenshot 2020-12-04 at 17 44 25

Note necessarily related, but suspicious.

cc @aclements @mknyszek

@prattmic
Copy link
Member

prattmic commented Dec 7, 2020

Update for the day: I've narrowed down the issue to findrunnable finding a timer to wait on, deciding to netpollBreak to kick the poller... and then nothing. Either netpollBreak isn't waking the poller or there isn't actually a poller.

Hopefully, I'll be able to narrow this further tomorrow.

@prattmic
Copy link
Member

prattmic commented Dec 7, 2020

There is a poller, but it is not woken by netpollBreak. Here's the flow:

# netpoll with 180s delta (test timeout)
[0.753745874 P -1] M 4 P ? now 16465974762371 netpoll poll delay 179247420027

# Successfully write to netpoll break FD
[0.753788314 P -1] M 5 P ? netpollBreak: write =  1

# A different (non-blocking) netpoll receives the netpollBreak readiness, but doesn't clear netpollWakeSig
# because this is a non-blocking call: https://cs.opensource.google/go/go/+/master:src/runtime/netpoll_kqueue.go;l=149-156
[0.753803005 P 2] M 7 P ? now 16465974819894 netpoll poll delay 0
[0.753806724 P 2] M 7 P ? now 16465974823551 netpoll poll delay 0 duration 3657 n 1
[0.753808903 P 2] M 7 P ? now 16465974825645 netpoll received netpollBreak delay 0

# Later netpollBreak calls don't write to FD because netpollWakeSig = 1
[0.754030446 P 0] M 0 P ? netpollBreak wakeSig 1

# 180s later netpoll finally completes, does not receive the netpollBreak readiness (n = 1)
[180.253550882 P -1] M 4 P ? now 16645474571844 netpoll poll delay 179247420027 duration 179499809473 n 0

The 180s netpoll should have woken from netpollBreak but did not. I see two primary possibilities:

  1. The NetBSD kernel is treating this event as edge-triggered with exactly-once delivery. It delivered a notification to the non-blocking netpoll (which ignored it), so it didn't deliver to the blocking call as well.
  2. Delivery of this event is simply flaky in the kernel in general.

I suppose I'll take a look at the kernel source tomorrow...

@prattmic
Copy link
Member

prattmic commented Dec 7, 2020

Or (3) the event is edge-triggered and the kqueue call was slightly too late. Note that the netpollBreak write is very close in time to the netpoll call (which I'm logging before the call).

@prattmic
Copy link
Member

prattmic commented Dec 8, 2020

@bsiegert are you aware of any recent issues with the NetBSD kernel that may cause the behavior described in #42515 (comment)?

To add further details, all following calls to kevent after the netpollBreak are getting the break event returned, so possibility (1) doesn't seem right. (Note all these calls are non-blocking). It is just the blocking kevent call that occurred right around the same time as the pipe write that isn't getting an event returned.

The source all seems OK to me:

  1. kqueue lock is taken.
  2. If the event is already in the queue, then we should get it (though there is a lot of logic there).
  3. If the event isn't in the queue, wait on the condition variable.

This seems fine, but makes me wonder if there is some race where the condition variable isn't firing via a racing broadcast. The kqueue lock seems to prevent this, but who knows.

@prattmic
Copy link
Member

prattmic commented Dec 10, 2020

Looking more closely at the write and kevent, I can get time uncertainty bounds [1] like:

kevent: [2197268650624, 2376959698401]
write:  [2197268677873, 2197268710147]

The write bound start is after the kqueue bound start, so the race is close enough that we can't really tell which came first, which is a bit disappointing. Though recall that based on the kevent API contract it shouldn't matter which came first regardless. The kevent should return the event.

[1] That is, start time is some time before the syscall was made. end time is some time after the syscall returned. But we can't be sure exactly when the call occurred inbetween.

@prattmic
Copy link
Member

I've attempted to reproduce this issue on my own NetBSD VM, based on the gomote build script https://cs.opensource.google/go/x/build/+/master:env/netbsd-386/.

Unfortunately that script builds NetBSD 9.0_2019Q4, which seems to no longer exist and thus doesn't build. The closest I could try was 9.0_2020Q2 and 9.1. I've been unable to reproduce the problem on either of these versions.

It is possible that something has changed since 9.0_2019Q2 to fix the issue, or perhaps my environment is simply different enough somehow to avoid triggering this issue.

If it is the former, then perhaps the fix is just to update the builders (cc @golang/release), though I don't understand NetBSD's release policy well enough to know if 9.0_2019Q2 is a supported release that we want to keep working (cc @bsiegert).

@dmitshur dmitshur added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 10, 2020
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/277033 mentions this issue: WIP: debug #42515

@prattmic
Copy link
Member

Ah, after many hours of waiting I did manage to reproduce this on i386 9.0_2020Q2. Interestingly, the behavior is similar but not identical:

# Older kevent gets a netpollBreak.
[1.416594591 P -1] M 6 P ? now 17564795668567 netpoll exit poll delay 178584895297 real start 17564795248432 duration 420135 n 1
[1.416597333 P -1] M 6 P ? now 17564795672846 netpoll received netpollBreak delay 178584895297

# New write to netpollBreak (there is no poller at this precise moment (racing with above)).
[1.416607232 P 0] M 4 P ? netpollBreak: write =  1 start 17564795679978 end 17564795681250 duration 1272

# Non-blocking kevent receives break.
[1.435770113 P 1] M 0 P ? now 17564814845535 netpoll enter poll delay 0
[1.435775386 P 1] M 0 P ? now 17564814849491 netpoll exit poll delay 0 real start 17564814846159 duration 3332 n 1
[1.435777146 P 1] M 0 P ? now 17564814852322 netpoll received netpollBreak delay 0

# Blocking kevent (on the same M!) _doesn't_ receive netpoll break.
[1.436044372 P -1] M 0 P ? now 17564815119711 netpoll enter poll delay 178565017725
...
[180.401118534 P -1] M 0 P ? now 17743780191203 netpoll exit poll delay 178565017725 real start 17564815120557 duration 178965070646 n 0

This can't be a race in setup in kevent since the same thread called kevent twice and received the event the first time but not the second.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/277332 mentions this issue: netbsd: limit maximum netbsd kevent delay

@prattmic
Copy link
Member

I've pushed a workaround discussed with @aclements that could mitigate this issue, but it's pretty nasty, so I'd still rather find the actual cause.

@prattmic
Copy link
Member

I've finally managed to get an i386 VM running a custom NetBSD kernel that definitely fails at c98ec4120ecf0b9a29bf31c1b00d7896536b7d76 (original 9.0 commit) and seems to pass at f15eb9e6ad34c3315d354274fd26356e3ae79d84 (HEAD), so I'll try to do a bisect.

@prattmic
Copy link
Member

prattmic commented Dec 16, 2020

I found the bug!

First thing first: the bisect was a red herring. I didn't give HEAD enough time to reproduce the issue. It turns out this is still failing on trunk NetBSD.

The issue is a race in kevent(2) after all. The bad flow is:

  1. Lock kq on entry.
  2. Remove head entry: https://github.com/NetBSD/src/blob/b9fe321c6a0b02f7519a6649de7ceaf4e71af02c/sys/kern/kern_event.c#L1478
  3. Unlock kq, perform fs call, lock kq: https://github.com/NetBSD/src/blob/b9fe321c6a0b02f7519a6649de7ceaf4e71af02c/sys/kern/kern_event.c#L1488-L1495
  4. Add entry back to queue: https://github.com/NetBSD/src/blob/b9fe321c6a0b02f7519a6649de7ceaf4e71af02c/sys/kern/kern_event.c#L1553

Step 3 is the problem here: once kq is unlocked another kevent call can come along and miss this entry altogether. If that is a blocking call and there are no entries left, then the call will even block and only wake when a completely new event occurs or timeout (i.e., step 4 doesn't notify the condvar [1]).

To verify this theory, I've added tracepoints to kqueue_scan, which can be seen in prattmic/netbsd-src@2cd2ba9.

The interesting part of the trace is below. Note that the first two columns are PID and TID, respectively. The format at the end of the line is trace_name: <num bytes>, <bytes displayed big endian>. Commentary inline.

# Write to netpollBreak FD.
  4132   3656 runtime.test 1608155897.373396927 GIO   fd 5 wrote 1 bytes
       "\0"

# Irrelevant.
  4132   3656 runtime.test 1608155897.373427571 PSIG  SIGURG caught handler=0x80b9890 mask=(): code=SI_LWP sent by pid=4132, uid=0)

# Non-blocking kevent by 3675, which receives the netpollBreak event (retval = 1).
  4132   3675 runtime.test 1608155897.373500103 MISC  kq_scan_timeout: 4, ffffffff
  4132   3675 runtime.test 1608155897.373500735 MISC  kq_scan_lock: 1, 01
  4132   3675 runtime.test 1608155897.373501022 MISC  kq_scan_count: 4, 01000000
  4132   3675 runtime.test 1608155897.373501298 MISC  kq_scan_unlock: 1, 02
  4132   3675 runtime.test 1608155897.373501595 MISC  kq_scan_lock: 1, 02
  4132   3675 runtime.test 1608155897.373501895 MISC  kq_scan_unlock: 1, 03
  4132   3675 runtime.test 1608155897.373502336 MISC  kq_scan_lock: 1, 03
  4132   3675 runtime.test 1608155897.373502697 MISC  kq_scan_unlock: 1, 07
  4132   3675 runtime.test 1608155897.373503109 MISC  kq_scan_retval: 4, 01000000

# Racing non-blocking and blocking calls by 3656 and 3675, respectively.
  4132   3656 runtime.test 1608155897.392784451 MISC  kq_scan_timeout: 4, ffffffff
  4132   3675 runtime.test 1608155897.392785991 MISC  kq_scan_timeout: 4, 4e460000

# 3656 locks and sees event.
  4132   3656 runtime.test 1608155897.392786080 MISC  kq_scan_lock: 1, 01
  4132   3656 runtime.test 1608155897.392786876 MISC  kq_scan_count: 4, 01000000

# Lock cycle; won't cause this bug, but could make a blocking call return 0 without blocking.
  4132   3656 runtime.test 1608155897.392787560 MISC  kq_scan_unlock: 1, 02
  4132   3656 runtime.test 1608155897.392788778 MISC  kq_scan_lock: 1, 02

# Between lock 2 and unlock 3, the event is removed from the queue.
  4132   3656 runtime.test 1608155897.392789390 MISC  kq_scan_unlock: 1, 03

# 3675 manages to grab lock. Since the event is transiently removed, it sees no events
# and blocks on the condvar (which implicitly unlocks).
  4132   3675 runtime.test 1608155897.392790016 MISC  kq_scan_lock: 1, 01
  4132   3675 runtime.test 1608155897.392790407 MISC  kq_scan_count: 4, 00000000

# 3656 reacquires the lock, adds the event back to the queue, then unlocks and returns.
  4132   3656 runtime.test 1608155897.392791932 MISC  kq_scan_lock: 1, 03
  4132   3656 runtime.test 1608155897.392792717 MISC  kq_scan_unlock: 1, 07
  4132   3656 runtime.test 1608155897.392793325 MISC  kq_scan_retval: 4, 01000000

...

# condvar wait times out 180s later.
  4132   3675 runtime.test 1608156077.372309293 MISC  kq_scan_unlock: 1, 01
  4132   3675 runtime.test 1608156077.372311582 MISC  kq_scan_retval: 4, 00000000

I'll file a NetBSD bug once I figure out how. This bug does not seem to be new; as far as I can tell it dates back to the addition of kqueue locking (NetBSD/src@c743ad7). It is exposed by removal of sysmon as a backstop for overrun timers. Thus, I think http://golang.org/cl/277332 is still the best workaround [2].

It is unclear to me how difficult this is to fix, as it is not clear to me which, if any, constraints require unlocking the kq.

For completeness, I took a look at two other BSDs to see if they are affected:

OpenBSD: Has a Big Kernel Lock at syscall level, so no kevent locking: https://github.com/openbsd/src/blob/e20c779da119f998aba452410e439df1275cd7e8/sys/sys/syscall_mi.h#L92-L104
FreeBSD: Seems to use marker notes to notice that the queue is "in flux" and wait for the mutator to finish: https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_event.c#L1859-L1869

[1] N.B. putting a notification after step 4 would fix the problem for Go, but it would still be generally incorrect because calls that don't block could still return with missing events in violation of the API.
[2] The only other workaround I see is to add our own locking around kevent calls, but that seems much worse for scalability.

@toothrot toothrot removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 17, 2020
@prattmic
Copy link
Member

prattmic commented Dec 17, 2020

Minimal C reproducer: https://gist.github.com/prattmic/8b5bc6c87437bd4496d5f546fb3226fc

Pretty simple, just two threads concurrently polling the same kqueue. One of them usually misses an event within a few hundred iterations.

Don't run this under ktrace with my kernel changes or the machine will panic! (OOM I guess?)

@prattmic
Copy link
Member

Sent update to https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=50094, turns out this issue was reported in 2015!

Our strategy here should be to get http://golang.org/cl/277332 in and hopefully we can remove that one day when NetBSD is fixed.

@bsiegert
Copy link
Contributor

bsiegert commented Dec 17, 2020 via email

gopherbot pushed a commit that referenced this issue Dec 21, 2020
The netbsd kernel has a bug [1] that occassionally prevents netpoll from
waking with netpollBreak, which could result in missing timers for an
unbounded amount of time, as netpoll can't restart with a shorter delay
when an earlier timer is added.

Prior to CL 232298, sysmon could detect these overrun timers and
manually start an M to run them. With this fallback gone, the bug
actually prevents timer execution indefinitely.

As a workaround, we add back sysmon detection only for netbsd.

[1] https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=50094

Updates #42515

Change-Id: I8391f5b9dabef03dd1d94c50b3b4b3bd4f889e66
Reviewed-on: https://go-review.googlesource.com/c/go/+/277332
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
Trust: Michael Pratt <[email protected]>
@prattmic
Copy link
Member

With http://golang.org/cl/277332, the immediate problem is solved. Further cleanup is blocked by upstream fix to https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=50094, so I'll close this for for now.

@prattmic
Copy link
Member

Upstream change NetBSD/src@7fb7f43 should address this issue. I haven't had a chance to test it, but according to Jaromir it passes the C repro I created (which was more reliable than Go anyways).

@jdolecek-zz
Copy link

The fix in NetBSD will be pulled up to release branch once the change is stabilizied. Eventually the sysmon conditional can be removed for good and refer the go users to use the latest NetBSD release. I'll create a ticket for go once that it possible.

Removing the conditional would be highly preferable, as it's not at all good to be the only platform doing the timer kick from sysmon - not just due the extra latency only experienced on older NetBSD, but there might be some interference due to locking even with fixed kernel due to the timers being checked from sysmon.

@krytarowski
Copy link
Contributor

@prattmic great work on it!

netbsd-srcmastr pushed a commit to NetBSD/src that referenced this issue Feb 4, 2021
	sys/kern/kern_event.c	r1.110-1.115 (via patch)

fix a race in kqueue_scan() - when multiple threads check the same
kqueue, it could happen other thread seen empty kqueue while kevent
was being checked for re-firing and re-queued

make sure to keep retrying if there are outstanding kevents even
if no kevent is found on first pass through the queue, and only
kq_count when actually completely done with the kevent

PR kern/50094 by Christof Meerwal

Also fixes timer latency in Go, as reported in
golang/go#42515 by Michael Pratt
ryo pushed a commit to IIJ-NetBSD/netbsd-src that referenced this issue Feb 5, 2021
	sys/kern/kern_event.c	r1.110-1.115 (via patch)

fix a race in kqueue_scan() - when multiple threads check the same
kqueue, it could happen other thread seen empty kqueue while kevent
was being checked for re-firing and re-queued

make sure to keep retrying if there are outstanding kevents even
if no kevent is found on first pass through the queue, and only
kq_count when actually completely done with the kevent

PR kern/50094 by Christof Meerwal

Also fixes timer latency in Go, as reported in
golang/go#42515 by Michael Pratt
netbsd-srcmastr pushed a commit to NetBSD/src that referenced this issue May 1, 2021
	sys/kern/kern_event.c	r1.110-1.115 (via patch)

fix a race in kqueue_scan() - when multiple threads check the same
kqueue, it could happen other thread seen empty kqueue while kevent
was being checked for re-firing and re-queued

make sure to keep retrying if there are outstanding kevents even
if no kevent is found on first pass through the queue, and only
kq_count when actually completely done with the kevent

PR kern/50094 by Christof Meerwal

Also fixes timer latency in Go, as reported in
golang/go#42515 by Michael Pratt
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/324472 mentions this issue: runtime: skip sysmon workaround on NetBSD >= 9.2

gopherbot pushed a commit that referenced this issue Aug 16, 2021
Detect the NetBSD version in osinit and only enable the workaround for
the kernel bug identified in #42515 for NetBSD versions older than 9.2.

For #42515
For #46495

Change-Id: I808846c7f8e47e5f7cc0a2f869246f4bd90d8e22
Reviewed-on: https://go-review.googlesource.com/c/go/+/324472
Trust: Tobias Klauser <[email protected]>
Trust: Benny Siegert <[email protected]>
Run-TryBot: Tobias Klauser <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Go Bot <[email protected]>
netbsd-srcmastr pushed a commit to NetBSD/src that referenced this issue Nov 8, 2021
	sys/kern/kern_event.c	r1.110-1.115 (via patch)

fix a race in kqueue_scan() - when multiple threads check the same
kqueue, it could happen other thread seen empty kqueue while kevent
was being checked for re-firing and re-queued

make sure to keep retrying if there are outstanding kevents even
if no kevent is found on first pass through the queue, and only
kq_count when actually completely done with the kevent

PR kern/50094 by Christof Meerwal

Also fixes timer latency in Go, as reported in
golang/go#42515 by Michael Pratt
@golang golang locked and limited conversation to collaborators Jun 3, 2022
@prattmic prattmic self-assigned this Jun 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-NetBSD release-blocker
Projects
None yet
Development

No branches or pull requests

8 participants