roachtest: costfuzz/seed-multi-region failed #90683
Seems related to #90504. Seeing these in the node 1 kv-distribution log:
And the setup for the test has these lines:
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ d39558f51ded0411e90ef1d8aa05106930b36667:
Parameters: |
Most recent failure looks like the same issue. |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 0c1c3e7777b28a30ebe41428fb173f0156e8968c:
Parameters: |
ditto |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 0c1c3e7777b28a30ebe41428fb173f0156e8968c:
Parameters: |
Ditto. This one has |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 2d926e68000df659f282d4e4477329867b9a3323:
Parameters: |
Same issue. Removing release-blocker label. |
I looked at the last OOM. The memory profile shows 3 GB of GC heap, which is nominal. This is interesting:
So we're using 14.9 GB of memory, and 12.1 GB of that is anonymous huge pages. I don't know about the Go allocator, but in the past jemalloc didn't jibe well with THP (transparent huge pages). I wonder if we should try disabling it:
There's an option to disable it dynamically at runtime, and we could disable it at the OS level too. I'll try running it with THP off. How frequently is this test failing? It's a nightly, right? |
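For anyone wanting to confirm what THP mode a node is actually running with, here is a minimal Go sketch. It assumes a Linux host that exposes the standard sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, whose contents list the available modes with the active one in brackets (for example `always [madvise] never`). The helper name `activeTHPMode` is made up for illustration; this is not necessarily the command or code referred to in the comment above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// activeTHPMode is a hypothetical helper: it returns the bracketed
// (currently active) mode from the contents of the THP sysfs file,
// or "" if no bracketed field is present.
func activeTHPMode(contents string) string {
	for _, field := range strings.Fields(contents) {
		if strings.HasPrefix(field, "[") && strings.HasSuffix(field, "]") {
			return strings.Trim(field, "[]")
		}
	}
	return ""
}

func main() {
	// Standard Linux location; absent on kernels built without THP.
	const path = "/sys/kernel/mm/transparent_hugepage/enabled"
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("could not read THP setting:", err)
		return
	}
	fmt.Println("THP mode:", activeTHPMode(string(data)))
}
```

At the OS level, writing `never` to the same file as root turns THP off until the next reboot.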
See cockroachdb#90683 Release note: None
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ bec24eb9cfe2cc48c45772f1e88d387fe37944c9:
Parameters: |
Most recent failure is the same as the others. |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 9c9d55d707ad9a768027e9b7a3775c7c7cde8de7:
Parameters: |
This last one had a query here at the end:
Maybe costfuzz just produced a query that took 30 minutes to run. Oh, I see we have a 1m statement timeout set, so this must be a case where we are missing a cancel checker. |
The 11/2 failure (second to last) had a similar query running.
diskQueue Dequeue and Enqueue both have cancel checkers, so either they were dealing with humongous batches or there was some kind of disk stall. Given that they appear to be doing work and not reading from or writing to disk, I think they must be huge batches. @yuzefovich any thoughts? The fact that the failure on 11/1 was an OOM lends credence to the huge-batches theory. |
Hm, if there were a prolonged disk stall, we would crash the node (I think right now that happens if we detect a 20-second disk stall). Indeed, these goroutines show that we are doing work to serialize / deserialize possibly-large batches in order to write them to / read them from disk, but it's hard to say for sure whether this temp storage activity is the cause here. I wonder whether we should augment the signature of |
91561: opt: prevent null-rejection rule cycle r=DrewKimball a=DrewKimball
This patch adds checks to the null-rejection logic to ensure that null-rejection is not attempted when the necessary filter push-down rules are disabled. This prevents rule cycles that occur when decorrelation or filter elimination rules fire after the `col IS NOT NULL` filter isn't pushed all the way down to the outer join.
Fixes #89986
Release note: None

91588: kvserver: make logAppend and sideloading stand-alone r=pavelkalinnikov a=tbg
`*Replica` is a big bag of behavior and we should try to encapsulate the replication layer from it as much as possible. Methods anchored on `*Replica` are particularly problematic, since it's anyone's guess which of the many members they access. Instead, the pattern we should follow is that as many methods as possible are stand-alone and interact with the `*Replica` only ways obvious from the parameters. It was really easy to do this for `(*Replica).append` and for `(*Replica).maybeSideloadEntries`, and this commit does it.
Epic: CRDB-220
Release note: None

91631: kvserver: fix allocator determinism r=nvanbenschoten a=kvoli
Previously, the allocator when running in deterministic mode, would not be deterministic under all scenarios due to iteration over an unordered map. This patch fixes this by first sorting entries, before iterating.
resolves: #89394
Release note: None

91632: roachtest: gather cores if requested r=cucaroach a=cucaroach
- roachtest: allow tests to opt in to core gathering and compress cores
- roachtest: gather cores on multiregion tests that are timing out
Relates to #90683 CRDB-20887

Co-authored-by: Drew Kimball
Co-authored-by: Tobias Grieger
Co-authored-by: Austen McClernon
Co-authored-by: Tommy Reilly
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ d136c1fa50fa507c2232d7b215d98b88a83649c5:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ fb4014a31b9b8d8235dc48de52196e64b185f490:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ af18b5d2f8ec09d6f3c3092be9ef5fe3f460724d:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ d98d195dfa7a3480ce3e07657aa870692fe71cea:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ ea7c52e5a930be203318db6e338edd5dde7537bd:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 1a6e9f885baa124d5ff2996adb966ea15a1a9b2b:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 7823e2f75474417a44f0d34f76c5f8914cb91d52:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 9c21578450e395c83a1dc0df7090296fef06e006:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ d6f98e90684894fd36f53596e6aac355676d232e:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ dcdf599e9eaee8010137167f419509ccb7627406:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 4232883add85a151c423c45904ac4096d04656c5:
Parameters: |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 9c5375f6a7375724cdbcbaa0029ed97a230d7abe:
Parameters: |
This comment was marked as duplicate.
I hid all the failures above that are duplicates of #94520. |
@msirek are these context deadline exceeded issues related to the fixes you made recently? |
Yes. Closing as a dup of #92753 |
roachtest.costfuzz/seed-multi-region failed with artifacts on master @ 1b1c8da55be48c174b7b370b305f42622546209f:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0
Jira issue: CRDB-20887