storage: TestStoreRangeMergeRHSLeaseExpiration failed under stress #33656

Closed
cockroach-teamcity opened this issue Jan 11, 2019 · 8 comments · Fixed by #34934

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/f5e3c29b2eed92868cf3d449575283e2e383f199

Parameters:

TAGS=
GOFLAGS=

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestStoreRangeMergeRHSLeaseExpiration PKG=github.com/cockroachdb/cockroach/pkg/storage TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1089030&tab=buildLog

I190111 06:33:02.098226 31113 gossip/gossip.go:391  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:35633" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 
I190111 06:33:02.098400 31400 gossip/client.go:128  [n2] started gossip client to 127.0.0.1:33709
I190111 06:33:02.100682 31113 storage/replica_command.go:349  [s1,r1/1:/M{in-ax}] initiating a split of this range at key "b" [r2]
I190111 06:33:02.111477 31113 storage/store_snapshot.go:762  [s1,r1/1:{/Min-b}] sending preemptive snapshot 80d1dc9a at applied index 18
I190111 06:33:02.111651 31113 storage/store_snapshot.go:805  [s1,r1/1:{/Min-b}] streamed snapshot to (n2,s2):?: kv pairs: 20, log entries: 8, rate-limit: 2.0 MiB/sec, 0.00s
I190111 06:33:02.114715 31624 storage/replica_raftstorage.go:809  [s2,r1/?:{-}] applying preemptive snapshot at index 18 (id=80d1dc9a, encoded size=2633, 1 rocksdb batches, 8 log entries)
I190111 06:33:02.115565 31624 storage/replica_raftstorage.go:815  [s2,r1/?:{/Min-b}] applied preemptive snapshot in 1ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I190111 06:33:02.115940 31113 storage/replica_command.go:899  [s1,r1/1:{/Min-b}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r1:{/Min-b} [(n1,s1):1, next=2, gen=1]
I190111 06:33:02.117197 31113 storage/replica.go:3570  [s1,r1/1:{/Min-b},txn=26724b44] proposing ADD_REPLICA((n2,s2):2): updated=[(n1,s1):1 (n2,s2):2] next=3
I190111 06:33:02.256402 31113 storage/store_snapshot.go:762  [s1,r2/1:{b-/Max}] sending preemptive snapshot edf651ca at applied index 12
I190111 06:33:02.256648 31113 storage/store_snapshot.go:805  [s1,r2/1:{b-/Max}] streamed snapshot to (n2,s2):?: kv pairs: 42, log entries: 2, rate-limit: 2.0 MiB/sec, 0.00s
I190111 06:33:02.256973 31488 storage/replica_raftstorage.go:809  [s2,r2/?:{-}] applying preemptive snapshot at index 12 (id=edf651ca, encoded size=7524, 1 rocksdb batches, 2 log entries)
I190111 06:33:02.257376 31488 storage/replica_raftstorage.go:815  [s2,r2/?:{b-/Max}] applied preemptive snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I190111 06:33:02.257852 31113 storage/replica_command.go:899  [s1,r2/1:{b-/Max}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r2:{b-/Max} [(n1,s1):1, next=2, gen=0]
I190111 06:33:02.260382 31113 storage/replica.go:3570  [s1,r2/1:{b-/Max},txn=fc1bdb8f] proposing ADD_REPLICA((n2,s2):2): updated=[(n1,s1):1 (n2,s2):2] next=3
I190111 06:33:02.316116 31643 storage/replica_command.go:492  [s1,r1/1:{/Min-b}] initiating a merge of r2:{b-/Max} [(n1,s1):1, (n2,s2):2, next=3, gen=0] into this range
I190111 06:33:02.321623 31113 storage/client_test.go:1317  test clock advanced to: 1.800000125,0
I190111 06:33:02.334305 31498 storage/client_merge_test.go:1314  starting get 0
I190111 06:33:02.335355 31683 storage/client_merge_test.go:1314  starting get 1
I190111 06:33:02.336418 31499 storage/node_liveness.go:727  [s1,r2/1:{b-/Max}] incremented n2 liveness epoch to 2
I190111 06:33:02.336505 31684 storage/client_merge_test.go:1314  starting get 2
I190111 06:33:02.339798 31628 storage/client_merge_test.go:1314  starting get 3
I190111 06:33:02.340920 31629 storage/client_merge_test.go:1314  starting get 4
I190111 06:33:02.342232 31500 storage/client_merge_test.go:1314  starting get 5
I190111 06:33:02.343334 31501 storage/client_merge_test.go:1314  starting get 6
I190111 06:33:02.344437 31502 storage/client_merge_test.go:1314  starting get 7
I190111 06:33:02.345549 31672 storage/client_merge_test.go:1314  starting get 8
I190111 06:33:02.346654 31685 storage/client_merge_test.go:1314  starting get 9
I190111 06:33:02.413038 31512 storage/store.go:2658  [s2,r1/2:{/Min-b},txn=e1b0affd] removing replica r2/2
W190111 06:33:02.414080 31608 storage/raft_transport.go:583  while processing outgoing Raft queue to node 2: rpc error: code = Unavailable desc = transport is closing:
--- FAIL: TestStoreRangeMergeRHSLeaseExpiration (0.33s)
	client_merge_test.go:1343: expected RangeNotFound error from get during merge, but got <nil>

cockroach-teamcity added this to the 2.2 milestone Jan 11, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 11, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f5e3c29b2eed92868cf3d449575283e2e383f199

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestStoreRangeMergeRHSLeaseExpiration PKG=github.com/cockroachdb/cockroach/pkg/storage TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1089307&tab=buildLog

W190111 06:54:46.663808 31397 gossip/gossip.go:1499  [n2] no incoming or outgoing connections
I190111 06:54:46.665088 31397 gossip/gossip.go:391  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:38653" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 
I190111 06:54:46.666659 31605 gossip/client.go:128  [n2] started gossip client to 127.0.0.1:43003
I190111 06:54:46.746517 31397 storage/replica_command.go:349  [s1,r1/1:/M{in-ax}] initiating a split of this range at key "b" [r2]
I190111 06:54:46.764508 31397 storage/store_snapshot.go:762  [s1,r1/1:{/Min-b}] sending preemptive snapshot 309dc54f at applied index 18
I190111 06:54:46.765249 31397 storage/store_snapshot.go:805  [s1,r1/1:{/Min-b}] streamed snapshot to (n2,s2):?: kv pairs: 20, log entries: 8, rate-limit: 2.0 MiB/sec, 0.00s
I190111 06:54:46.766718 31597 storage/replica_raftstorage.go:809  [s2,r1/?:{-}] applying preemptive snapshot at index 18 (id=309dc54f, encoded size=2635, 1 rocksdb batches, 8 log entries)
I190111 06:54:46.769284 31597 storage/replica_raftstorage.go:815  [s2,r1/?:{/Min-b}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=2ms commit=0ms]
I190111 06:54:46.771263 31397 storage/replica_command.go:899  [s1,r1/1:{/Min-b}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r1:{/Min-b} [(n1,s1):1, next=2, gen=1]
I190111 06:54:46.777702 31397 storage/replica.go:3570  [s1,r1/1:{/Min-b},txn=3c6c04f4] proposing ADD_REPLICA((n2,s2):2): updated=[(n1,s1):1 (n2,s2):2] next=3
I190111 06:54:46.930239 31397 storage/store_snapshot.go:762  [s1,r2/1:{b-/Max}] sending preemptive snapshot de3b8c52 at applied index 12
I190111 06:54:46.931059 31397 storage/store_snapshot.go:805  [s1,r2/1:{b-/Max}] streamed snapshot to (n2,s2):?: kv pairs: 42, log entries: 2, rate-limit: 2.0 MiB/sec, 0.00s
I190111 06:54:46.931988 31609 storage/replica_raftstorage.go:809  [s2,r2/?:{-}] applying preemptive snapshot at index 12 (id=de3b8c52, encoded size=7524, 1 rocksdb batches, 2 log entries)
I190111 06:54:46.933283 31609 storage/replica_raftstorage.go:815  [s2,r2/?:{b-/Max}] applied preemptive snapshot in 1ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I190111 06:54:46.935151 31397 storage/replica_command.go:899  [s1,r2/1:{b-/Max}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r2:{b-/Max} [(n1,s1):1, next=2, gen=0]
I190111 06:54:46.944317 31397 storage/replica.go:3570  [s1,r2/1:{b-/Max},txn=0e7b624c] proposing ADD_REPLICA((n2,s2):2): updated=[(n1,s1):1 (n2,s2):2] next=3
I190111 06:54:47.053372 31490 storage/replica_command.go:492  [s1,r1/1:{/Min-b}] initiating a merge of r2:{b-/Max} [(n1,s1):1, (n2,s2):2, next=3, gen=0] into this range
I190111 06:54:47.080847 31397 storage/client_test.go:1317  test clock advanced to: 1.800000125,0
I190111 06:54:47.831231 31614 storage/client_merge_test.go:1314  starting get 0
I190111 06:54:47.832423 31686 storage/client_merge_test.go:1314  starting get 1
I190111 06:54:47.833677 31687 storage/client_merge_test.go:1314  starting get 2
I190111 06:54:47.834912 31668 storage/client_merge_test.go:1314  starting get 3
I190111 06:54:47.836169 31616 storage/client_merge_test.go:1314  starting get 4
I190111 06:54:47.837467 31277 storage/client_merge_test.go:1314  starting get 5
I190111 06:54:47.838695 31617 storage/client_merge_test.go:1314  starting get 6
I190111 06:54:47.839816 31615 storage/node_liveness.go:727  [s1,r2/1:{b-/Max}] incremented n2 liveness epoch to 2
I190111 06:54:47.839928 31688 storage/client_merge_test.go:1314  starting get 7
I190111 06:54:47.841202 31646 storage/client_merge_test.go:1314  starting get 8
I190111 06:54:47.842427 31669 storage/client_merge_test.go:1314  starting get 9
I190111 06:54:47.971211 31560 storage/store.go:2658  [s2,r1/2:{/Min-b},txn=b9f7b751] removing replica r2/2
I190111 06:54:47.971282 31453 storage/store.go:2658  [s1,r1/1:{/Min-b}] removing replica r2/1
--- FAIL: TestStoreRangeMergeRHSLeaseExpiration (1.41s)
	client_merge_test.go:1343: expected RangeNotFound error from get during merge, but got <nil>

@tbg
Member

tbg commented Jan 11, 2019

Repros immediately on #33566, doesn't repro on the previous commit.

tbg assigned tbg and unassigned benesch Jan 11, 2019
@tbg
Member

tbg commented Jan 11, 2019

Doesn't repro with master and this diff:

diff --git a/pkg/kv/txn_interceptor_heartbeat.go b/pkg/kv/txn_interceptor_heartbeat.go
index 2b133cc3a1..15ca50c299 100644
--- a/pkg/kv/txn_interceptor_heartbeat.go
+++ b/pkg/kv/txn_interceptor_heartbeat.go
@@ -178,7 +178,7 @@ func (h *txnHeartbeat) SendLocked(
                        ba.Txn.Key = anchor
                }

-               if !h.st.Version.IsActive(cluster.VersionLazyTxnRecord) {
+               if true || !h.st.Version.IsActive(cluster.VersionLazyTxnRecord) {
                        addedBeginTxn = true

                        // Set the key in the begin transaction request to the txn's anchor key.

so my assumption would be that the merge txn writing its txn record lazily somehow causes this problem.

@nvanbenschoten, could you take a look?

tbg assigned nvanbenschoten and unassigned tbg Jan 11, 2019
@nvanbenschoten
Member

nvanbenschoten commented Jan 11, 2019

I've tracked this down to a flake in the test, not an incompatibility between merges and omitted BeginTxns. The test installs a TestingRequestFilter and waits for the first EndTransaction request it can find with a merge trigger. It assumes that this request will never result in a retriable error during evaluation. Whenever the test flakes, it's because the EndTransaction hits a transaction aborted error and retries. Essentially, what we want is to replace the TestingRequestFilter with a TestingProposalFilter so we can wait for the first merge EndTransaction request that succeeds. Unfortunately, it's not that easy, because we want the filter to block, but TestingProposalFilter is called with the replica mutex held.
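
For illustration, here's a minimal sketch of the fragile synchronization, not the test's literal code: it assumes the 2019-era knob signature (TestingRequestFilter on StoreTestingKnobs taking a roachpb.BatchRequest and returning a *roachpb.Error) and the nil-safe protobuf getter GetMergeTrigger; storeCfg and mergeEndTxnReceived are placeholder names.

// Fires on the *first* merge EndTransaction the filter sees, even if
// that attempt later hits a retriable error and the txn retries --
// which is exactly the flake described above.
mergeEndTxnReceived := make(chan struct{}, 1)
storeCfg.TestingKnobs.TestingRequestFilter = func(ba roachpb.BatchRequest) *roachpb.Error {
	for _, union := range ba.Requests {
		if et, ok := union.GetInner().(*roachpb.EndTransactionRequest); ok &&
			et.InternalCommitTrigger.GetMergeTrigger() != nil {
			select {
			case mergeEndTxnReceived <- struct{}{}: // signal at most once
			default:
			}
		}
	}
	return nil // observe only; never reject the batch
}

The robust variant would signal only once a merge EndTransaction has actually succeeded, but making the observer block at proposal time would deadlock under the replica mutex, which is why the eventual fix (#34934) addressed the root problem instead of relying on BeginTxn requests to appease the test.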

@nvanbenschoten
Member

@tbg I'm running out of time to hack on this. I understand the problem well enough now to not be worried. It's a bug in the test, not in the code, so I think I'd feel comfortable skipping the test until I can resolve it. What do you think?

@tbg
Copy link
Member

tbg commented Jan 11, 2019 via email

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 11, 2019
Informs cockroachdb#33656.

I'm not aware of any places where this is necessary for correctness,
but it is useful to prevent test flakes until tests are updated to
expect lazy transaction record creation.

Release note: None
@nvanbenschoten
Member

@tbg I just opened #33674, which I confirmed deflakes this. I still plan on actually fixing the test when I can, but I figured that provides the easiest escape hatch in case any more of these flakes pop up in the next two weeks.

@tbg
Member

tbg commented Jan 11, 2019

👍 already reviewed.

craig bot pushed a commit that referenced this issue Jan 11, 2019
33674: kv: provide option to request eager txn record creation r=nvanbenschoten a=nvanbenschoten

Informs #33656.

I'm not aware of any places where this is necessary for correctness, but it is useful to prevent test flakes until tests are updated to expect lazy transaction record creation.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 14, 2019
Closes cockroachdb#33656.

This fixes the root problem instead of sending BeginTxn requests to
appease the test.

This is an important change to make now because if we left merges
using BeginTxn requests in 19.1 then we wouldn't be able to remove
BeginTxn support in 19.2.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 19, 2019
Closes cockroachdb#33656.

This fixes the root problem instead of sending BeginTxn requests to
appease the test.

This is an important change to make now because if we left merges
using BeginTxn requests in 19.1 then we wouldn't be able to remove
BeginTxn support in 19.2.

Release note: None
craig bot pushed a commit that referenced this issue Feb 19, 2019
34934: storage: stop sending BeginTxn requests in merge transactions, remove EagerRecord r=nvanbenschoten a=nvanbenschoten

Closes #33656.

This fixes the root problem instead of sending BeginTxn requests to appease the test.

This is an important change to make now because if we left merges using BeginTxn requests in 19.1 then we wouldn't be able to remove BeginTxn support in 19.2.

The PR then removes the EagerRecord transaction option entirely.

35019: stats: temporarily disable automatic stats r=rytaft a=rytaft

This commit temporarily disables automatic statistics. We will
re-enable automatic stats once it's confirmed that #34928 fixes
the performance issues.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
craig bot closed this as completed in #34934 Feb 19, 2019