-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest/cdc/mixed-versions: use mixed version framework #108802
roachtest/cdc/mixed-versions: use mixed version framework #108802
Conversation
a5c752a
to
3b0a059
Compare
I manually ran the test and verified it all works. |
3b0a059
to
b583882
Compare
// when using the validator (hence the "Unsafe" API). A mixed version state | ||
// during validation is okay, but connections should not be able to be torn | ||
// down while performing validations. | ||
type validatorWithDBOverride struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on this struct:
- Not sure I understand what the "DBOverride" part of the name is supposed to mean. Is there a DB being overriden?
- Do we need to keep a reference to the fingerprint validator? The
CountValidator
already delegates its validation calls. - Do we need to name things
unsafe*
? It adds a certain unease to interacting with this API that I think is unnecessary. Documenting that the validator should only be used when nodes are not restarting (like you did) is, in my opinion, enough. By construction, this should hold because we're only using the validatorInMixedVersion
andAfterUpgradeFinalized
. I don't think that havingUnsafe
on the name makes misuse less likely, especially because it's unclear what is unsafe about it.
|
||
// kafkaBufferMessageSize is the number of messages from kafka | ||
// we allow to be buffered in memory before validating them. | ||
kafkaBufferMessageSize = 8192 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this will lead to flakes. I remember that when I wrote the initial version of this mixed-version cdc test, we were seeing failures because CDC was emitting thousands of old events when nodes restarted (#89332). If that were to happen again, I can see the buffer overgrowing this limit even though there's no bug per se?
Thanks for working on this! I think coverage of mixed-version for CDC will improve as we add more mixed-version scenarios to the framework (multiple upgrades are coming soon -- #108773) without having to change the test (too much). Would be nice to add some randomness to many aspects of this test. Some ideas:
Thoughts? Not for this PR, of course, just thought it could be a good time to make these changes. |
b583882
to
27f2475
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Your PR is great. I'm glad that we will be able to provide a min version to start from.
I can write up a follow up issue to add these uses of rng. They look good!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, @rachitgsrivastava, and @renatolabs)
pkg/cmd/roachtest/roachtestutil/mixedversion/helper.go
line 111 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I believe this implementation is based on the existing
BackgroundCommand
helper. I kind of regret having that helper and planned to remove it (there are currently no callers; it served a purpose at some point, but no longer).It's nice that it logs the command before running it, but
roachtest
itself will start doing it soon [1] so, if anything, this will log commands twice which could be confusing.Maybe a helper is not needed -- the
cdcMixedVersionTester
can easily keep a reference to the cluster and use that when running commands. What do you think?
I removed it and added a reference to the cluster in cdcMixedVersionTester
. 👍
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 170 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Do we really need to make this public in this PR? All of the current references could be removed if the methods on
cdcMixedVersionTester
implementeduserFunc
directly. That would even make them more readable, IMO.
Done.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 49 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I wonder if this will lead to flakes. I remember that when I wrote the initial version of this mixed-version cdc test, we were seeing failures because CDC was emitting thousands of old events when nodes restarted (#89332). If that were to happen again, I can see the buffer overgrowing this limit even though there's no bug per se?
This change significantly reduce duplicates during node restarts: #102717. I think this is okay. If it flakes, the error should be very obvious and we can fix it.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 130 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
is there a way to assert that no restarts are happening when calling this method (that is: is there a way to drop unsafe from the name and make sure that it is safe)?
Done. I added ExpectNoRestarts
to assert this.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 133 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Some comments on this struct:
- Not sure I understand what the "DBOverride" part of the name is supposed to mean. Is there a DB being overriden?
- Do we need to keep a reference to the fingerprint validator? The
CountValidator
already delegates its validation calls.- Do we need to name things
unsafe*
? It adds a certain unease to interacting with this API that I think is unnecessary. Documenting that the validator should only be used when nodes are not restarting (like you did) is, in my opinion, enough. By construction, this should hold because we're only using the validatorInMixedVersion
andAfterUpgradeFinalized
. I don't think that havingUnsafe
on the name makes misuse less likely, especially because it's unclear what is unsafe about it.
I completely got ride of the unsafe interface. I added the ExpectNoRestarts
method on the Helper
which is better IMO because it is an actual assertion - not enforced by naming/comments. Let me know if you think ExpectNoRestarts
is overly defensive as well. Do you think it is sufficient to just leave a comment on mvt.InMixedVersion
that says these hooks are run only when nodes are up? This makes the API clear.
I kept a reference to the fingerprint validator so we can call DBFunc
on it.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 202 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
See comment about making these implement
userFunc
directly -- why notwaitForResolvedTimestamps(ctx context.Context, l *logger.Logger, r *rand.Rand, h *mixedversion.Helper)
?
Done. I should have done this to begin with :)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 238 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Function name is wrong.
Done.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 245 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This could be simplified by using
h.Exec
.
Done.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 312 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Don't you need to protect this from concurrent writes happening in
runKafkaConsumer
?
Yes we do. I completely missed that. I added a mutex.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 372 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
err != nil
?
Done.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 393 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This is overly defensive, IMO. This function already assumes that many other things are in place (e.g., the cluster is started and ready to receive connections). The workload should also have been initialized by now, by construction.
Sounds good. I removed it.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 395 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
runWorkload()
is already called as a testBackgroundFunc
; no need to create aBackgroundCommand
here.
Done.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 406 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This doesn't sound right.
crdbNodes
(and thereforeRandomNodes
) should already have taken 1-indexing into account.
Yeah you are right. I misread the RandomDB
function.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 222 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
The previous test would connect to random nodes when it needed a database connection. I believe we should continue to do that here -- i.e., call
DBFunc()
on the validator.
Done.
b36ef96
to
5f31ca9
Compare
5f31ca9
to
fc25ef0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great, thanks for addressing the comments!
I will, however, make a comment on the new ExpectNoRestarts
API: I don't think we should add it.
Currently, there is a limited number of points during a run when a user function can be executed:
- before any upgrades happen;
- in-mixed version;
- after upgrade has finished;
- in the background throughout the entire test.
Except for 4
, nodes should not restart while the user function is running. This is part of the contract of the framework's API (i.e., OnStartup
, InMixedVersion
, etc.). If restarts were allowed to happen concurrently with user functions, test code would need to be a lot more careful when running SQL statements.
With that said, the question arises: when is one expected to call ExpectNoRestarts
? If they are called during steps 1 through 3, then we are back at "testing the framework". The contract specifies that restarts won't happen while user functions are running (and nodes dying should lead to immediate test failure). By providing an ExpectNoRestarts
function, we're going against that idea, which, to me, would be pretty confusing for people consuming the API.
For possibility number 4 (background functions), the fact that they are running in the background throughout the entire test means, by definition, that they will need to be resilient to node restarts. Unless the framework were to provide a mechanism to stop restarts from happening, calling ExpectNoRestarts
would make no sense, as that's outside of the background function's control. Any attempt to use it in that context would just lead to flakes.
All of this is to repeat a point I had made in my first pass in this PR: I believe the "unsafe-ness" of using the validator is something that is sufficiently handled with documentation; using the validator is no different from calling
db := h.RandomDB()
db.Exec(...)
in a InMixedVersion
function. The db
is "unsafe" because using it if nodes were allowed to restart would lead to errors; and yet, we don't need (or want) to be checking every db
use for "safety".
Sorry for the long message. I'm just attempting to provide a rationale for my view and hopefully explain my thought process. If anything is not clear or you want to discuss this further, I'm happy to chat.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @jayshrivastava, @miretskiy, and @rachitgsrivastava)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 280 at r3 (raw file):
_, resolved, err := cdctest.ParseJSONValueTimestamps(m.Value) if err != nil { return err
Nit: probably a good idea to errors.Wrap()
these return err
in this function with descriptions to make it easier to debug failures in the future.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 405 at r3 (raw file):
Option("tolerate-errors") if err := cmvt.c.RunE(ctx, option.NodeListOption{h.RandomNode(r, cmvt.crdbNodes)}, bankRun.String()); err != nil {
I believe you want to run this on workloadNodes
, not a random node (same for the workload init
part of it).
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 440 at r3 (raw file):
mvt.OnStartup("init workload", tester.initWorkload) _ = mvt.BackgroundFunc("run workload", tester.runWorkload)
Nit: mvt.BackgroundCommand
would simplify this a bit; you'd just need to pass the command itself instead of writing the function yourself.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 445 at r3 (raw file):
// NB: mvt.InMixedVersion will run these hooks multiple times at various points during the rolling upgrade, but // not when any nodes are offline. This is important because the validator relies on a db connection. mvt.InMixedVersion("wait for resolve timestamps", tester.waitForResolvedTimestamps)
Nit: resolved
(same below)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 449 at r3 (raw file):
mvt.AfterUpgradeFinalized("wait for resolve timestamps", tester.waitForResolvedTimestamps) mvt.AfterUpgradeFinalized("validate messages", tester.validate)
This will cause these two functions to run concurrently. Wouldn't it be better if validate
was always called after waitForResolvedTimestamps
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @jayshrivastava, @miretskiy, and @rachitgsrivastava)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 412 at r3 (raw file):
// initWorkload synchronously initializes the workload. func (cmvt *cdcMixedVersionTester) initWorkload(
Optional: on initWorkload
and runWorkload
, you could generate a seed (using the rand.Rand
parameter) and pass that to the workload (--seed
command line argument). This will further help with reproducing failures when you set a fixed seed.
08fbadf
to
80c113f
Compare
80c113f
to
369a231
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. I removed the ExpectNoRestarts
API! I agree with your view.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @HonoreDB, @miretskiy, @rachitgsrivastava, and @renatolabs)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 405 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I believe you want to run this on
workloadNodes
, not a random node (same for theworkload init
part of it).
Done.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 412 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Optional: on
initWorkload
andrunWorkload
, you could generate a seed (using therand.Rand
parameter) and pass that to the workload (--seed
command line argument). This will further help with reproducing failures when you set a fixed seed.
Done.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 449 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This will cause these two functions to run concurrently. Wouldn't it be better if
validate
was always called afterwaitForResolvedTimestamps
?
That makes sense. I updated them to be one hook called waitAndValidate
.
369a231
to
6517700
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @HonoreDB, @jayshrivastava, @miretskiy, and @rachitgsrivastava)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 203 at r4 (raw file):
if !cmvt.timestampsResolved.latest.Less(resolvedTs) { return errors.Newf("expected resolved timestamp %s to be greater than previous "+ " resolved timestamp %s", resolvedTs, cmvt.timestampsResolved.latest)
Isn't this covered by any of the validators? Why wouldn't the OrderValidator
perform this check?
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 395 at r4 (raw file):
// Since all rows are placed in a buffer, setting a low rate of 2 operations / sec // helps ensure that we don't exceed the buffer capacity. Flag("max-rate", 2).
This is quite unfortunate; did you see the buffer grow larger than the limit otherwise?
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 397 at r4 (raw file):
Flag("max-rate", 2). Flag("seed", r.Seed). Arg("{pgurl%s}", cmvt.workloadNodes).
I was running this test on my gceworker and noticed: "why am I always seeing 1 resolved timestamps validated
"? It's like we're not seeing updates from the workload. And this is the reason, heh. Workload is trying to connect to itself.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 406 at r4 (raw file):
) error { bankInit := roachtestutil.NewCommand("%s workload init bank", test.DefaultCockroachPath). Flag("seed", r.Seed).
r.Seed
is a function (the command being generated will look something like --seed 0x587c4c0
); the rand.Rand
instance doesn't expose the underlying seed.
My suggestion was more like:
Flag("seed", rng.Int63())
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 407 at r4 (raw file):
bankInit := roachtestutil.NewCommand("%s workload init bank", test.DefaultCockroachPath). Flag("seed", r.Seed). Arg("{pgurl:1}")
Nit: {pgurl%s}, cmvt.crdbNodes
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 431 at r4 (raw file):
mvt.OnStartup("init workload", tester.initWorkload) _ = mvt.BackgroundCommand("run workload", tester.workloadNodes, tester.runWorkloadCmd(mvt.RNG()))
Nit: instead of delaying the generation
This change updates the `cdc/mixed-versions` roachtest to use the mixed version testing framework. This mixed version testing framework is better than the previous framework because it offers testing for multiple upgrades. It will also make it easier to maintain and expand this cdc-specific test. Informs: cockroachdb#107451 Epic: None Release note: None
6517700
to
d944ff8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @HonoreDB, @miretskiy, @rachitgsrivastava, and @renatolabs)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 203 at r4 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Isn't this covered by any of the validators? Why wouldn't the
OrderValidator
perform this check?
Surprisingly, it's not. It should be though. Tracked by #108872
The order validator is also used for cluster-cluster streaming and that allows regressions in the resolved timestamp. I'm not sure why.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 395 at r4 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This is quite unfortunate; did you see the buffer grow larger than the limit otherwise?
Yes. Actually I had this set to 5 and things were okay, but I think playing it safe with 2 ops / sec doesn't hurt.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 397 at r4 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I was running this test on my gceworker and noticed: "why am I always seeing
1 resolved timestamps validated
"? It's like we're not seeing updates from the workload. And this is the reason, heh. Workload is trying to connect to itself.
Ah thanks for catching it. Looks like tolerate-errors
was tolerating that error. I tried it locally and it looks okay now.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 406 at r4 (raw file):
Previously, renatolabs (Renato Costa) wrote…
r.Seed
is a function (the command being generated will look something like--seed 0x587c4c0
); therand.Rand
instance doesn't expose the underlying seed.My suggestion was more like:
Flag("seed", rng.Int63())
Done.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 431 at r4 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: instead of delaying the generation
Looks like your comment got cut off, but I assume you mean
runWorkloadCmd := tester.runWorkloadCmd(mvt.RNG())
_ = mvt.BackgroundCommand("run workload", tester.workloadNodes, runWorkloadCmd)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏 👏 Thanks for working on this!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @HonoreDB, @jayshrivastava, @miretskiy, and @rachitgsrivastava)
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 395 at r4 (raw file):
Previously, jayshrivastava (Jayant) wrote…
Yes. Actually I had this set to 5 and things were okay, but I think playing it safe with 2 ops / sec doesn't hurt.
Got it. I think this is enough to warrant a mechanism in the mixedversion framework to allow background functions to stop restarts while some functions is running. I'll make a TODO for myself here.
pkg/cmd/roachtest/tests/mixed_version_cdc.go
line 431 at r4 (raw file):
Previously, jayshrivastava (Jayant) wrote…
Looks like your comment got cut off, but I assume you mean
runWorkloadCmd := tester.runWorkloadCmd(mvt.RNG()) _ = mvt.BackgroundCommand("run workload", tester.workloadNodes, runWorkloadCmd)
Heh yeah, I forgot to come back to this, but that's indeed where I was going.
Added backport label as it would be nice to have this on release-23.1 as well. |
TYFR! |
bors r=renatolabs |
Build succeeded: |
This change updates the
cdc/mixed-versions
roachtest to use the mixed version testing framework. This mixed version testing framework is better than the previous framework because it offers testing for multiple upgrades. It will also make it easier to maintain and expand this cdc-specific test.Informs: #107451
Epic: None
Release note: None