Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest/cdc/mixed-versions: use mixed version framework #108802

Merged
merged 1 commit into from
Aug 22, 2023

Conversation

jayshrivastava
Copy link
Contributor

@jayshrivastava jayshrivastava commented Aug 15, 2023

This change updates the cdc/mixed-versions roachtest to use the mixed version testing framework. This mixed version testing framework is better than the previous framework because it offers testing for multiple upgrades. It will also make it easier to maintain and expand this cdc-specific test.

Informs: #107451
Epic: None
Release note: None

@jayshrivastava jayshrivastava force-pushed the roachtest-refactor-cdc-mv branch from a5c752a to 3b0a059 Compare August 15, 2023 19:53
@jayshrivastava
Copy link
Contributor Author

I manually ran the test and verified it all works.

@jayshrivastava jayshrivastava marked this pull request as ready for review August 15, 2023 19:54
@jayshrivastava jayshrivastava requested a review from a team as a code owner August 15, 2023 19:54
@jayshrivastava jayshrivastava requested review from rachitgsrivastava and DarrylWong and removed request for a team August 15, 2023 19:54
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jayshrivastava jayshrivastava force-pushed the roachtest-refactor-cdc-mv branch from 3b0a059 to b583882 Compare August 16, 2023 20:40
pkg/cmd/roachtest/roachtestutil/mixedversion/helper.go Outdated Show resolved Hide resolved
// when using the validator (hence the "Unsafe" API). A mixed version state
// during validation is okay, but connections should not be able to be torn
// down while performing validations.
type validatorWithDBOverride struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on this struct:

  • Not sure I understand what the "DBOverride" part of the name is supposed to mean. Is there a DB being overriden?
  • Do we need to keep a reference to the fingerprint validator? The CountValidator already delegates its validation calls.
  • Do we need to name things unsafe*? It adds a certain unease to interacting with this API that I think is unnecessary. Documenting that the validator should only be used when nodes are not restarting (like you did) is, in my opinion, enough. By construction, this should hold because we're only using the validator InMixedVersion and AfterUpgradeFinalized. I don't think that having Unsafe on the name makes misuse less likely, especially because it's unclear what is unsafe about it.

pkg/cmd/roachtest/tests/mixed_version_cdc.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/tests/mixed_version_cdc.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/tests/mixed_version_cdc.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/tests/mixed_version_cdc.go Outdated Show resolved Hide resolved

// kafkaBufferMessageSize is the number of messages from kafka
// we allow to be buffered in memory before validating them.
kafkaBufferMessageSize = 8192
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this will lead to flakes. I remember that when I wrote the initial version of this mixed-version cdc test, we were seeing failures because CDC was emitting thousands of old events when nodes restarted (#89332). If that were to happen again, I can see the buffer overgrowing this limit even though there's no bug per se?

pkg/cmd/roachtest/tests/mixed_version_cdc.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/tests/mixed_version_cdc.go Outdated Show resolved Hide resolved
@renatolabs
Copy link
Contributor

Thanks for working on this! I think coverage of mixed-version for CDC will improve as we add more mixed-version scenarios to the framework (multiple upgrades are coming soon -- #108773) without having to change the test (too much).

Would be nice to add some randomness to many aspects of this test. Some ideas:

  • picking the resolved duration randomly across a set of options;
  • setting different changefeed options when creating it;
  • different parameters to the bank workload (number of rows, payload size, etc)
  • different workloads

Thoughts? Not for this PR, of course, just thought it could be a good time to make these changes.

@jayshrivastava jayshrivastava force-pushed the roachtest-refactor-cdc-mv branch from b583882 to 27f2475 Compare August 16, 2023 20:48
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Your PR is great. I'm glad that we will be able to provide a min version to start from.

I can write up a follow up issue to add these uses of rng. They look good!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, @rachitgsrivastava, and @renatolabs)


pkg/cmd/roachtest/roachtestutil/mixedversion/helper.go line 111 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I believe this implementation is based on the existing BackgroundCommand helper. I kind of regret having that helper and planned to remove it (there are currently no callers; it served a purpose at some point, but no longer).

It's nice that it logs the command before running it, but roachtest itself will start doing it soon [1] so, if anything, this will log commands twice which could be confusing.

Maybe a helper is not needed -- the cdcMixedVersionTester can easily keep a reference to the cluster and use that when running commands. What do you think?

[1] https://github.com/cockroachdb/cockroach/pull/104915/files#diff-d977965ecfd7f29eadbd742185581ca38940f7164721cd2dfdcc17602ea78f7cR2145

I removed it and added a reference to the cluster in cdcMixedVersionTester. 👍


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 170 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Do we really need to make this public in this PR? All of the current references could be removed if the methods on cdcMixedVersionTester implemented userFunc directly. That would even make them more readable, IMO.

Done.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 49 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I wonder if this will lead to flakes. I remember that when I wrote the initial version of this mixed-version cdc test, we were seeing failures because CDC was emitting thousands of old events when nodes restarted (#89332). If that were to happen again, I can see the buffer overgrowing this limit even though there's no bug per se?

This change significantly reduce duplicates during node restarts: #102717. I think this is okay. If it flakes, the error should be very obvious and we can fix it.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 130 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

is there a way to assert that no restarts are happening when calling this method (that is: is there a way to drop unsafe from the name and make sure that it is safe)?

Done. I added ExpectNoRestarts to assert this.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 133 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Some comments on this struct:

  • Not sure I understand what the "DBOverride" part of the name is supposed to mean. Is there a DB being overriden?
  • Do we need to keep a reference to the fingerprint validator? The CountValidator already delegates its validation calls.
  • Do we need to name things unsafe*? It adds a certain unease to interacting with this API that I think is unnecessary. Documenting that the validator should only be used when nodes are not restarting (like you did) is, in my opinion, enough. By construction, this should hold because we're only using the validator InMixedVersion and AfterUpgradeFinalized. I don't think that having Unsafe on the name makes misuse less likely, especially because it's unclear what is unsafe about it.

I completely got ride of the unsafe interface. I added the ExpectNoRestarts method on the Helper which is better IMO because it is an actual assertion - not enforced by naming/comments. Let me know if you think ExpectNoRestarts is overly defensive as well. Do you think it is sufficient to just leave a comment on mvt.InMixedVersion that says these hooks are run only when nodes are up? This makes the API clear.

I kept a reference to the fingerprint validator so we can call DBFunc on it.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 202 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

See comment about making these implement userFunc directly -- why not waitForResolvedTimestamps(ctx context.Context, l *logger.Logger, r *rand.Rand, h *mixedversion.Helper)?

Done. I should have done this to begin with :)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 238 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Function name is wrong.

Done.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 245 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This could be simplified by using h.Exec.

Done.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 312 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Don't you need to protect this from concurrent writes happening in runKafkaConsumer?

Yes we do. I completely missed that. I added a mutex.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 372 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

err != nil?

Done.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 393 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This is overly defensive, IMO. This function already assumes that many other things are in place (e.g., the cluster is started and ready to receive connections). The workload should also have been initialized by now, by construction.

Sounds good. I removed it.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 395 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

runWorkload() is already called as a test BackgroundFunc; no need to create a BackgroundCommand here.

Done.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 406 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This doesn't sound right. crdbNodes (and therefore RandomNodes) should already have taken 1-indexing into account.

Yeah you are right. I misread the RandomDB function.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 222 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

The previous test would connect to random nodes when it needed a database connection. I believe we should continue to do that here -- i.e., call DBFunc() on the validator.

Done.

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, thanks for addressing the comments!


I will, however, make a comment on the new ExpectNoRestarts API: I don't think we should add it.

Currently, there is a limited number of points during a run when a user function can be executed:

  1. before any upgrades happen;
  2. in-mixed version;
  3. after upgrade has finished;
  4. in the background throughout the entire test.

Except for 4, nodes should not restart while the user function is running. This is part of the contract of the framework's API (i.e., OnStartup, InMixedVersion, etc.). If restarts were allowed to happen concurrently with user functions, test code would need to be a lot more careful when running SQL statements.

With that said, the question arises: when is one expected to call ExpectNoRestarts? If they are called during steps 1 through 3, then we are back at "testing the framework". The contract specifies that restarts won't happen while user functions are running (and nodes dying should lead to immediate test failure). By providing an ExpectNoRestarts function, we're going against that idea, which, to me, would be pretty confusing for people consuming the API.

For possibility number 4 (background functions), the fact that they are running in the background throughout the entire test means, by definition, that they will need to be resilient to node restarts. Unless the framework were to provide a mechanism to stop restarts from happening, calling ExpectNoRestarts would make no sense, as that's outside of the background function's control. Any attempt to use it in that context would just lead to flakes.

All of this is to repeat a point I had made in my first pass in this PR: I believe the "unsafe-ness" of using the validator is something that is sufficiently handled with documentation; using the validator is no different from calling

db := h.RandomDB()
db.Exec(...)

in a InMixedVersion function. The db is "unsafe" because using it if nodes were allowed to restart would lead to errors; and yet, we don't need (or want) to be checking every db use for "safety".


Sorry for the long message. I'm just attempting to provide a rationale for my view and hopefully explain my thought process. If anything is not clear or you want to discuss this further, I'm happy to chat.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @jayshrivastava, @miretskiy, and @rachitgsrivastava)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 280 at r3 (raw file):

			_, resolved, err := cdctest.ParseJSONValueTimestamps(m.Value)
			if err != nil {
				return err

Nit: probably a good idea to errors.Wrap() these return err in this function with descriptions to make it easier to debug failures in the future.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 405 at r3 (raw file):

		Option("tolerate-errors")

	if err := cmvt.c.RunE(ctx, option.NodeListOption{h.RandomNode(r, cmvt.crdbNodes)}, bankRun.String()); err != nil {

I believe you want to run this on workloadNodes, not a random node (same for the workload init part of it).


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 440 at r3 (raw file):

	mvt.OnStartup("init workload", tester.initWorkload)

	_ = mvt.BackgroundFunc("run workload", tester.runWorkload)

Nit: mvt.BackgroundCommand would simplify this a bit; you'd just need to pass the command itself instead of writing the function yourself.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 445 at r3 (raw file):

	// NB: mvt.InMixedVersion will run these hooks multiple times at various points during the rolling upgrade, but
	// not when any nodes are offline. This is important because the validator relies on a db connection.
	mvt.InMixedVersion("wait for resolve timestamps", tester.waitForResolvedTimestamps)

Nit: resolved (same below)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 449 at r3 (raw file):

	mvt.AfterUpgradeFinalized("wait for resolve timestamps", tester.waitForResolvedTimestamps)
	mvt.AfterUpgradeFinalized("validate messages", tester.validate)

This will cause these two functions to run concurrently. Wouldn't it be better if validate was always called after waitForResolvedTimestamps?

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @jayshrivastava, @miretskiy, and @rachitgsrivastava)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 412 at r3 (raw file):

// initWorkload synchronously initializes the workload.
func (cmvt *cdcMixedVersionTester) initWorkload(

Optional: on initWorkload and runWorkload, you could generate a seed (using the rand.Rand parameter) and pass that to the workload (--seed command line argument). This will further help with reproducing failures when you set a fixed seed.

@jayshrivastava jayshrivastava force-pushed the roachtest-refactor-cdc-mv branch 3 times, most recently from 08fbadf to 80c113f Compare August 21, 2023 18:06
@jayshrivastava jayshrivastava requested a review from a team as a code owner August 21, 2023 18:06
@jayshrivastava jayshrivastava requested review from HonoreDB and removed request for a team August 21, 2023 18:06
@jayshrivastava jayshrivastava force-pushed the roachtest-refactor-cdc-mv branch from 80c113f to 369a231 Compare August 21, 2023 18:06
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. I removed the ExpectNoRestarts API! I agree with your view.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @HonoreDB, @miretskiy, @rachitgsrivastava, and @renatolabs)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 405 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I believe you want to run this on workloadNodes, not a random node (same for the workload init part of it).

Done.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 412 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Optional: on initWorkload and runWorkload, you could generate a seed (using the rand.Rand parameter) and pass that to the workload (--seed command line argument). This will further help with reproducing failures when you set a fixed seed.

Done.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 449 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This will cause these two functions to run concurrently. Wouldn't it be better if validate was always called after waitForResolvedTimestamps?

That makes sense. I updated them to be one hook called waitAndValidate.

@jayshrivastava jayshrivastava force-pushed the roachtest-refactor-cdc-mv branch from 369a231 to 6517700 Compare August 21, 2023 18:38
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @HonoreDB, @jayshrivastava, @miretskiy, and @rachitgsrivastava)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 203 at r4 (raw file):

			if !cmvt.timestampsResolved.latest.Less(resolvedTs) {
				return errors.Newf("expected resolved timestamp %s to be greater than previous "+
					" resolved timestamp %s", resolvedTs, cmvt.timestampsResolved.latest)

Isn't this covered by any of the validators? Why wouldn't the OrderValidator perform this check?


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 395 at r4 (raw file):

		// Since all rows are placed in a buffer, setting a low rate of 2 operations / sec
		// helps ensure that we don't exceed the buffer capacity.
		Flag("max-rate", 2).

This is quite unfortunate; did you see the buffer grow larger than the limit otherwise?


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 397 at r4 (raw file):

		Flag("max-rate", 2).
		Flag("seed", r.Seed).
		Arg("{pgurl%s}", cmvt.workloadNodes).

I was running this test on my gceworker and noticed: "why am I always seeing 1 resolved timestamps validated"? It's like we're not seeing updates from the workload. And this is the reason, heh. Workload is trying to connect to itself.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 406 at r4 (raw file):

) error {
	bankInit := roachtestutil.NewCommand("%s workload init bank", test.DefaultCockroachPath).
		Flag("seed", r.Seed).

r.Seed is a function (the command being generated will look something like --seed 0x587c4c0); the rand.Rand instance doesn't expose the underlying seed.

My suggestion was more like:

Flag("seed", rng.Int63())

pkg/cmd/roachtest/tests/mixed_version_cdc.go line 407 at r4 (raw file):

	bankInit := roachtestutil.NewCommand("%s workload init bank", test.DefaultCockroachPath).
		Flag("seed", r.Seed).
		Arg("{pgurl:1}")

Nit: {pgurl%s}, cmvt.crdbNodes


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 431 at r4 (raw file):

	mvt.OnStartup("init workload", tester.initWorkload)

	_ = mvt.BackgroundCommand("run workload", tester.workloadNodes, tester.runWorkloadCmd(mvt.RNG()))

Nit: instead of delaying the generation

This change updates the `cdc/mixed-versions` roachtest to use
the mixed version testing framework. This mixed version testing
framework is better than the previous framework because it offers
testing for multiple upgrades. It will also make it easier to
maintain and expand this cdc-specific test.

Informs: cockroachdb#107451
Epic: None
Release note: None
@jayshrivastava jayshrivastava force-pushed the roachtest-refactor-cdc-mv branch from 6517700 to d944ff8 Compare August 21, 2023 21:36
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @HonoreDB, @miretskiy, @rachitgsrivastava, and @renatolabs)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 203 at r4 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Isn't this covered by any of the validators? Why wouldn't the OrderValidator perform this check?

Surprisingly, it's not. It should be though. Tracked by #108872

The order validator is also used for cluster-cluster streaming and that allows regressions in the resolved timestamp. I'm not sure why.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 395 at r4 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This is quite unfortunate; did you see the buffer grow larger than the limit otherwise?

Yes. Actually I had this set to 5 and things were okay, but I think playing it safe with 2 ops / sec doesn't hurt.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 397 at r4 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I was running this test on my gceworker and noticed: "why am I always seeing 1 resolved timestamps validated"? It's like we're not seeing updates from the workload. And this is the reason, heh. Workload is trying to connect to itself.

Ah thanks for catching it. Looks like tolerate-errors was tolerating that error. I tried it locally and it looks okay now.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 406 at r4 (raw file):

Previously, renatolabs (Renato Costa) wrote…

r.Seed is a function (the command being generated will look something like --seed 0x587c4c0); the rand.Rand instance doesn't expose the underlying seed.

My suggestion was more like:

Flag("seed", rng.Int63())

Done.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 431 at r4 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: instead of delaying the generation

Looks like your comment got cut off, but I assume you mean

runWorkloadCmd := tester.runWorkloadCmd(mvt.RNG())  
_ = mvt.BackgroundCommand("run workload", tester.workloadNodes, runWorkloadCmd)

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

👏 👏 Thanks for working on this!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @HonoreDB, @jayshrivastava, @miretskiy, and @rachitgsrivastava)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 395 at r4 (raw file):

Previously, jayshrivastava (Jayant) wrote…

Yes. Actually I had this set to 5 and things were okay, but I think playing it safe with 2 ops / sec doesn't hurt.

Got it. I think this is enough to warrant a mechanism in the mixedversion framework to allow background functions to stop restarts while some functions is running. I'll make a TODO for myself here.


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 431 at r4 (raw file):

Previously, jayshrivastava (Jayant) wrote…

Looks like your comment got cut off, but I assume you mean

runWorkloadCmd := tester.runWorkloadCmd(mvt.RNG())  
_ = mvt.BackgroundCommand("run workload", tester.workloadNodes, runWorkloadCmd)

Heh yeah, I forgot to come back to this, but that's indeed where I was going.

@renatolabs renatolabs added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 22, 2023
@renatolabs
Copy link
Contributor

renatolabs commented Aug 22, 2023

Added backport label as it would be nice to have this on release-23.1 as well.

@jayshrivastava
Copy link
Contributor Author

TYFR!

@jayshrivastava
Copy link
Contributor Author

bors r=renatolabs

@craig
Copy link
Contributor

craig bot commented Aug 22, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants