Skip to content

Fixing backup tests flakiness#12655

Merged
deepthi merged 3 commits intovitessio:mainfrom
planetscale:VTorcRemoveFromBackup_2
Mar 23, 2023
Merged

Fixing backup tests flakiness#12655
deepthi merged 3 commits intovitessio:mainfrom
planetscale:VTorcRemoveFromBackup_2

Conversation

@rsajwani
Copy link
Copy Markdown
Contributor

@rsajwani rsajwani commented Mar 18, 2023

Description

Backup Tests TestBackupMysqlctld/TestPrimaryBackup is flaky. Here is one such instance of failure

https://github.com/vitessio/vitess/actions/runs/4477008834/jobs/7868010386

The failure is coming from this piece of test code

// Perform PRS to demote the primary tablet (primary) so that we can do a restore there and verify we don't have the
	// data from after the older/first backup
	err = localCluster.VtctlclientProcess.ExecuteCommand("PlannedReparentShard", "--",
		"--keyspace_shard", shardKsName,
		"--new_primary", replica2.Alias)
	require.Nil(t, err)

	// Delete the current primary tablet (replica2) so that the original primary tablet (primary) can be restored from the
	// older/first backup w/o it replicating the subsequent insert done after the first backup was taken
	err = localCluster.VtctlclientProcess.ExecuteCommand("DeleteTablet", "--", "--allow_primary=true", replica2.Alias)
	require.Nil(t, err)
	err = replica2.VttabletProcess.TearDown()
	require.Nil(t, err)

	// Restore the older/first backup -- using the timestamp we saved -- on the original primary tablet (primary)
	err = localCluster.VtctlclientProcess.ExecuteCommand("RestoreFromBackup", "--", "--backup_timestamp", firstBackupTimestamp, primary.Alias)
	require.Nil(t, err)

The reason is we have introduced VTOrc in the test which tries to fix Primary when it is down. When we delete the tablet, VTorc tries to elect a new primary. If it elect replica1 as new primary then we are good but if it elects primary as new one then RestoreFromBackup fails.

Related Issue(s)

closes #12687

Solution

There were few ways to resolve it.

  1. We could have remove VTorc entirely from backup tests
  2. We disable VTorc in that specific test which is failing because of VTOrc intervention.
  3. We can choose the right replica at the point and use that for restore API.

I went with option#2, as it looked neat and doesn't result in a lot of changes in code.

Testing

Ran test few times locally and then on github action. It looked much stable now.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@rsajwani rsajwani changed the title remove unnecessary vtorc process from backup test [Do not remove]remove unnecessary vtorc process from backup test Mar 18, 2023
@rsajwani rsajwani changed the title [Do not remove]remove unnecessary vtorc process from backup test [Do not review]remove unnecessary vtorc process from backup test Mar 18, 2023
@rsajwani rsajwani self-assigned this Mar 20, 2023
@rsajwani rsajwani added Component: General Changes throughout the code base Type: Enhancement Logical improvement (somewhere between a bug and feature) Type: Bug and removed Type: Enhancement Logical improvement (somewhere between a bug and feature) labels Mar 20, 2023
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
// Args:
// tablet_type: 'replica' or 'rdonly'.
func vtctlBackup(t *testing.T, tabletType string) {
// Start vtorc before running backups
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was creating multiple instances of VTorc during testing. I don't think this was intentional. If we want to test multi instance VTorc then we should write a separate test for it. This is backup test and should try to test backup functionality.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this wasn't intentional.

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@rsajwani rsajwani marked this pull request as ready for review March 22, 2023 01:25
@rsajwani rsajwani requested a review from deepthi as a code owner March 22, 2023 01:25
@rsajwani rsajwani changed the title [Do not review]remove unnecessary vtorc process from backup test Fixing Backup Tests Flakiness Mar 22, 2023
@rsajwani rsajwani changed the title Fixing Backup Tests Flakiness Fixing backup tests flakiness Mar 22, 2023
@deepthi
Copy link
Copy Markdown
Collaborator

deepthi commented Mar 22, 2023

I don't understand what the test was trying to do with deleting replica2 right after making it the primary. Ideally we rewrite the test to make sense. However, it is valid to say that we don't want to run vtorc recoveries in backup/restore tests, so this PR is still good.

Copy link
Copy Markdown
Contributor

@GuptaManan100 GuptaManan100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@deepthi deepthi merged commit 3e9911b into vitessio:main Mar 23, 2023
@deepthi deepthi deleted the VTorcRemoveFromBackup_2 branch March 23, 2023 22:35
rsajwani added a commit to planetscale/vitess that referenced this pull request Mar 27, 2023
* remove unnecessary vtorc process

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>

* Adding vtorc disabling at right place

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>

* Adding comments

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>

---------

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
GuptaManan100 pushed a commit that referenced this pull request Mar 28, 2023
* remove unnecessary vtorc process



* Adding vtorc disabling at right place



* Adding comments



---------

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@hmaurer hmaurer mentioned this pull request Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: General Changes throughout the code base Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky Test "TestBackupMysqlctld/TestPrimaryBackup"

3 participants