Merged
42 commits
0aaf7d8
`vtorc`: add support for dynamic enable/disable of ERS by keyspace/shard
timvaillancourt Apr 8, 2025
a83ed22
Merge remote-tracking branch 'origin/main' into vtorc-ks-topo-config
timvaillancourt Apr 8, 2025
0ffdf40
move new structs to new proto package
timvaillancourt Jun 17, 2025
16c91e6
Merge remote-tracking branch 'origin/main' into vtorc-ks-topo-config
timvaillancourt Jun 17, 2025
568ccc0
simplify changelog
timvaillancourt Jun 17, 2025
8bef5c7
simplify changelog again
timvaillancourt Jun 17, 2025
d94bb21
update shard conditional
timvaillancourt Jun 17, 2025
52a2245
update shard conditional, again
timvaillancourt Jun 17, 2025
ffbad5e
one more field rename
timvaillancourt Jun 17, 2025
2cdba44
update proto comment
timvaillancourt Jun 17, 2025
0d55c73
missing rename
timvaillancourt Jun 17, 2025
96b0d3e
update `TestAPI`
timvaillancourt Jun 17, 2025
601ab1f
Merge branch 'main' into vtorc-ks-topo-config
timvaillancourt Jun 30, 2025
3841960
gofmt
timvaillancourt Jun 30, 2025
890aeac
Merge remote-tracking branch 'origin/main' into vtorc-ks-topo-config
timvaillancourt Aug 29, 2025
cbbbd3b
revert .CreateShard signature change
timvaillancourt Aug 29, 2025
50d8fe4
fix test failures
timvaillancourt Aug 29, 2025
f4612f6
fix test
timvaillancourt Aug 29, 2025
b75b70b
update protos
timvaillancourt Aug 29, 2025
58e7df3
PR suggestion: rename field
timvaillancourt Aug 29, 2025
5a98ae0
lint
timvaillancourt Aug 29, 2025
5134083
test/changelog fix
timvaillancourt Aug 29, 2025
7c6fc22
add e2e test
timvaillancourt Aug 29, 2025
efc0bab
rm metric check
timvaillancourt Aug 29, 2025
9ec8abf
cleanup
timvaillancourt Aug 29, 2025
624bb80
e2e fix
timvaillancourt Aug 29, 2025
06dbe21
improve test
timvaillancourt Aug 29, 2025
6591a49
use assert
timvaillancourt Aug 29, 2025
c9e639b
cleanup test
timvaillancourt Aug 29, 2025
0e67548
update docs
timvaillancourt Aug 29, 2025
7f6cd7c
major -> minor feature
timvaillancourt Aug 29, 2025
66b12c2
rename test helper
timvaillancourt Aug 30, 2025
900f4ac
rename test helper again
timvaillancourt Aug 30, 2025
f3b3718
PR suggestions, wait for skipped recovery
timvaillancourt Sep 2, 2025
ed0a892
add changelog for new metric
timvaillancourt Sep 2, 2025
f89a760
typo fix
timvaillancourt Sep 2, 2025
32932f9
check primary after skipped recovery wait
timvaillancourt Sep 2, 2025
89c7c47
missing rename
timvaillancourt Sep 2, 2025
65cce15
rm dupe func
timvaillancourt Sep 2, 2025
bf5c551
move `SkippedRecoveries` to better place
timvaillancourt Sep 3, 2025
e824bd0
use `assert.EventuallyWithT(...)` for test helpers
timvaillancourt Sep 3, 2025
4ea4fc4
fix typo
timvaillancourt Sep 3, 2025
16 changes: 16 additions & 0 deletions changelog/23.0/23.0.0/summary.md
@@ -10,9 +10,11 @@
- [Metrics](#deleted-metrics)
- **[New Metrics](#new-metrics)**
- [VTGate](#new-vtgate-metrics)
- [VTOrc](#new-vtorc-metrics)
- **[Topology](#minor-changes-topo)**
- [`--consul_auth_static_file` requires 1 or more credentials](#consul_auth_static_file-check-creds)
- **[VTOrc](#minor-changes-vtorc)**
- [Dynamic control of `EmergencyReparentShard`-based recoveries](#vtorc-dynamic-ers-disabled)
- [Recovery stats to include keyspace/shard](#recoveries-stats-keyspace-shard)
- **[VTTablet](#minor-changes-vttablet)**
- [CLI Flags](#flags-vttablet)
@@ -64,6 +66,12 @@ VTGate also advertises MySQL version `8.4.6` by default instead of `8.0.40`. If
|:-----------------------:|:---------------:|:-----------------------------------------------------------------------------------:|:-------------------------------------------------------:|
| `TransactionsProcessed` | `Shard`, `Type` | Counts transactions processed at VTGate by shard distribution and transaction type. | [#18171](https://github.com/vitessio/vitess/pull/18171) |

#### <a id="new-vtorc-metrics"/>VTOrc

| Name | Dimensions | Description | PR |
|:-------------------:|:-----------------------------------:|:----------------------------------------------------:|:-------------------------------------------------------:|
| `SkippedRecoveries` | `RecoveryName`, `Keyspace`, `Shard` | Count of the different skipped recoveries performed. | [#17985](https://github.com/vitessio/vitess/pull/17985) |
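
As a rough illustration of what a counter with the `RecoveryName`, `Keyspace`, `Shard` dimensions tracks, here is a minimal standard-library sketch. The `labeledCounter` type and its methods are hypothetical stand-ins written for this example and are not the stats types VTOrc actually uses; the label values are illustrative.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// labeledCounter is a hypothetical multi-label counter: one count per
// unique combination of label values (for example recovery/keyspace/shard).
type labeledCounter struct {
	mu     sync.Mutex
	labels []string
	counts map[string]int64
}

func newLabeledCounter(labels ...string) *labeledCounter {
	return &labeledCounter{labels: labels, counts: make(map[string]int64)}
}

// Add increments the count for one combination of label values.
// It rejects calls whose value count does not match the label count.
func (c *labeledCounter) Add(values []string, delta int64) error {
	if len(values) != len(c.labels) {
		return fmt.Errorf("expected %d label values, got %d", len(c.labels), len(values))
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[strings.Join(values, ".")] += delta
	return nil
}

// Get returns the current count for one combination of label values.
func (c *labeledCounter) Get(values ...string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[strings.Join(values, ".")]
}

func main() {
	skipped := newLabeledCounter("RecoveryName", "Keyspace", "Shard")
	// Illustrative values: one skipped recovery on shard "0" of keyspace "ks".
	if err := skipped.Add([]string{"RecoverDeadPrimary", "ks", "0"}, 1); err != nil {
		fmt.Println("add failed:", err)
		return
	}
	fmt.Println("skipped recoveries:", skipped.Get("RecoverDeadPrimary", "ks", "0"))
}
```

Keying by the full label tuple is what lets an alert or dashboard break skipped recoveries down per shard rather than per VTOrc instance.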

### <a id="minor-changes-topo"/>Topology</a>

#### <a id="consul_auth_static_file-check-creds"/>`--consul_auth_static_file` requires 1 or more credentials</a>
@@ -72,6 +80,14 @@ The `--consul_auth_static_file` flag used in several components now requires tha

### <a id="minor-changes-vtorc"/>VTOrc</a>

#### <a id="vtorc-dynamic-ers-disabled"/>Dynamic control of `EmergencyReparentShard`-based recoveries</a>

**Note: disabling `EmergencyReparentShard`-based recoveries introduces availability risks; please use with extreme caution! If you rely on this functionality often, for example in automation, this may be a sign of an anti-pattern. If so, please open an issue to discuss supporting your use case natively in VTOrc.**

The new `vtctldclient` RPC `SetVtorcEmergencyReparent` allows VTOrc recoveries involving `EmergencyReparentShard` actions to be disabled on a per-keyspace and/or per-shard basis. Before this version, disabling `EmergencyReparentShard`-based recoveries was only possible globally, per VTOrc instance. VTOrc now considers this keyspace/shard-level setting, which is refreshed from the topo on each recovery. The disabled state is determined by first checking the keyspace state, and then the shard state. Removing a keyspace-level override does not remove per-shard overrides.

To provide observability of keyspaces/shards with `EmergencyReparentShard`-based VTOrc recoveries disabled, the `EmergencyReparentShardDisabled` metric was added. This metric can be used to create alerts that ensure `EmergencyReparentShard`-based recoveries are not disabled for an undesired period of time.
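
The precedence described in this section can be sketched as follows. This is one reading of the changelog text, not VTOrc's implementation: the `ersOverrides` type, its fields, and the `keyspace/shard` key format are assumptions made for the example.

```go
package main

import "fmt"

// ersOverrides is a hypothetical in-memory view of the topo settings.
// keyspaces maps a keyspace name to its "ERS disabled" flag; shards maps
// "keyspace/shard" to the shard-level flag.
type ersOverrides struct {
	keyspaces map[string]bool
	shards    map[string]bool
}

// disabled checks the keyspace-level override first and then falls back
// to the shard-level override, mirroring the order given in the changelog.
func (o ersOverrides) disabled(keyspace, shard string) bool {
	if o.keyspaces[keyspace] {
		return true
	}
	return o.shards[keyspace+"/"+shard]
}

func main() {
	o := ersOverrides{
		keyspaces: map[string]bool{"ks": true},
		shards:    map[string]bool{"ks/-80": true},
	}
	fmt.Println(o.disabled("ks", "80-")) // disabled through the keyspace

	// Re-enabling the keyspace leaves the per-shard override in place.
	o.keyspaces["ks"] = false
	fmt.Println(o.disabled("ks", "80-"))
	fmt.Println(o.disabled("ks", "-80"))
}
```

Under this reading, re-enabling a keyspace is not enough to restore recoveries on a shard that was disabled individually; that shard needs its own `--enable` call.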

#### <a id="recoveries-stats-keyspace-shard">Recovery stats to include keyspace/shard</a>

The following recovery-related stats now include labels for keyspaces and shards:
7 changes: 3 additions & 4 deletions go/cmd/vtctldclient/command/keyspaces.go
@@ -24,16 +24,15 @@ import (

"github.com/spf13/cobra"

"vitess.io/vitess/go/mysql/sqlerror"
"vitess.io/vitess/go/protoutil"
"vitess.io/vitess/go/vt/vtctl/reparentutil/policy"

"vitess.io/vitess/go/cmd/vtctldclient/cli"
"vitess.io/vitess/go/constants/sidecar"
"vitess.io/vitess/go/mysql"
"vitess.io/vitess/go/mysql/sqlerror"
"vitess.io/vitess/go/protoutil"
topodatapb "vitess.io/vitess/go/vt/proto/topodata"
vtctldatapb "vitess.io/vitess/go/vt/proto/vtctldata"
"vitess.io/vitess/go/vt/proto/vttime"
"vitess.io/vitess/go/vt/vtctl/reparentutil/policy"
)

var (
51 changes: 49 additions & 2 deletions go/cmd/vtctldclient/command/topology.go
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,9 @@ import (
"github.com/spf13/cobra"

"vitess.io/vitess/go/cmd/vtctldclient/cli"
"vitess.io/vitess/go/vt/topo"

vtctldatapb "vitess.io/vitess/go/vt/proto/vtctldata"
"vitess.io/vitess/go/vt/topo"
"vitess.io/vitess/go/vt/topo/topoproto"
)

var (
@@ -39,6 +39,16 @@ var (
RunE: commandGetTopologyPath,
}

	// SetVtorcEmergencyReparent enables/disables the use of EmergencyReparentShard in VTOrc recoveries for a given keyspace or keyspace/shard.
	// The shard argument is optional (see cobra.RangeArgs(1, 2) below), so it is shown in brackets in the usage line.
	SetVtorcEmergencyReparent = &cobra.Command{
		Use:                   "SetVtorcEmergencyReparent [--enable|-e] [--disable|-d] <keyspace> [<shard>]",
		Short:                 "Enable/disables the use of EmergencyReparentShard in VTOrc recoveries for a given keyspace or keyspace/shard.",
		DisableFlagsInUseLine: true,
		Aliases:               []string{"setvtorcemergencyreparent"},
		Args:                  cobra.RangeArgs(1, 2),
		RunE:                  commandSetVtorcEmergencyReparent,
	}

// WriteTopologyPath writes the contents of a local file to a path
// in the topology server.
WriteTopologyPath = &cobra.Command{
@@ -96,6 +106,39 @@ func commandGetTopologyPath(cmd *cobra.Command, args []string) error {
return nil
}

var setVtorcEmergencyReparentOptions = struct {
	Disable bool
	Enable  bool
}{}

func commandSetVtorcEmergencyReparent(cmd *cobra.Command, args []string) error {
	cli.FinishedParsing(cmd)

	ks := cmd.Flags().Arg(0)
	shard := cmd.Flags().Arg(1)
	keyspaceShard := topoproto.KeyspaceShardString(ks, shard)
	if !setVtorcEmergencyReparentOptions.Disable && !setVtorcEmergencyReparentOptions.Enable {
		return fmt.Errorf("SetVtorcEmergencyReparent(%v) error: must set --enable or --disable flag", keyspaceShard)
	}
	if setVtorcEmergencyReparentOptions.Disable && setVtorcEmergencyReparentOptions.Enable {
		return fmt.Errorf("SetVtorcEmergencyReparent(%v) error: --enable and --disable flags are mutually exclusive", keyspaceShard)
	}

	_, err := client.SetVtorcEmergencyReparent(commandCtx, &vtctldatapb.SetVtorcEmergencyReparentRequest{
		Keyspace: ks,
		Shard:    shard,
		Disable:  setVtorcEmergencyReparentOptions.Disable,
	})

	if err != nil {
		return fmt.Errorf("SetVtorcEmergencyReparent(%v) error: %w; please check the topo", keyspaceShard, err)
	}

	fmt.Printf("Successfully updated keyspace/shard %v.\n", keyspaceShard)

	return nil
}
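
The flag handling in `commandSetVtorcEmergencyReparent` reduces two booleans to the single `Disable` field of the request. A minimal standalone sketch of that reduction is below; `resolveDisable` is a hypothetical helper written for illustration and is not part of this change.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveDisable is a hypothetical helper that mirrors the checks in the
// command: exactly one of --enable / --disable must be set, and the value
// sent in the request is the state of --disable.
func resolveDisable(enable, disable bool) (bool, error) {
	switch {
	case !enable && !disable:
		return false, errors.New("must set --enable or --disable flag")
	case enable && disable:
		return false, errors.New("--enable and --disable flags are mutually exclusive")
	}
	return disable, nil
}

func main() {
	cases := []struct{ enable, disable bool }{
		{false, false},
		{true, true},
		{true, false},
		{false, true},
	}
	for _, tc := range cases {
		disable, err := resolveDisable(tc.enable, tc.disable)
		fmt.Printf("enable=%v disable=%v -> request.Disable=%v err=%v\n", tc.enable, tc.disable, disable, err)
	}
}
```

Pulling the check into a pure function like this would make it unit-testable without a cobra command. Depending on the cobra version in use, `MarkFlagsMutuallyExclusive` and `MarkFlagsOneRequired` may be an alternative way to express the same constraint declaratively.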

var writeTopologyPathOptions = struct {
// The cell to use for the copy. Defaults to the global cell.
cell string
@@ -131,6 +174,10 @@ func init() {
GetTopologyPath.Flags().BoolVar(&getTopologyPathOptions.dataAsJSON, "data-as-json", getTopologyPathOptions.dataAsJSON, "If true, only the data is output and it is in JSON format rather than prototext.")
Root.AddCommand(GetTopologyPath)

Root.AddCommand(SetVtorcEmergencyReparent)
SetVtorcEmergencyReparent.Flags().BoolVarP(&setVtorcEmergencyReparentOptions.Disable, "disable", "d", false, "Disable the use of EmergencyReparentShard in recoveries.")
SetVtorcEmergencyReparent.Flags().BoolVarP(&setVtorcEmergencyReparentOptions.Enable, "enable", "e", false, "Enable the use of EmergencyReparentShard in recoveries.")

WriteTopologyPath.Flags().StringVar(&writeTopologyPathOptions.cell, "cell", topo.GlobalCell, "Topology server cell to copy the file to.")
Root.AddCommand(WriteTopologyPath)
}
1 change: 1 addition & 0 deletions go/flags/endtoend/vtctldclient.txt
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,7 @@ Available Commands:
SetKeyspaceDurabilityPolicy Sets the durability-policy used by the specified keyspace.
SetShardIsPrimaryServing Add or remove a shard from serving. This is meant as an emergency function. It does not rebuild any serving graphs; i.e. it does not run `RebuildKeyspaceGraph`.
SetShardTabletControl Sets the TabletControl record for a shard and tablet type. Only use this for an emergency fix or after a finished MoveTables.
SetVtorcEmergencyReparent Enable/disables the use of EmergencyReparentShard in VTOrc recoveries for a given keyspace or keyspace/shard.
SetWritable Sets the specified tablet as writable or read-only.
ShardReplicationFix Walks through a ShardReplication object and fixes the first error encountered.
ShardReplicationPositions
1 change: 1 addition & 0 deletions go/test/endtoend/vtorc/api/api_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@ func TestAPIEndpoints(t *testing.T) {
"TableName": "vitess_keyspace",
"Rows": [
{
"disable_emergency_reparent": "0",
"durability_policy": "none",
"keyspace": "ks",
"keyspace_type": "0"
86 changes: 86 additions & 0 deletions go/test/endtoend/vtorc/primaryfailure/primary_failure_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,92 @@ func TestDownPrimary(t *testing.T) {
})
}

// bring down primary, with keyspace-level ERS disabled via SetVtorcEmergencyReparent --disable.
// confirm no ERS occurs.
func TestDownPrimary_KeyspaceEmergencyReparentDisabled(t *testing.T) {
defer utils.PrintVTOrcLogsOnFailure(t, clusterInfo.ClusterInstance)
utils.SetupVttabletsAndVTOrcs(t, clusterInfo, 2, 1, []string{fmt.Sprintf("%s=10s", vtutils.GetFlagVariantForTests("--remote-operation-timeout")), "--wait-replicas-timeout=5s"}, cluster.VTOrcConfiguration{}, 1, policy.DurabilityNone)
keyspace := &clusterInfo.ClusterInstance.Keyspaces[0]
shard0 := &keyspace.Shards[0]
// find primary from topo
curPrimary := utils.ShardPrimaryTablet(t, clusterInfo, keyspace, shard0)
assert.NotNil(t, curPrimary, "should have elected a primary")
vtOrcProcess := clusterInfo.ClusterInstance.VTOrcProcesses[0]
utils.WaitForSuccessfulRecoveryCount(t, vtOrcProcess, logic.ElectNewPrimaryRecoveryName, keyspace.Name, shard0.Name, 1)
utils.WaitForSuccessfulPRSCount(t, vtOrcProcess, keyspace.Name, shard0.Name, 1)

// find the replica and rdonly tablets
var replica, rdonly *cluster.Vttablet
for _, tablet := range shard0.Vttablets {
// we know we have only two replica tablets, so the one that is not the primary must be the other replica
if tablet.Alias != curPrimary.Alias && tablet.Type == "replica" {
replica = tablet
}
if tablet.Type == "rdonly" {
rdonly = tablet
}
}
assert.NotNil(t, replica, "could not find replica tablet")
assert.NotNil(t, rdonly, "could not find rdonly tablet")

// check that the replication is setup correctly before we failover
utils.CheckReplication(t, clusterInfo, curPrimary, []*cluster.Vttablet{rdonly, replica}, 10*time.Second)

// check that the ERS-disabled state is false before we disable ERS
utils.WaitForShardERSDisabledState(t, vtOrcProcess, keyspace.Name, shard0.Name, false)

// disable ERS on the keyspace via SetVtorcEmergencyReparent --disable
_, err := clusterInfo.ClusterInstance.VtctldClientProcess.ExecuteCommandWithOutput("SetVtorcEmergencyReparent", "--disable", keyspace.Name)
assert.NoError(t, err)
utils.WaitForShardERSDisabledState(t, vtOrcProcess, keyspace.Name, shard0.Name, true)
utils.CheckVarExists(t, vtOrcProcess, "EmergencyReparentShardDisabled")
utils.CheckMetricExists(t, vtOrcProcess, "vtorc_emergency_reparent_shard_disabled")

// make the current primary vttablet unavailable
err = curPrimary.VttabletProcess.TearDown()
assert.NoError(t, err)
err = curPrimary.MysqlctlProcess.Stop()
assert.NoError(t, err)
defer func() {
// we remove the tablet from our global list
utils.PermanentlyRemoveVttablet(clusterInfo, curPrimary)
}()

// check ERS did not occur. For the RecoverDeadPrimary recovery, expect 1 skipped recovery, 0 successful recoveries and 0 ERS operations.
utils.WaitForSkippedRecoveryCount(t, vtOrcProcess, logic.RecoverDeadPrimaryRecoveryName, keyspace.Name, shard0.Name, 1)
utils.WaitForSuccessfulRecoveryCount(t, vtOrcProcess, logic.RecoverDeadPrimaryRecoveryName, keyspace.Name, shard0.Name, 0)
utils.WaitForSuccessfulERSCount(t, vtOrcProcess, keyspace.Name, shard0.Name, 0)

// check that the shard primary remains the same because ERS is disabled on the keyspace
origPrimary := curPrimary
curPrimary = utils.ShardPrimaryTablet(t, clusterInfo, keyspace, shard0)
assert.NotNil(t, curPrimary)
assert.Equal(t, origPrimary.Alias, curPrimary.Alias)

// enable ERS on the keyspace via SetVtorcEmergencyReparent --enable
_, err = clusterInfo.ClusterInstance.VtctldClientProcess.ExecuteCommandWithOutput("SetVtorcEmergencyReparent", "--enable", keyspace.Name)
assert.NoError(t, err)
utils.WaitForShardERSDisabledState(t, vtOrcProcess, keyspace.Name, shard0.Name, false)

// check that the replica gets promoted by vtorc
utils.CheckPrimaryTablet(t, clusterInfo, replica, true)

// also check that the replication is working correctly after failover
utils.VerifyWritesSucceed(t, clusterInfo, replica, []*cluster.Vttablet{rdonly}, 10*time.Second)
utils.WaitForSuccessfulRecoveryCount(t, vtOrcProcess, logic.RecoverDeadPrimaryRecoveryName, keyspace.Name, shard0.Name, 1)
utils.WaitForSuccessfulERSCount(t, vtOrcProcess, keyspace.Name, shard0.Name, 1)
t.Run("Check ERS and PRS Vars and Metrics", func(t *testing.T) {
utils.CheckVarExists(t, vtOrcProcess, "EmergencyReparentCounts")
utils.CheckVarExists(t, vtOrcProcess, "PlannedReparentCounts")
utils.CheckVarExists(t, vtOrcProcess, "ReparentShardOperationTimings")

// Metrics registered in prometheus
utils.CheckMetricExists(t, vtOrcProcess, "vtorc_emergency_reparent_counts")
utils.CheckMetricExists(t, vtOrcProcess, "vtorc_planned_reparent_counts")
utils.CheckMetricExists(t, vtOrcProcess, "vtorc_reparent_shard_operation_timings_bucket")
})
}

// bring down primary before VTOrc has started, let vtorc repair.
func TestDownPrimaryBeforeVTOrc(t *testing.T) {
defer utils.PrintVTOrcLogsOnFailure(t, clusterInfo.ClusterInstance)