Merged
41 commits
1b5760d
Wrap errors happening during PRS in VT15001
frouioui Jan 30, 2025
1d4855b
rollback when markSavepoint fails
frouioui Jan 31, 2025
a41541a
add comment
frouioui Jan 31, 2025
2af7e91
wip - add multi shards test case
frouioui Feb 3, 2025
6513c15
use vt15001 in more places and rollback as soon as we detect this error
frouioui Feb 3, 2025
dd3b498
revert / comment changes in tests
frouioui Feb 3, 2025
b66a4ef
fix panic
frouioui Feb 3, 2025
f11d7f9
wip
frouioui Feb 4, 2025
5ea8a96
Revert unwanted configuration change
frouioui Feb 12, 2025
ebe6aab
Fix TestReplicaTransactions assertion
frouioui Feb 12, 2025
1ecc210
Wrap tx errors in VT15001
frouioui Feb 12, 2025
b64827b
Fix E2E test and rename VT15001
frouioui Feb 12, 2025
b3fd5ca
Improve E2E tests for errors in tx
frouioui Feb 13, 2025
7135534
Wait for main action to complete before setting tablet to SERVING
frouioui Feb 13, 2025
512cc5c
Lower flakiness by using longer tx timeouts
frouioui Feb 14, 2025
db8249e
Merge remote-tracking branch 'origin/main' into improved-errors-durin…
frouioui Feb 14, 2025
83dba75
Check for multi-db commit failed warning
frouioui Feb 14, 2025
2829981
Remove unnecessary check
frouioui Feb 17, 2025
293720a
Block queries after VT15001 until ROLLBACK is received
frouioui Feb 19, 2025
c7ad375
Wrap healthcheck error only when in a tx
frouioui Feb 19, 2025
812df8f
review suggestions: better proto name and code comments
frouioui Feb 20, 2025
89060c8
review suggestions: more unit tests
frouioui Feb 20, 2025
a43a4d4
revert unwanted change to shard_buffer.go
frouioui Feb 20, 2025
f5ee299
wrap grpc errors in wrappedService
frouioui Feb 20, 2025
83dc5aa
fix TestReplicaTransactions
frouioui Feb 20, 2025
5baee4d
fix condition check to wrap error
frouioui Feb 20, 2025
f10eb16
allow SHOW to stop blocking queries after a VT15001
frouioui Feb 24, 2025
949b97f
apply review suggestions
frouioui Feb 26, 2025
a34c4fc
Merge remote-tracking branch 'origin/main' into improved-errors-durin…
frouioui Feb 26, 2025
2f0bae1
Move all the wrapping to vtgate + more tests
frouioui Feb 27, 2025
5aac23c
add more tests
frouioui Feb 27, 2025
518b18c
fix test expectations
frouioui Feb 27, 2025
0eea900
Merge remote-tracking branch 'origin/main' into improved-errors-durin…
frouioui Mar 1, 2025
abf6a6d
Merge remote-tracking branch 'origin/main' into improved-errors-durin…
frouioui Mar 3, 2025
381351e
Merge remote-tracking branch 'origin/main' into improved-errors-durin…
frouioui Mar 3, 2025
904da55
Move VT15002 to VT09032
frouioui Mar 10, 2025
c3f790e
Fix namings and comments
frouioui Mar 10, 2025
fa29793
Change functions to be reusable
frouioui Mar 10, 2025
c454768
Merge remote-tracking branch 'origin/main' into improved-errors-durin…
frouioui Mar 18, 2025
01530c9
apply review suggestions
frouioui Mar 18, 2025
f741e62
modify release notes
frouioui Mar 18, 2025
17 changes: 17 additions & 0 deletions changelog/22.0/22.0.0/summary.md
@@ -24,6 +24,7 @@
- **[KeyRanges in `--clusters_to_watch` in VTOrc](#key-range-vtorc)**
- **[Support for Filtering Query logs on Error](#query-logs)**
- **[Semi-sync monitor in vttablet](#semi-sync-monitor)**
- **[Wrapped fatal transaction errors](#new-errors-fatal-tx)**
- **[Minor Changes](#minor-changes)**
- **[VTTablet Flags](#flags-vttablet)**
- **[VTTablet ACL enforcement and reloading](#reloading-vttablet-acl)**
@@ -262,6 +263,22 @@ To address this, the new component continuously monitors the semi-sync status. I

The monitoring interval can be adjusted using the `--semi-sync-monitor-interval` flag, which defaults to 10 seconds.

---

### <a id="new-errors-fatal-tx"/>Wrapped fatal transaction errors</a>


When a query fails inside a transaction because the transaction is no longer valid (e.g. PRS, rollout, primary down), the original error is now wrapped in a `VT15001` error.

For non-transactional queries that produce a `VT15001`, VTGate will try to roll back and clear the transaction.
Any new query on the same connection will fail with a `VT09032` error until a `ROLLBACK` is received,
acknowledging that the transaction was automatically rolled back and cleared by VTGate.

`VT09032` is returned so that applications do not keep sending queries to VTGate in the belief that they are still in a transaction.

This change was introduced by [#17669](https://github.com/vitessio/vitess/pull/17669).


---

## <a id="minor-changes"/>Minor Changes</a>
2 changes: 1 addition & 1 deletion go/test/endtoend/reparent/newfeaturetest/reparent_test.go
@@ -182,7 +182,7 @@ func TestERSWithWriteInPromoteReplica(t *testing.T) {
}

func TestBufferingWithMultipleDisruptions(t *testing.T) {
- clusterInstance := utils.SetupShardedReparentCluster(t, policy.DurabilitySemiSync)
+ clusterInstance := utils.SetupShardedReparentCluster(t, policy.DurabilitySemiSync, nil)
defer utils.TeardownCluster(clusterInstance)

// Stop all VTOrc instances, so that they don't interfere with the test.
224 changes: 224 additions & 0 deletions go/test/endtoend/reparent/newfeaturetest/reparent_with_open_tx_test.go
@@ -0,0 +1,224 @@
/*
Copyright 2025 The Vitess Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package newfeaturetest

import (
"context"
"testing"
"time"

"github.com/stretchr/testify/require"

"vitess.io/vitess/go/vt/vterrors"

"vitess.io/vitess/go/mysql"
"vitess.io/vitess/go/test/endtoend/cluster"
"vitess.io/vitess/go/test/endtoend/reparent/utils"
"vitess.io/vitess/go/vt/vtctl/reparentutil/policy"
)

var primary int

// This test ensures that we get a VT15001 when doing a commit while the primary is down.
func testCommitError(t *testing.T, conn *mysql.Conn, clusterInstance *cluster.LocalProcessCluster, tablets []*cluster.Vttablet) {
tabletStopped := make(chan bool)
commitDone := make(chan bool)
idx := 1
createTxAndInsertRows(conn, t, &idx)

go func() {
<-tabletStopped
_, err := conn.ExecuteFetch("commit", 0, false)
require.ErrorContains(t, err, "VT15001")
commitDone <- true
}()

reparent(t, clusterInstance, tablets, tabletStopped, commitDone)

_, err := conn.ExecuteFetch("delete from vt_insert_test", 0, false)
require.NoError(t, err)
}

// This test ensures that we are getting a VT15001 when executing a query on an open transaction
// while the primary is down. Subsequent queries should fail with a VT09032 until a ROLLBACK
// or SHOW WARNINGS is issued.
func testExecuteError(t *testing.T, conn *mysql.Conn, clusterInstance *cluster.LocalProcessCluster, tablets []*cluster.Vttablet) {
tabletStopped := make(chan bool)
executeDone := make(chan bool)
idx := 1
createTxAndInsertRows(conn, t, &idx)

go func() {
idx += 5
<-tabletStopped
_, err := conn.ExecuteFetch(utils.GetInsertMultipleValuesQuery(idx, idx+1, idx+2, idx+3), 0, false)
require.ErrorContains(t, err, "VT15001")

// Subsequent queries after a VT15001 should start returning a VT09032 error until we issue a ROLLBACK
_, err = conn.ExecuteFetch("select * from vt_insert_test", 1, false)
require.ErrorContains(t, err, "VT09032")

_, err = conn.ExecuteFetch("rollback", 0, false)
require.NoError(t, err)
executeDone <- true
}()

reparent(t, clusterInstance, tablets, tabletStopped, executeDone)

// If the unhealthy shard is the first one where we committed, assert that the table is empty on all the shards.
r, err := conn.ExecuteFetch("select * from vt_insert_test", 1, false)
require.NoError(t, err)
require.Len(t, r.Rows, 0)
}

func testExecuteErrorWhileTabletIsNotServing(t *testing.T, conn *mysql.Conn, clusterInstance *cluster.LocalProcessCluster, tablets []*cluster.Vttablet) {
tabletNotServing := make(chan bool)
executeDone := make(chan bool)
idx := 1
createTxAndInsertRows(conn, t, &idx)

go func() {
idx += 5
<-tabletNotServing
_, err := conn.ExecuteFetch(utils.GetInsertMultipleValuesQuery(idx, idx+1, idx+2, idx+3), 0, false)
require.ErrorContains(t, err, "VT15001")
require.ErrorContains(t, err, vterrors.WrongTablet)

// Subsequent queries after a VT15001 should start returning a VT09032 error until we issue a ROLLBACK
_, err = conn.ExecuteFetch("select * from vt_insert_test", 1, false)
require.ErrorContains(t, err, "VT09032")

_, err = conn.ExecuteFetch("rollback", 0, false)
require.NoError(t, err)
executeDone <- true
}()

makeTabletNotServing(t, clusterInstance, tablets, tabletNotServing, executeDone)

// If the unhealthy shard is the first one where we committed, assert that the table is empty on all the shards.
r, err := conn.ExecuteFetch("select * from vt_insert_test", 1, false)
require.NoError(t, err)
require.Len(t, r.Rows, 0)
}

func createTxAndInsertRows(conn *mysql.Conn, t *testing.T, idx *int) {
_, err := conn.ExecuteFetch("begin", 0, false)
require.NoError(t, err)

for i := 0; i < 25; i++ {
*idx += 5
_, err = conn.ExecuteFetch(utils.GetInsertMultipleValuesQuery(*idx, *idx+1, *idx+2, *idx+3), 0, false)
require.NoError(t, err)
time.Sleep(10 * time.Millisecond)
}
}

func reparent(t *testing.T, clusterInstance *cluster.LocalProcessCluster, tablets []*cluster.Vttablet, tabletStopped, actionDone chan bool) {
// Reparent to the other replica
utils.ShardName = "40-80"
defer func() {
utils.ShardName = "0"
}()

prsTo := primary - 1
if primary == 0 {
prsTo = primary + 1
}
output, err := utils.Prs(t, clusterInstance, tablets[prsTo])
require.NoError(t, err, "error in PlannedReparentShard output - %s", output)

// We now restart the vttablet that became a replica.
utils.StopTablet(t, tablets[primary], false)
tabletStopped <- true

// Wait for the action triggering the VT15001 to be done before moving on
<-actionDone

tablets[primary].VttabletProcess.ServingStatus = "SERVING"
err = tablets[primary].VttabletProcess.Setup()
require.NoError(t, err)
primary = prsTo
}

func makeTabletNotServing(t *testing.T, clusterInstance *cluster.LocalProcessCluster, tablets []*cluster.Vttablet, stateChanged, actionDone chan bool) {
// Reparent to the other replica
utils.ShardName = "40-80"
defer func() {
utils.ShardName = "0"
}()

prsTo := primary - 1
if primary == 0 {
prsTo = primary + 1
}
output, err := utils.Prs(t, clusterInstance, tablets[prsTo])
require.NoError(t, err, "error in PlannedReparentShard output - %s", output)

// We now restart the vttablet that became a replica.
utils.StopTablet(t, tablets[primary], false)
tablets[primary].VttabletProcess.ServingStatus = "NOT_SERVING"
err = tablets[primary].VttabletProcess.Setup()
require.NoError(t, err)
stateChanged <- true

// Wait for the action triggering the VT15001 to be done before moving on
<-actionDone

utils.StopTablet(t, tablets[primary], false)
tablets[primary].VttabletProcess.ServingStatus = "SERVING"
err = tablets[primary].VttabletProcess.Setup()
require.NoError(t, err)
primary = prsTo
}

func TestErrorsInTransaction(t *testing.T) {
clusterInstance := utils.SetupShardedReparentCluster(t, policy.DurabilitySemiSync, []string{
"--queryserver-config-transaction-timeout", "5m",
"--queryserver-config-query-timeout", "5m",
})

defer utils.TeardownCluster(clusterInstance)

keyspace := clusterInstance.Keyspaces[0]
vtParams := clusterInstance.GetVTParams(keyspace.Name)
tablets := clusterInstance.Keyspaces[0].Shards[1].Vttablets

primary = 0

// Start by reparenting all the shards to the first tablet.
// Confirm that the replication is set up correctly in the beginning.
// tablets[0] is the primary tablet in the beginning.
utils.ConfirmReplication(t, tablets[primary], []*cluster.Vttablet{tablets[1], tablets[2]})

conn, err := mysql.Connect(context.Background(), &vtParams)
require.NoError(t, err)

_, err = conn.ExecuteFetch("delete from vt_insert_test", 0, false)
require.NoError(t, err)

t.Run("commit while reparenting", func(t *testing.T) {
testCommitError(t, conn, clusterInstance, tablets)
})

t.Run("execute DML while tablet is NOT_SERVING", func(t *testing.T) {
testExecuteErrorWhileTabletIsNotServing(t, conn, clusterInstance, tablets)
})

t.Run("execute DML while reparenting", func(t *testing.T) {
testExecuteError(t, conn, clusterInstance, tablets)
})
}
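
The three sub-tests above share one coordination pattern: the disruptive action signals on one channel once the tablet has been stopped, the client goroutine then runs its query and signals back on a second channel, and only then is the tablet restored. A stripped-down sketch of that handshake follows; `disrupt`, `client`, and `run` are hypothetical stand-ins for `reparent` and the query goroutine, and no Vitess APIs are involved.

```go
package main

import "fmt"

// disrupt mimics the shape of reparent(): break something, tell the client,
// wait for the client to finish, then repair.
func disrupt(log *[]string, stopped chan<- bool, actionDone <-chan bool) {
	*log = append(*log, "tablet stopped")
	stopped <- true
	// Block until the client has observed the failure.
	<-actionDone
	*log = append(*log, "tablet restored")
}

// client mimics the query goroutine: wait for the disruption, run the query
// that is expected to fail, then release the disrupting side.
func client(log *[]string, stopped <-chan bool, actionDone chan<- bool) {
	<-stopped
	*log = append(*log, "query executed while tablet is down")
	actionDone <- true
}

// run wires both sides together with unbuffered channels, so every append to
// log is ordered by a channel send/receive pair.
func run() []string {
	var log []string
	stopped := make(chan bool)
	actionDone := make(chan bool)
	go client(&log, stopped, actionDone)
	disrupt(&log, stopped, actionDone)
	return log
}

func main() {
	for _, step := range run() {
		fmt.Println(step)
	}
}
```

Because both channels are unbuffered, the query is guaranteed to run in the window between "stopped" and "restored", which is what commit 7135534 ("Wait for main action to complete before setting tablet to SERVING") appears to be after.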
27 changes: 19 additions & 8 deletions go/test/endtoend/reparent/utils/utils.go
@@ -43,13 +43,14 @@ import (
)

var (
- KeyspaceName = "ks"
- dbName = "vt_" + KeyspaceName
- username = "vt_dba"
- Hostname = "localhost"
- insertVal = 1
- insertSQL = "insert into vt_insert_test(id, msg) values (%d, 'test %d')"
- sqlSchema = `
+ KeyspaceName = "ks"
+ dbName = "vt_" + KeyspaceName
+ username = "vt_dba"
+ Hostname = "localhost"
+ insertVal = 1
+ insertSQL = "insert into vt_insert_test(id, msg) values (%d, 'test %d')"
+ insertSQLMultipleValues = "insert into vt_insert_test(id, msg) values (%d, 'test %d'), (%d, 'test %d'), (%d, 'test %d'), (%d, 'test %d')"
+ sqlSchema = `
create table vt_insert_test (
id bigint,
msg varchar(64),
@@ -76,7 +77,7 @@ func SetupRangeBasedCluster(ctx context.Context, t *testing.T) *cluster.LocalPro
}

// SetupShardedReparentCluster is used to setup a sharded cluster for testing
- func SetupShardedReparentCluster(t *testing.T, durability string) *cluster.LocalProcessCluster {
+ func SetupShardedReparentCluster(t *testing.T, durability string, extraVttabletFlags []string) *cluster.LocalProcessCluster {
clusterInstance := cluster.NewCluster(cell1, Hostname)
// Start topo server
err := clusterInstance.StartTopo()
@@ -88,6 +89,11 @@ func SetupShardedReparentCluster(t *testing.T, durability string) *cluster.Local
"--health_check_interval", "1s",
"--track_schema_versions=true",
"--queryserver_enable_online_ddl=false")

if len(extraVttabletFlags) > 0 {
clusterInstance.VtTabletExtraArgs = append(clusterInstance.VtTabletExtraArgs, extraVttabletFlags...)
}

clusterInstance.VtGateExtraArgs = append(clusterInstance.VtGateExtraArgs,
"--enable_buffer",
// Long timeout in case failover is slow.
@@ -117,6 +123,11 @@ func GetInsertQuery(idx int) string {
return fmt.Sprintf(insertSQL, idx, idx)
}

// GetInsertMultipleValuesQuery returns a built insert query to insert multiple rows at once.
func GetInsertMultipleValuesQuery(idx1, idx2, idx3, idx4 int) string {
return fmt.Sprintf(insertSQLMultipleValues, idx1, idx1, idx2, idx2, idx3, idx3, idx4, idx4)
}

// GetSelectionQuery returns a built selection query read the data.
func GetSelectionQuery() string {
return `select * from vt_insert_test`
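
To make the new helper concrete, the sketch below reproduces the format string from the diff in a standalone program and prints the statement generated for the first batch that `createTxAndInsertRows` sends (`idx` starts at 1 and is bumped by 5 before each insert, so the first ids are 6 to 9). The lower-case function name is only there to keep the example self-contained; it is a copy, not the exported helper.

```go
package main

import "fmt"

// Format string copied from the diff above.
const insertSQLMultipleValues = "insert into vt_insert_test(id, msg) values (%d, 'test %d'), (%d, 'test %d'), (%d, 'test %d'), (%d, 'test %d')"

// getInsertMultipleValuesQuery is a local copy of GetInsertMultipleValuesQuery:
// each index is used twice, once for the id and once inside the message.
func getInsertMultipleValuesQuery(idx1, idx2, idx3, idx4 int) string {
	return fmt.Sprintf(insertSQLMultipleValues, idx1, idx1, idx2, idx2, idx3, idx3, idx4, idx4)
}

func main() {
	fmt.Println(getInsertMultipleValuesQuery(6, 7, 8, 9))
	// prints: insert into vt_insert_test(id, msg) values (6, 'test 6'), (7, 'test 7'), (8, 'test 8'), (9, 'test 9')
}
```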
5 changes: 3 additions & 2 deletions go/test/endtoend/tabletgateway/vtgate_test.go
@@ -261,7 +261,7 @@ func TestReplicaTransactions(t *testing.T) {
_ = replicaTablet.VttabletProcess.TearDown()
// Healthcheck interval on tablet is set to 1s, so sleep for 2s
time.Sleep(2 * time.Second)
- utils.AssertContainsError(t, readConn, fetchAllCustomers, "is either down or nonexistent")
+ utils.AssertContainsMultipleErrors(t, readConn, fetchAllCustomers, "VT15001", "is either down or nonexistent")

// bring up the tablet again
// trying to use the same session/transaction should fail as the vtgate has
@@ -271,7 +271,8 @@ func TestReplicaTransactions(t *testing.T) {
require.NoError(t, err)
serving := replicaTablet.VttabletProcess.WaitForStatus("SERVING", 60*time.Second)
assert.Equal(t, serving, true, "Tablet did not become ready within a reasonable time")
- utils.AssertContainsError(t, readConn, fetchAllCustomers, "not found")
+ utils.AssertContainsError(t, readConn, fetchAllCustomers, "VT09032")
+ utils.Exec(t, readConn, "rollback")

// create a new connection, should be able to query again
readConn, err = mysql.Connect(ctx, &vtParams)
11 changes: 11 additions & 0 deletions go/test/endtoend/utils/utils.go
@@ -118,6 +118,17 @@ func AssertContainsError(t *testing.T, conn *mysql.Conn, query, expected string)
assert.ErrorContains(t, err, expected, "actual error: %s", err.Error())
}

// AssertContainsMultipleErrors acts the same way as AssertContainsError, but it will assert that
// multiple sub-strings are present in the error
func AssertContainsMultipleErrors(t *testing.T, conn *mysql.Conn, query string, expected ...string) {
t.Helper()
_, err := ExecAllowError(t, conn, query)
require.Error(t, err)
for _, s := range expected {
assert.ErrorContains(t, err, s, "actual error: %s", err.Error())
}
}

// AssertMatchesNoOrder executes the given query and makes sure it matches the given `expected` string.
// The order applied to the results or expectation is ignored. They are both re-sorted.
func AssertMatchesNoOrder(t *testing.T, conn *mysql.Conn, query, expected string) {
1 change: 0 additions & 1 deletion go/vt/discovery/healthcheck.go
@@ -889,7 +889,6 @@ func (hc *HealthCheckImpl) TabletConnection(ctx context.Context, alias *topodata
thc := hc.healthByAlias[tabletAliasString(topoproto.TabletAliasString(alias))]
hc.mu.Unlock()
if thc == nil || thc.Conn == nil {
- // TODO: test that throws this error
return nil, vterrors.Errorf(vtrpc.Code_NOT_FOUND, "tablet: %v is either down or nonexistent", alias)
}
return thc.Connection(ctx), nil