Skip to content

slack-19.0: support a minority of lagging tablets in ERS#677

Merged
tanjinx merged 34 commits intoslack-19.0from
ers-support-lag.slack-19.0
Aug 28, 2025
Merged

slack-19.0: support a minority of lagging tablets in ERS#677
tanjinx merged 34 commits intoslack-19.0from
ers-support-lag.slack-19.0

Conversation

@timvaillancourt
Copy link

@timvaillancourt timvaillancourt commented Jul 14, 2025

Description

This PR is an early-backport of vitessio#18531 to slack-19.0

I've addressed a first wave of PR comments/suggestions. If more changes happen upstream, a 2nd+ PR will be made

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

…ay logs

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@github-actions github-actions bot added this to the v19.0.7 milestone Jul 14, 2025
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…d candidate selection

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt timvaillancourt self-assigned this Aug 25, 2025
@timvaillancourt timvaillancourt marked this pull request as ready for review August 25, 2025 19:05
@timvaillancourt timvaillancourt requested a review from a team as a code owner August 25, 2025 19:05
@tanjinx tanjinx merged commit f329e83 into slack-19.0 Aug 28, 2025
161 of 167 checks passed
@tanjinx tanjinx deleted the ers-support-lag.slack-19.0 branch August 28, 2025 00:35
tanjinx added a commit that referenced this pull request Feb 12, 2026
* `EmergencyReparentShard`: wait only for majority of most advanced relay logs

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix bad cherry-pick

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix header, rename var

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* `EmergencyReparentShard`: include SQL thread position in most-advanced candidate selection

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* additional tests

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix test

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix tests

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* add source uuid

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* test cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix MySQL56GTIDSet sort

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* lint

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* lint again

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* support sort optimization in both paths

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix comment

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix cond

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* revert to simpler sorter

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix subtest name

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* log skipped candidates

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix test

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* update .AtLeast(), move to map[string]*RelayLogPositions

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix bad conflict fix

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* check for empty pointer in gtid logic

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix .IsZero()

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* remove conditional on status.RelayLogPosition

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Tanjin Xu <109303790+tanjinx@users.noreply.github.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
tanjinx added a commit that referenced this pull request Feb 14, 2026
PR #677 changed ERS to prioritize data safety by only considering the
majority of most-advanced replicas for promotion. This excludes lagging
replicas even if they have Prefer promotion rules.

This test expects a lagging replica with a Prefer rule to catch up and
then be promoted, but the new behavior removes it from consideration
before the catch-up phase. Slack prioritizes data safety over promotion
preferences in failover scenarios.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
tanjinx added a commit that referenced this pull request Feb 16, 2026
… ERS (#677) (#798)

* `slack-19.0`: support a minority of lagging tablets in ERS (#677)

* `EmergencyReparentShard`: wait only for majority of most advanced relay logs

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix bad cherry-pick

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix header, rename var

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* `EmergencyReparentShard`: include SQL thread position in most-advanced candidate selection

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* additional tests

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix test

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix tests

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* add source uuid

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* test cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix MySQL56GTIDSet sort

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* lint

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* lint again

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* support sort optimization in both paths

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix comment

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix cond

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* revert to simpler sorter

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix subtest name

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* log skipped candidates

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix test

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* update .AtLeast(), move to map[string]*RelayLogPositions

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix bad conflict fix

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* check for empty pointer in gtid logic

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix .IsZero()

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* remove conditional on status.RelayLogPosition

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Tanjin Xu <109303790+tanjinx@users.noreply.github.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Fix libdbd-mysql-perl dependency issues on Ubuntu 24.04

Ubuntu 24.04 rebuilt libdbd-mysql-perl to depend on libperconaserverclient22
(from MySQL/Percona 8.4), which is not available when using MySQL/Percona 8.0.
This causes installation failures in CI workflows.

Since libdbd-mysql-perl and percona-toolkit are not needed for our tests,
this commit uses APT preferences pinning to block their installation:
- Pin-Priority: -1 prevents these packages from being installed
- Added --no-install-recommends flag to percona-xtrabackup-80 installation

Changes:
- Updated workflow template: test/templates/cluster_endtoend_test.tpl
- Regenerated 3 cluster workflows that use xtrabackup
- Manually updated 6 upgrade/downgrade workflows that use xtrabackup

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove libdbd-mysql-perl from Docker install_dependencies.sh

The previous commit fixed CI workflows but missed the Docker build script.
install_dependencies.sh was still trying to install libdbd-mysql-perl and
percona-toolkit as BASE_PACKAGES, causing Docker builds to fail on Ubuntu 24.04.

These packages are not needed for Vitess functionality, so removing them
from the BASE_PACKAGES array.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove libdbd-mysql-perl from bootstrap Dockerfiles

The Dockerfiles in docker/bootstrap/ were still explicitly installing
libdbd-mysql-perl, which causes build failures on Ubuntu 24.04.

Removed libdbd-mysql-perl from:
- Dockerfile.mysql80
- Dockerfile.mysql84
- Dockerfile.percona80

This package is not needed for Vitess functionality.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Add --no-install-recommends to bootstrap Dockerfiles

The bootstrap Dockerfiles for mysql80 and mysql84 were installing packages
without --no-install-recommends flag. This caused percona-xtrabackup-80/84
to pull in libdbd-mysql-perl as a recommended dependency, which fails on
Ubuntu 24.04 due to libperconaserverclient22 dependency.

Added --no-install-recommends to prevent recommended packages from being
installed automatically.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Install libdbd-mysql-perl from Ubuntu repos before adding Percona repos

The key issue was that Ubuntu 24.04's libdbd-mysql-perl works fine with
system libraries, but the Percona repository has a version that depends
on libperconaserverclient22 (Percona 8.4), which conflicts with our
MySQL/Percona 8.0 setup.

Solution: Install libdbd-mysql-perl from Ubuntu repos BEFORE adding
Percona repositories. This way we get the Ubuntu version, and when we
later add Percona repos and install percona-xtrabackup, apt won't try
to upgrade libdbd-mysql-perl to the incompatible Percona version.

Changes:
- Updated workflow template to install libdbd-mysql-perl early
- Regenerated 3 cluster workflows
- Fixed 6 upgrade/downgrade workflows manually
- Updated all 3 bootstrap Dockerfiles
- Restored libdbd-mysql-perl to docker/utils/install_dependencies.sh

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Add percona-toolkit back to install_dependencies.sh

Now that libdbd-mysql-perl is installed from Ubuntu repos before adding
Percona repositories, percona-toolkit can also be installed safely from
Ubuntu repos. Both packages will be the Ubuntu versions and won't have
the libperconaserverclient22 dependency issue.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Hold libaio1 package and add --no-install-recommends to prevent upgrades

After setup-mysql manually installs an old version of libaio1 (to work around
Ubuntu 24.04 issues), subsequent apt-get install commands were upgrading it,
causing MySQL 5.7 binaries to fail with 'cannot open shared object file'.

Changes:
1. Hold libaio1 package after manual installation (apt-mark hold)
2. Add --no-install-recommends to all apt-get install commands in workflows

This prevents apt from upgrading the manually installed libaio1 package.

Fixes: cluster_endtoend_vreplication_across_db_versions test failures

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove --no-install-recommends flags from workflows

The --no-install-recommends flags were preventing necessary dependencies
from being installed, which may be causing the libaio.so.1 issue.

Reverting to allow recommended packages to be installed normally.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Revert libaio1 hold in setup-mysql action

Remove apt-mark hold libaio1 from setup-mysql action as it doesn't help
with the libaio.so.1 issue.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove --no-install-recommends from bootstrap Dockerfiles

Remove --no-install-recommends flags from DEBIAN_FRONTEND apt-get install
commands to allow recommended packages to be installed.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Add back --no-install-recommends in Dockerfile.common

Keep --no-install-recommends for the common base image to minimize
image size, but allow recommended packages in mysql80 and mysql84
Dockerfiles where they may be needed.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Revert all changes to docker/utils/install_dependencies.sh

Restore docker/utils/install_dependencies.sh to its original state from
before the libdbd-mysql-perl fixes. The file will have both libdbd-mysql-perl
and percona-toolkit in their original positions.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* try to fix setup mysql (vitessio#19371)

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>

* Skip TestDownPrimaryPromotionRuleWithLagCrossCenter

This test validates PreventCrossCellFailover configuration which Slack
does not use in production.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Skip TestDownPrimaryPromotionRuleWithLag

PR #677 changed ERS to prioritize data safety by only considering the
majority of most-advanced replicas for promotion. This excludes lagging
replicas even if they have Prefer promotion rules.

This test expects a lagging replica with a Prefer rule to catch up and
then be promoted, but the new behavior removes it from consideration
before the catch-up phase. Slack prioritizes data safety over promotion
preferences in failover scenarios.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Co-authored-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Claude <svc-devxp-claude@slack-corp.com>
Co-authored-by: Mohamed Hamza <mhamza@fastmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants