slack-19.0: support a minority of lagging tablets in ERS #677
Merged
tanjinx merged 34 commits into slack-19.0 on Aug 28, 2025
…ay logs Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…d candidate selection Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
tanjinx approved these changes on Aug 28, 2025
tanjinx added a commit that referenced this pull request on Feb 12, 2026
* `EmergencyReparentShard`: wait only for majority of most advanced relay logs
* fix bad cherry-pick
* fix header, rename var
* cleanup
* cleanup
* `EmergencyReparentShard`: include SQL thread position in most-advanced candidate selection
* additional tests
* fix test
* fix tests
* add source uuid
* test cleanup
* fix MySQL56GTIDSet sort
* lint
* lint again
* support sort optimization in both paths
* fix comment
* fix cond
* revert to simpler sorter
* fix subtest name
* log skipped candidates
* fix test
* update .AtLeast(), move to map[string]*RelayLogPositions
* fix bad conflict fix
* check for empty pointer in gtid logic
* fix .IsZero()
* remove conditional on status.RelayLogPosition

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Tanjin Xu <109303790+tanjinx@users.noreply.github.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
tanjinx added a commit that referenced this pull request on Feb 14, 2026
PR #677 changed ERS to prioritize data safety by only considering the majority of most-advanced replicas for promotion. This excludes lagging replicas even if they have Prefer promotion rules.

This test expects a lagging replica with a Prefer rule to catch up and then be promoted, but the new behavior removes it from consideration before the catch-up phase. Slack prioritizes data safety over promotion preferences in failover scenarios.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
tanjinx added a commit that referenced this pull request on Feb 16, 2026
… ERS (#677) (#798)

* `slack-19.0`: support a minority of lagging tablets in ERS (#677)

  * `EmergencyReparentShard`: wait only for majority of most advanced relay logs
  * fix bad cherry-pick
  * fix header, rename var
  * cleanup
  * cleanup
  * `EmergencyReparentShard`: include SQL thread position in most-advanced candidate selection
  * additional tests
  * fix test
  * fix tests
  * add source uuid
  * test cleanup
  * fix MySQL56GTIDSet sort
  * lint
  * lint again
  * support sort optimization in both paths
  * fix comment
  * fix cond
  * revert to simpler sorter
  * fix subtest name
  * log skipped candidates
  * fix test
  * update .AtLeast(), move to map[string]*RelayLogPositions
  * fix bad conflict fix
  * check for empty pointer in gtid logic
  * fix .IsZero()
  * remove conditional on status.RelayLogPosition

  ---------

  Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
  Co-authored-by: Tanjin Xu <109303790+tanjinx@users.noreply.github.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Fix libdbd-mysql-perl dependency issues on Ubuntu 24.04

  Ubuntu 24.04 rebuilt libdbd-mysql-perl to depend on libperconaserverclient22 (from MySQL/Percona 8.4), which is not available when using MySQL/Percona 8.0. This causes installation failures in CI workflows.

  Since libdbd-mysql-perl and percona-toolkit are not needed for our tests, this commit uses APT preferences pinning to block their installation:
  - Pin-Priority: -1 prevents these packages from being installed
  - Added --no-install-recommends flag to percona-xtrabackup-80 installation

  Changes:
  - Updated workflow template: test/templates/cluster_endtoend_test.tpl
  - Regenerated 3 cluster workflows that use xtrabackup
  - Manually updated 6 upgrade/downgrade workflows that use xtrabackup

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove libdbd-mysql-perl from Docker install_dependencies.sh

  The previous commit fixed CI workflows but missed the Docker build script. install_dependencies.sh was still trying to install libdbd-mysql-perl and percona-toolkit as BASE_PACKAGES, causing Docker builds to fail on Ubuntu 24.04.

  These packages are not needed for Vitess functionality, so removing them from the BASE_PACKAGES array.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove libdbd-mysql-perl from bootstrap Dockerfiles

  The Dockerfiles in docker/bootstrap/ were still explicitly installing libdbd-mysql-perl, which causes build failures on Ubuntu 24.04.

  Removed libdbd-mysql-perl from:
  - Dockerfile.mysql80
  - Dockerfile.mysql84
  - Dockerfile.percona80

  This package is not needed for Vitess functionality.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Add --no-install-recommends to bootstrap Dockerfiles

  The bootstrap Dockerfiles for mysql80 and mysql84 were installing packages without --no-install-recommends flag. This caused percona-xtrabackup-80/84 to pull in libdbd-mysql-perl as a recommended dependency, which fails on Ubuntu 24.04 due to libperconaserverclient22 dependency.

  Added --no-install-recommends to prevent recommended packages from being installed automatically.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Install libdbd-mysql-perl from Ubuntu repos before adding Percona repos

  The key issue was that Ubuntu 24.04's libdbd-mysql-perl works fine with system libraries, but the Percona repository has a version that depends on libperconaserverclient22 (Percona 8.4), which conflicts with our MySQL/Percona 8.0 setup.

  Solution: Install libdbd-mysql-perl from Ubuntu repos BEFORE adding Percona repositories. This way we get the Ubuntu version, and when we later add Percona repos and install percona-xtrabackup, apt won't try to upgrade libdbd-mysql-perl to the incompatible Percona version.

  Changes:
  - Updated workflow template to install libdbd-mysql-perl early
  - Regenerated 3 cluster workflows
  - Fixed 6 upgrade/downgrade workflows manually
  - Updated all 3 bootstrap Dockerfiles
  - Restored libdbd-mysql-perl to docker/utils/install_dependencies.sh

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Add percona-toolkit back to install_dependencies.sh

  Now that libdbd-mysql-perl is installed from Ubuntu repos before adding Percona repositories, percona-toolkit can also be installed safely from Ubuntu repos. Both packages will be the Ubuntu versions and won't have the libperconaserverclient22 dependency issue.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Hold libaio1 package and add --no-install-recommends to prevent upgrades

  After setup-mysql manually installs an old version of libaio1 (to work around Ubuntu 24.04 issues), subsequent apt-get install commands were upgrading it, causing MySQL 5.7 binaries to fail with 'cannot open shared object file'.

  Changes:
  1. Hold libaio1 package after manual installation (apt-mark hold)
  2. Add --no-install-recommends to all apt-get install commands in workflows

  This prevents apt from upgrading the manually installed libaio1 package.

  Fixes: cluster_endtoend_vreplication_across_db_versions test failures

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove --no-install-recommends flags from workflows

  The --no-install-recommends flags were preventing necessary dependencies from being installed, which may be causing the libaio.so.1 issue. Reverting to allow recommended packages to be installed normally.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Revert libaio1 hold in setup-mysql action

  Remove apt-mark hold libaio1 from setup-mysql action as it doesn't help with the libaio.so.1 issue.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove --no-install-recommends from bootstrap Dockerfiles

  Remove --no-install-recommends flags from DEBIAN_FRONTEND apt-get install commands to allow recommended packages to be installed.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Add back --no-install-recommends in Dockerfile.common

  Keep --no-install-recommends for the common base image to minimize image size, but allow recommended packages in mysql80 and mysql84 Dockerfiles where they may be needed.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Revert all changes to docker/utils/install_dependencies.sh

  Restore docker/utils/install_dependencies.sh to its original state from before the libdbd-mysql-perl fixes. The file will have both libdbd-mysql-perl and percona-toolkit in their original positions.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* try to fix setup mysql (vitessio#19371)

  Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>

* Skip TestDownPrimaryPromotionRuleWithLagCrossCenter

  This test validates PreventCrossCellFailover configuration which Slack does not use in production.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Skip TestDownPrimaryPromotionRuleWithLag

  PR #677 changed ERS to prioritize data safety by only considering the majority of most-advanced replicas for promotion. This excludes lagging replicas even if they have Prefer promotion rules.

  This test expects a lagging replica with a Prefer rule to catch up and then be promoted, but the new behavior removes it from consideration before the catch-up phase. Slack prioritizes data safety over promotion preferences in failover scenarios.

  Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
  Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Co-authored-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Claude <svc-devxp-claude@slack-corp.com>
Co-authored-by: Mohamed Hamza <mhamza@fastmail.com>
Description
This PR is an early backport of vitessio#18531 to `slack-19.0`.

I've addressed a first wave of PR comments/suggestions. If more changes happen upstream, a follow-up PR will be made.
Related Issue(s)
`EmergencyReparentShard` to fail vitessio/vitess#18529
`EmergencyReparentShard`: include SQL thread position in most-advanced candidate selection vitessio/vitess#18531
Checklist
Deployment Notes