[slack-22.0] forward-port: support a minority of lagging tablets in ERS (#677) #798

Merged
tanjinx merged 16 commits into slack-22.0 from forward-port-677 on Feb 16, 2026

@tanjinx tanjinx commented Feb 12, 2026

Forward-port of #677.

Description

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

* `EmergencyReparentShard`: wait only for majority of most advanced relay logs

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix bad cherry-pick

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix header, rename var

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* `EmergencyReparentShard`: include SQL thread position in most-advanced candidate selection

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* additional tests

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix test

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix tests

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* add source uuid

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* test cleanup

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix MySQL56GTIDSet sort

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* lint

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* lint again

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* support sort optimization in both paths

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix comment

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix cond

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* revert to simpler sorter

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix subtest name

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* log skipped candidates

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix test

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* update .AtLeast(), move to map[string]*RelayLogPositions

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix bad conflict fix

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* check for empty pointer in gtid logic

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix .IsZero()

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* remove conditional on status.RelayLogPosition

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Tanjin Xu <109303790+tanjinx@users.noreply.github.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

@github-actions github-actions bot added this to the v22.0.3 milestone Feb 12, 2026
@tanjinx tanjinx changed the title from "slack-19.0: support a minority of lagging tablets in ERS (#677)" to "[slack-22.0] forward-port: support a minority of lagging tablets in ERS (#677)" Feb 12, 2026
@tanjinx tanjinx force-pushed the forward-port-677 branch 2 times, most recently from 6dd0a6d to ea3fe4b February 13, 2026 00:22

codecov-commenter commented Feb 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.80%. Comparing base (946a513) to head (e90f73c).

Additional details and impacted files
@@              Coverage Diff               @@
##           slack-22.0     #798      +/-   ##
==============================================
+ Coverage       69.77%   69.80%   +0.02%     
==============================================
  Files            1605     1605              
  Lines          213999   214027      +28     
==============================================
+ Hits           149324   149400      +76     
+ Misses          64675    64627      -48     

Ubuntu 24.04 rebuilt libdbd-mysql-perl to depend on libperconaserverclient22
(from MySQL/Percona 8.4), which is not available when using MySQL/Percona 8.0.
This causes installation failures in CI workflows.

Since libdbd-mysql-perl and percona-toolkit are not needed for our tests,
this commit uses APT preferences pinning to block their installation:
- Pin-Priority: -1 prevents these packages from being installed
- Added --no-install-recommends flag to percona-xtrabackup-80 installation

Changes:
- Updated workflow template: test/templates/cluster_endtoend_test.tpl
- Regenerated 3 cluster workflows that use xtrabackup
- Manually updated 6 upgrade/downgrade workflows that use xtrabackup

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

tanjinx and others added 12 commits February 12, 2026 17:12

The previous commit fixed CI workflows but missed the Docker build script.
install_dependencies.sh was still trying to install libdbd-mysql-perl and
percona-toolkit as BASE_PACKAGES, causing Docker builds to fail on Ubuntu 24.04.

These packages are not needed for Vitess functionality, so removing them
from the BASE_PACKAGES array.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

The Dockerfiles in docker/bootstrap/ were still explicitly installing
libdbd-mysql-perl, which causes build failures on Ubuntu 24.04.

Removed libdbd-mysql-perl from:
- Dockerfile.mysql80
- Dockerfile.mysql84
- Dockerfile.percona80

This package is not needed for Vitess functionality.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

The bootstrap Dockerfiles for mysql80 and mysql84 were installing packages
without --no-install-recommends flag. This caused percona-xtrabackup-80/84
to pull in libdbd-mysql-perl as a recommended dependency, which fails on
Ubuntu 24.04 due to libperconaserverclient22 dependency.

Added --no-install-recommends to prevent recommended packages from being
installed automatically.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

The key issue was that Ubuntu 24.04's libdbd-mysql-perl works fine with
system libraries, but the Percona repository has a version that depends
on libperconaserverclient22 (Percona 8.4), which conflicts with our
MySQL/Percona 8.0 setup.

Solution: Install libdbd-mysql-perl from Ubuntu repos BEFORE adding
Percona repositories. This way we get the Ubuntu version, and when we
later add Percona repos and install percona-xtrabackup, apt won't try
to upgrade libdbd-mysql-perl to the incompatible Percona version.

Changes:
- Updated workflow template to install libdbd-mysql-perl early
- Regenerated 3 cluster workflows
- Fixed 6 upgrade/downgrade workflows manually
- Updated all 3 bootstrap Dockerfiles
- Restored libdbd-mysql-perl to docker/utils/install_dependencies.sh

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Now that libdbd-mysql-perl is installed from Ubuntu repos before adding
Percona repositories, percona-toolkit can also be installed safely from
Ubuntu repos. Both packages will be the Ubuntu versions and won't have
the libperconaserverclient22 dependency issue.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

After setup-mysql manually installs an old version of libaio1 (to work around
Ubuntu 24.04 issues), subsequent apt-get install commands were upgrading it,
causing MySQL 5.7 binaries to fail with 'cannot open shared object file'.

Changes:
1. Hold libaio1 package after manual installation (apt-mark hold)
2. Add --no-install-recommends to all apt-get install commands in workflows

This prevents apt from upgrading the manually installed libaio1 package.

Fixes: cluster_endtoend_vreplication_across_db_versions test failures

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

The --no-install-recommends flags were preventing necessary dependencies
from being installed, which may be causing the libaio.so.1 issue.

Reverting to allow recommended packages to be installed normally.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Remove apt-mark hold libaio1 from setup-mysql action as it doesn't help
with the libaio.so.1 issue.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Remove --no-install-recommends flags from DEBIAN_FRONTEND apt-get install
commands to allow recommended packages to be installed.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Keep --no-install-recommends for the common base image to minimize
image size, but allow recommended packages in mysql80 and mysql84
Dockerfiles where they may be needed.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Restore docker/utils/install_dependencies.sh to its original state from
before the libdbd-mysql-perl fixes. The file will have both libdbd-mysql-perl
and percona-toolkit in their original positions.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>

@salesforce-cla

Thanks for the contribution! Before we can merge this, we need @mhamza15 to sign the Salesforce Inc. Contributor License Agreement.

This test validates the PreventCrossCellFailover configuration, which Slack
does not use in production.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

@tanjinx tanjinx marked this pull request as ready for review February 13, 2026 18:11
@tanjinx tanjinx requested a review from a team as a code owner February 13, 2026 18:11

PR #677 changed ERS to prioritize data safety by only considering the
majority of most-advanced replicas for promotion. This excludes lagging
replicas even if they have Prefer promotion rules.

This test expects a lagging replica with a Prefer rule to catch up and
then be promoted, but the new behavior removes it from consideration
before the catch-up phase. Slack prioritizes data safety over promotion
preferences in failover scenarios.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
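
The commit message above says ERS now considers only the majority of most-advanced replicas, so a lagging replica drops out before any promotion rule is applied. A minimal Go sketch of one possible reading of that selection step follows. It is not the code from #677: the `candidate` type, the `majorityMostAdvanced` function, the tablet aliases, and the use of a single integer in place of a GTID-based position are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate pairs a tablet alias with a replication position.
// Hypothetical stand-in: real positions are GTID sets, not one integer.
type candidate struct {
	Alias    string
	Position int64
}

// majorityMostAdvanced sorts candidates from most to least advanced and
// returns the first floor(n/2)+1 of them. Ties are broken by alias so the
// result is deterministic.
func majorityMostAdvanced(positions map[string]int64) []candidate {
	cands := make([]candidate, 0, len(positions))
	for alias, pos := range positions {
		cands = append(cands, candidate{Alias: alias, Position: pos})
	}
	sort.Slice(cands, func(i, j int) bool {
		if cands[i].Position != cands[j].Position {
			return cands[i].Position > cands[j].Position
		}
		return cands[i].Alias < cands[j].Alias
	})
	if len(cands) == 0 {
		return cands
	}
	return cands[:len(cands)/2+1]
}

func main() {
	// Illustrative aliases and positions only.
	positions := map[string]int64{
		"zone1-101": 120,
		"zone1-102": 118,
		"zone1-103": 120,
		"zone1-104": 40, // lagging replica, excluded from the majority
		"zone1-105": 119,
	}
	for _, c := range majorityMostAdvanced(positions) {
		fmt.Printf("%s at position %d\n", c.Alias, c.Position)
	}
}
```

Under this reading, the lagging tablet never reaches the catch-up phase even if it carries a Prefer promotion rule, which matches the test expectation change described in the commit message.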
@tanjinx tanjinx merged commit 3698889 into slack-22.0 Feb 16, 2026
90 of 92 checks passed
@tanjinx tanjinx deleted the forward-port-677 branch February 16, 2026 15:22

5 participants