Skip to content

"Fix" flaky sharded cluster tests#682

Merged
nammn merged 10 commits intomasterfrom
increase-runs
Jan 22, 2026
Merged

"Fix" flaky sharded cluster tests#682
nammn merged 10 commits intomasterfrom
increase-runs

Conversation

@nammn
Copy link
Collaborator

@nammn nammn commented Jan 7, 2026

Summary

Sharded cluster tests were failing during rolling restarts and architecture migrations due to the MongoDBBackgroundTester health check thresholds being too strict for the actual operation durations.

Root Cause Analysis (CLOUDP-375105)

Analysis of 10 recent Evergreen master builds revealed:

Test Failures Root Cause
e2e_sharded_cluster_migration 7 Architecture migration restarts cause extended unavailability; threshold of 1 was too strict
e2e_sharded_cluster_upgrade_downgrade 6 Config server rolling restart causes ~2 min unavailability window

Why these failures happen:

  • During sharded cluster version changes, all 3 config server nodes restart in succession
  • Mongos routers lose connectivity to the config server replica set during this window
  • The MongoDBBackgroundTester polls every 3 seconds and was configured to fail after just 1-5 consecutive errors
  • A 2-minute unavailability window would require tolerating ~40 consecutive failures

Why threshold increases are appropriate here:

  • Config server unavailability during rolling restarts is expected behavior for version changes
  • The previous thresholds were unrealistically strict given actual operation duration
  • Setting to 5 is a conservative starting point while CLOUDP-375105 investigates whether the operator can better coordinate primary elections

Changes

  1. sharded_cluster_migration.py: Increased allowed_sequential_failures from 1 → 5
  2. sharded_cluster_upgrade_downgrade.py: Kept at 5, added documentation comment
  3. replica_set.py: Added @pytest.mark.flaky decorator for automatic reruns

Affected Variants

  • e2e_static_multi_cluster_kind

Proof of Work

Green PR patch

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title? → CLOUDP-375105
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?

@nammn nammn requested a review from a team as a code owner January 7, 2026 14:19
@github-actions
Copy link

github-actions bot commented Jan 7, 2026

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.7.0 Release Notes

New Features

  • Allows users to override any Ops Manager emptyDir mount with their own PVCs via overrides statefulSet.spec.volumeClaimTemplates.
  • Added support for auto embeddings in MongoDB Community to automatically generate vector embeddings for the vector search data. This document can be followed for detailed documentation
  • MongoDBSearch: Updated the default mongodb/mongodb-search image version to 0.60.1. This is the version MCK uses if .spec.version is not specified.

Bug Fixes

  • Fix an issue to ensure that hosts are consistently removed from Ops Manager monitoring during AppDB scale-down events.
  • Fixed an issue where monitoring agents would fail after disabling TLS on a MongoDB deployment.
  • Persistent Volume Claim resize fix: Fixed an issue where the Operator ignored namespaces when listing PVCs, causing conflicts with resizing PVCs of the same name. Now, PVCs are filtered by both name and namespace for accurate resizing.
  • Fixed a panic that occurred when the domain names for a horizon was empty. Now, if the domain names are not valid (RFC 1123), the validation will fail before reconciling.
  • MongoDBMultiCluster, MongoDB: Fix an issue where the operator skipped host removal when an external domain was used, leaving monitoring hosts in Ops Manager even after workloads were correctly removed from the cluster.
  • Fixed an issue where the Operator could crash when TLS certificates are configured using the certificatesSecretsPrefix field without additional TLS settings.

@nammn nammn added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Jan 7, 2026
@nammn nammn changed the title Fix flaky sharded cluster upgrade/downgrade test in multi-cluster Fix flaky sharded cluster tests Jan 7, 2026
@nammn nammn changed the title Fix flaky sharded cluster tests "Fix" flaky sharded cluster tests Jan 7, 2026
@nammn nammn requested review from lucian-tosa and m1kola January 8, 2026 07:52
def mdb_health_checker(mongo_tester: MongoTester) -> MongoDBBackgroundTester:
return MongoDBBackgroundTester(
mongo_tester,
allowed_sequential_failures=1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert_connectivity method of MongoDBBackgroundTester attempts a write operation. Should we start considering asserting reads instead?
We already expect writes to be down during re-elections, we might as well assert something that we expect to work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm fair point, but at the same time we don't expect writes to be down the whole migration, just for some amount of time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but the issue is that the downtime intervals are not deterministic. There can be one or N re-elections, we won't know until optimize the restart order. So from a testing standpoint, while there are any pods which are restarting, the writes will not be deterministically available

@nammn nammn marked this pull request as draft January 12, 2026 10:35
sts = self.appsv1.read_namespaced_stateful_set(RESOURCE_NAME, self.namespace)
assert sts

@pytest.mark.flaky(reruns=10, reruns_delay=3)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that test is racy

@nammn nammn marked this pull request as ready for review January 22, 2026 16:20
@nammn
Copy link
Collaborator Author

nammn commented Jan 22, 2026

Related JIRA Ticket

CLOUDP-375105 - Investigate root causes of flaky test failures in Evergreen master builds


Context

Analysis of 10 recent Evergreen master builds identified systematic flakiness in the following tests:

Test Failures Variant Root Cause
e2e_sharded_cluster_migration 7 e2e_static_multi_cluster_kind Architecture migration restarts cause extended unavailability
e2e_sharded_cluster_upgrade_downgrade 6 e2e_static_multi_cluster_kind Config server rolling restart (~2 min unavailability window)
e2e_multi_cluster_enable_tls 3 e2e_multi_cluster_kind Cross-cluster coordination failure (cluster-3 pods never healthy)

About These Threshold Increases

⚠️ These allowed_sequential_failures increases are temporary measures while we investigate the root causes:

  • Sharded cluster tests: The config server unavailability during rolling restarts (~2 minutes) is likely expected behavior for version changes. Threshold increases align with actual operation duration.

  • Multi-cluster TLS test: This failure is NOT a timeout issue - pods in cluster-3 never become healthy, blocking TLS transition coordination across all clusters. The agent was stuck at WaitTLSUpdate with 860+ attempts and lastGoalVersionAchieved=-1. This may require operator code fixes rather than threshold adjustments.

Next Steps (tracked in CLOUDP-375105)

  1. Determine if config server rolling restart behavior can be improved (coordinate primary election)
  2. Investigate why cluster-3 pods aren't becoming healthy in multi-cluster deployments
  3. Evaluate if operator code changes are needed vs permanent threshold adjustments

Document the root cause analysis for allowed_sequential_failures values:
- sharded_cluster_upgrade_downgrade: Config server rolling restart ~2 min unavailability
- sharded_cluster_migration: Architecture migration component restarts

These comments link to CLOUDP-375105 which tracks investigation of the
underlying causes and evaluation of potential operator improvements.
@nammn nammn enabled auto-merge (squash) January 22, 2026 17:23
@nammn nammn disabled auto-merge January 22, 2026 17:24
@nammn nammn enabled auto-merge (squash) January 22, 2026 17:24
@nammn nammn merged commit ec95f5e into master Jan 22, 2026
5 of 6 checks passed
@nammn nammn deleted the increase-runs branch January 22, 2026 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-changelog Use this label in Pull Request to not require new changelog entry file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants