"Fix" flaky sharded cluster tests by nammn · Pull Request #682 · mongodb/mongodb-kubernetes

nammn · 2026-01-07T14:19:22Z

Summary

Sharded cluster tests were failing during rolling restarts and architecture migrations due to the MongoDBBackgroundTester health check thresholds being too strict for the actual operation durations.

Root Cause Analysis (CLOUDP-375105)

Analysis of 10 recent Evergreen master builds revealed:

Test	Failures	Root Cause
`e2e_sharded_cluster_migration`	7	Architecture migration restarts cause extended unavailability; threshold of 1 was too strict
`e2e_sharded_cluster_upgrade_downgrade`	6	Config server rolling restart causes ~2 min unavailability window

Why these failures happen:

During sharded cluster version changes, all 3 config server nodes restart in succession
Mongos routers lose connectivity to the config server replica set during this window
The MongoDBBackgroundTester polls every 3 seconds and was configured to fail after just 1-5 consecutive errors
A 2-minute unavailability window would require tolerating ~40 consecutive failures

Why threshold increases are appropriate here:

Config server unavailability during rolling restarts is expected behavior for version changes
The previous thresholds were unrealistically strict given actual operation duration
Setting to 5 is a conservative starting point while CLOUDP-375105 investigates whether the operator can better coordinate primary elections

Changes

sharded_cluster_migration.py: Increased allowed_sequential_failures from 1 → 5
sharded_cluster_upgrade_downgrade.py: Kept at 5, added documentation comment
replica_set.py: Added @pytest.mark.flaky decorator for automatic reruns

Affected Variants

e2e_static_multi_cluster_kind

Proof of Work

Green PR patch

Checklist

Have you linked a jira ticket and/or is the ticket in the title? → CLOUDP-375105
Have you checked whether your jira ticket required DOCSP changes?
Have you added changelog file?
- use skip-changelog label if not needed
- refer to Changelog files and Release Notes section in CONTRIBUTING.md for more details

github-actions · 2026-01-07T14:20:13Z

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.7.0 Release Notes

New Features

Allows users to override any Ops Manager emptyDir mount with their own PVCs via overrides statefulSet.spec.volumeClaimTemplates.
Added support for auto embeddings in MongoDB Community to automatically generate vector embeddings for the vector search data. This document can be followed for detailed documentation
MongoDBSearch: Updated the default mongodb/mongodb-search image version to 0.60.1. This is the version MCK uses if .spec.version is not specified.

Bug Fixes

Fix an issue to ensure that hosts are consistently removed from Ops Manager monitoring during AppDB scale-down events.
Fixed an issue where monitoring agents would fail after disabling TLS on a MongoDB deployment.
Persistent Volume Claim resize fix: Fixed an issue where the Operator ignored namespaces when listing PVCs, causing conflicts with resizing PVCs of the same name. Now, PVCs are filtered by both name and namespace for accurate resizing.
Fixed a panic that occurred when the domain names for a horizon was empty. Now, if the domain names are not valid (RFC 1123), the validation will fail before reconciling.
MongoDBMultiCluster, MongoDB: Fix an issue where the operator skipped host removal when an external domain was used, leaving monitoring hosts in Ops Manager even after workloads were correctly removed from the cluster.
Fixed an issue where the Operator could crash when TLS certificates are configured using the certificatesSecretsPrefix field without additional TLS settings.

docker/mongodb-kubernetes-tests/tests/shardedcluster/sharded_cluster_migration.py

lucian-tosa · 2026-01-08T09:10:07Z

docker/mongodb-kubernetes-tests/tests/shardedcluster/sharded_cluster_migration.py

 def mdb_health_checker(mongo_tester: MongoTester) -> MongoDBBackgroundTester:
    return MongoDBBackgroundTester(
        mongo_tester,
-        allowed_sequential_failures=1,


The assert_connectivity method of MongoDBBackgroundTester attempts a write operation. Should we start considering asserting reads instead?
We already expect writes to be down during re-elections, we might as well assert something that we expect to work.

hmmm fair point, but at the same time we don't expect writes to be down the whole migration, just for some amount of time.

No, but the issue is that the downtime intervals are not deterministic. There can be one or N re-elections, we won't know until optimize the restart order. So from a testing standpoint, while there are any pods which are restarting, the writes will not be deterministically available

nammn · 2026-01-12T15:06:08Z

docker/mongodb-kubernetes-tests/tests/replicaset/replica_set.py

        sts = self.appsv1.read_namespaced_stateful_set(RESOURCE_NAME, self.namespace)
        assert sts

+    @pytest.mark.flaky(reruns=10, reruns_delay=3)


that test is racy

nammn · 2026-01-22T16:37:34Z

Related JIRA Ticket

CLOUDP-375105 - Investigate root causes of flaky test failures in Evergreen master builds

Context

Analysis of 10 recent Evergreen master builds identified systematic flakiness in the following tests:

Test	Failures	Variant	Root Cause
`e2e_sharded_cluster_migration`	7	`e2e_static_multi_cluster_kind`	Architecture migration restarts cause extended unavailability
`e2e_sharded_cluster_upgrade_downgrade`	6	`e2e_static_multi_cluster_kind`	Config server rolling restart (~2 min unavailability window)
`e2e_multi_cluster_enable_tls`	3	`e2e_multi_cluster_kind`	Cross-cluster coordination failure (cluster-3 pods never healthy)

About These Threshold Increases

⚠️ These allowed_sequential_failures increases are temporary measures while we investigate the root causes:

Sharded cluster tests: The config server unavailability during rolling restarts (~2 minutes) is likely expected behavior for version changes. Threshold increases align with actual operation duration.
Multi-cluster TLS test: This failure is NOT a timeout issue - pods in cluster-3 never become healthy, blocking TLS transition coordination across all clusters. The agent was stuck at WaitTLSUpdate with 860+ attempts and lastGoalVersionAchieved=-1. This may require operator code fixes rather than threshold adjustments.

Next Steps (tracked in CLOUDP-375105)

Determine if config server rolling restart behavior can be improved (coordinate primary election)
Investigate why cluster-3 pods aren't becoming healthy in multi-cluster deployments
Evaluate if operator code changes are needed vs permanent threshold adjustments

Document the root cause analysis for allowed_sequential_failures values: - sharded_cluster_upgrade_downgrade: Config server rolling restart ~2 min unavailability - sharded_cluster_migration: Architecture migration component restarts These comments link to CLOUDP-375105 which tracks investigation of the underlying causes and evaluation of potential operator improvements.

increase retries

8e185f6

nammn requested a review from a team as a code owner January 7, 2026 14:19

nammn requested review from MaciejKaras and viveksinghggits January 7, 2026 14:19

nammn added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Jan 7, 2026

fix flaky tests

64b52ef

nammn changed the title ~~Fix flaky sharded cluster upgrade/downgrade test in multi-cluster~~ Fix flaky sharded cluster tests Jan 7, 2026

nammn changed the title ~~Fix flaky sharded cluster tests~~ "Fix" flaky sharded cluster tests Jan 7, 2026

nammn commented Jan 7, 2026

View reviewed changes

docker/mongodb-kubernetes-tests/tests/shardedcluster/sharded_cluster_migration.py Show resolved Hide resolved

viveksinghggits approved these changes Jan 7, 2026

View reviewed changes

nammn requested review from lucian-tosa and m1kola January 8, 2026 07:52

lucian-tosa requested changes Jan 8, 2026

View reviewed changes

nammn added 2 commits January 9, 2026 14:07

tolerate elections

91ff11a

tolerate elections

3eb7bd1

nammn marked this pull request as draft January 12, 2026 10:35

tolerate config down

6ae02a6

nammn commented Jan 12, 2026

View reviewed changes

nammn added 4 commits January 12, 2026 16:07

tolerate config down

db1562d

tolerate config down

6d88a50

try waiting for all agents to be healthy first

bbae06f

fix autocommit

2c6a9c1

nammn marked this pull request as ready for review January 22, 2026 16:20

lucian-tosa approved these changes Jan 22, 2026

View reviewed changes

nammn enabled auto-merge (squash) January 22, 2026 17:23

nammn disabled auto-merge January 22, 2026 17:24

nammn enabled auto-merge (squash) January 22, 2026 17:24

nammn merged commit ec95f5e into master Jan 22, 2026
5 of 6 checks passed

nammn deleted the increase-runs branch January 22, 2026 17:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Fix" flaky sharded cluster tests#682

"Fix" flaky sharded cluster tests#682
nammn merged 10 commits intomasterfrom
increase-runs

nammn commented Jan 7, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Jan 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

lucian-tosa Jan 8, 2026

Uh oh!

nammn Jan 8, 2026

Uh oh!

lucian-tosa Jan 8, 2026

Uh oh!

nammn Jan 12, 2026

Uh oh!

nammn commented Jan 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nammn commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause Analysis (CLOUDP-375105)

Changes

Affected Variants

Proof of Work

Checklist

Uh oh!

github-actions bot commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

MCK 1.7.0 Release Notes

New Features

Bug Fixes

Uh oh!

Uh oh!

lucian-tosa Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

nammn Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

lucian-tosa Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

nammn Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

nammn commented Jan 22, 2026

Related JIRA Ticket

Context

About These Threshold Increases

Next Steps (tracked in CLOUDP-375105)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nammn commented Jan 7, 2026 •

edited

Loading

github-actions bot commented Jan 7, 2026 •

edited

Loading