Conversation
docker/mongodb-kubernetes-tests/tests/shardedcluster/sharded_cluster_migration.py
def mdb_health_checker(mongo_tester: MongoTester) -> MongoDBBackgroundTester:
    return MongoDBBackgroundTester(
        mongo_tester,
        allowed_sequential_failures=1,
The assert_connectivity method of MongoDBBackgroundTester attempts a write operation. Should we consider asserting reads instead?
We already expect writes to be down during re-elections, so we might as well assert something that we expect to work.
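For illustration only (not part of this PR): a minimal sketch of what a read-based probe could look like, assuming a pymongo connection string is available to the tester; the helper, database, and collection names are hypothetical.

from pymongo import MongoClient, ReadPreference

def assert_read_connectivity(connection_string: str) -> None:
    # Hypothetical read probe: secondaryPreferred reads should keep working
    # while a primary election is in progress.
    client = MongoClient(connection_string, serverSelectionTimeoutMS=5000)
    db = client.get_database("testdb", read_preference=ReadPreference.SECONDARY_PREFERRED)
    # find_one() returns None on an empty collection; the assertion here is only
    # that the read round-trips without a server selection timeout.
    db["healthcheck"].find_one()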
Hmm, fair point, but at the same time we don't expect writes to be down for the whole migration, just for some amount of time.
No, but the issue is that the downtime intervals are not deterministic. There can be one or N re-elections; we won't know until we optimize the restart order. So from a testing standpoint, while any pods are restarting, writes will not be deterministically available.
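Illustrative only: a minimal sketch of how a consecutive-failure budget like allowed_sequential_failures typically behaves (resetting on any success), which is why short, non-deterministic downtime windows can still pass. This is an assumption about the mechanism, not the actual MongoDBBackgroundTester implementation.

import time

def run_health_loop(probe, allowed_sequential_failures: int, poll_interval_s: float = 3.0) -> None:
    # Keep probing; reset the counter on success and only fail once the
    # consecutive-failure budget is exhausted.
    consecutive_failures = 0
    while True:
        try:
            probe()                   # e.g. a write or read against the deployment
            consecutive_failures = 0  # any success resets the budget
        except Exception:
            consecutive_failures += 1
            if consecutive_failures > allowed_sequential_failures:
                raise AssertionError("deployment unavailable for too many consecutive checks")
        time.sleep(poll_interval_s)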
        sts = self.appsv1.read_namespaced_stateful_set(RESOURCE_NAME, self.namespace)
        assert sts


@pytest.mark.flaky(reruns=10, reruns_delay=3)
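For context (assuming the suite uses the pytest-rerunfailures plugin, which provides this marker): @pytest.mark.flaky(reruns=10, reruns_delay=3) retries a failing test up to 10 times with a 3-second pause between attempts before reporting it as failed.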
Related JIRA Ticket

CLOUDP-375105 - Investigate root causes of flaky test failures in Evergreen master builds

Context

Analysis of 10 recent Evergreen master builds identified systematic flakiness in the following tests:
About These Threshold Increases
Next Steps (tracked in CLOUDP-375105)
Document the root cause analysis for the allowed_sequential_failures values:
- sharded_cluster_upgrade_downgrade: config server rolling restart causes ~2 min of unavailability
- sharded_cluster_migration: architecture migration restarts components

These comments link to CLOUDP-375105, which tracks investigation of the underlying causes and evaluation of potential operator improvements.
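A sketch of the kind of in-code rationale comment this refers to; the wording is assumed, not copied from the PR.

# Rationale for allowed_sequential_failures=5 (tracked in CLOUDP-375105):
# the config server rolling restart causes roughly 2 minutes of intermittent
# unavailability, and the architecture migration restarts every component,
# so a single failed 3-second probe is not a reliable failure signal.
def mdb_health_checker(mongo_tester: MongoTester) -> MongoDBBackgroundTester:
    return MongoDBBackgroundTester(
        mongo_tester,
        allowed_sequential_failures=5,
    )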
Summary
Sharded cluster tests were failing during rolling restarts and architecture migrations due to the MongoDBBackgroundTester health check thresholds being too strict for the actual operation durations.

Root Cause Analysis (CLOUDP-375105)
Analysis of 10 recent Evergreen master builds revealed:
- e2e_sharded_cluster_migration
- e2e_sharded_cluster_upgrade_downgrade

Why these failures happen:
MongoDBBackgroundTester polls every 3 seconds and was configured to fail after just 1-5 consecutive errors.

Why threshold increases are appropriate here:
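A rough back-of-the-envelope check using only the 3-second poll interval stated above; the assumption that the failure budget maps directly to tolerated downtime is illustrative.

# Assuming allowed_sequential_failures is the number of consecutive failed
# probes tolerated before the background tester fails the run:
POLL_INTERVAL_S = 3

def tolerated_downtime_s(allowed_sequential_failures: int) -> int:
    return allowed_sequential_failures * POLL_INTERVAL_S

tolerated_downtime_s(1)  # 3 s  - can be exceeded by a single primary election
tolerated_downtime_s(5)  # 15 s - rides out a short election or restart window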
Changes
- sharded_cluster_migration.py: increased allowed_sequential_failures from 1 → 5
- sharded_cluster_upgrade_downgrade.py: kept at 5, added a documentation comment
- replica_set.py: added @pytest.mark.flaky decorator for automatic reruns

Affected Variants
- e2e_static_multi_cluster_kind

Proof of Work
Green PR patch
Checklist
- Added the skip-changelog label if not needed