Conversation
@Julien-Ben Julien-Ben commented Oct 1, 2025

CLOUDP-347497 - Single cluster Replica Set Controller Refactoring

Why this refactoring

The single-cluster RS controller was mixing two concerns:

  • Kubernetes concerns (StatefulSets, pods, volumes)
  • Ops Manager/MongoDB concerns (MongoDB processes, replication config)

This worked fine for single-cluster, but it's a problem when you think about multi-cluster:

  • Multi-cluster has multiple StatefulSets (one per cluster) but only one logical ReplicaSet in Ops Manager
  • The OM automation config doesn't care about how many K8s clusters you have or how the pods are deployed

So we need to separate these layers properly.

Main changes

1. Broke down the huge Reconcile() method

Before: ~300 lines of inline logic in Reconcile()

Now:

Reconcile()
  ├── reconcileMemberResources()        // Handles all K8s resource creation
  │   ├── reconcileHostnameOverrideConfigMap()
  │   ├── ensureRoles()
  │   └── reconcileStatefulSet()        // StatefulSet-specific logic isolated here
  │       └── buildStatefulSetOptions() // Builds STS configuration
  └── updateOmDeploymentRs()            // Handles Ops Manager automation config updates

This makes it way easier to understand what's happening and matches the multi-cluster controller structure.
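
Roughly, the new shape of the reconcile loop (a condensed sketch with simplified signatures and error handling elided, not the literal code):

func (r *ReconcileMongoDbReplicaSet) Reconcile(ctx context.Context, request reconcile.Request) (reconcile.Result, error) {
	// 1. Fetch the MongoDB resource, validate the spec, set up the OM connection (elided).
	// ...

	// 2. Create/update everything on the Kubernetes side: ConfigMaps, roles, StatefulSet.
	if status := r.reconcileMemberResources(ctx, rs, conn, log); !status.IsOK() {
		return r.updateStatus(ctx, rs, status, log)
	}

	// 3. Push the desired replica set into the Ops Manager automation config.
	if status := r.updateOmDeploymentRs(ctx, conn, membersNumberBefore, rs, log /* ... */); !status.IsOK() {
		return r.updateStatus(ctx, rs, status, log)
	}

	return r.updateStatus(ctx, rs, workflow.OK(), log)
}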

2. Removed StatefulSet dependency from OM operations

Created new helper functions that work directly with MongoDB resources instead of StatefulSets:

  • CreateMongodProcessesFromMongoDB() - previously required a StatefulSet
  • BuildFromMongoDBWithReplicas() - same
  • WaitForRsAgentsToRegisterByResource() - same

These mirror the existing ...FromStatefulSet functions but take MongoDB resources instead.

Why it matters: The OM layer now only cares about the MongoDB resource definition, not how it's deployed in K8s. This makes the code work the same way for both single-cluster and multi-cluster.
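
The shift in the call sites looks roughly like this (illustrative only, with parameter lists trimmed):

// Before: hostnames and member count were derived from the StatefulSet.
processes := CreateMongodProcessesFromStatefulSet(set, rs /* ... */)

// After: everything is derived from the MongoDB resource itself.
processes := CreateMongodProcessesFromMongoDB(rs, members /* ... */)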

3. Added publishAutomationConfigFirstRS checks

A dedicated function for replica sets instead of the shared one; it does not rely on a StatefulSet object.
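
For reference, the new function takes the MongoDB resource and its last applied spec instead of a StatefulSet (the signature appears in the diff further down). A simplified skeleton of the kind of decision such a function makes (illustrative only; the real checks are in the diff):

func publishAutomationConfigFirstRS(ctx context.Context, getter kubernetesClient.Client, mdb mdbv1.MongoDB,
	lastSpec *mdbv1.MongoDbSpec, currentAgentAuthMode string, sslMMSCAConfigMap string, log *zap.SugaredLogger) bool {
	// Illustrative: publishing the automation config before rolling the StatefulSet
	// is typically needed when a change (e.g. disabling TLS or switching agent auth)
	// must reach the agents before the pods are restarted.
	tlsWasEnabled := lastSpec != nil && lastSpec.Security != nil && lastSpec.Security.IsTLSEnabled()
	tlsIsEnabled := mdb.Spec.Security != nil && mdb.Spec.Security.IsTLSEnabled()
	return tlsWasEnabled && !tlsIsEnabled
}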

Important for review

The ideal way to review this PR is to compare the new structure to mongodbmultireplicaset_controller.go. The aim of the refactoring is to bring the single-cluster controller closer to it.

Look at:

  • reconcileMemberResources() in both controllers - similar structure now
  • updateOmDeploymentRs() - no more StatefulSet dependency
  • New helper functions in om/process and om/replicaset - mirror existing patterns

Bug found along the way

The logic for handling a scale-up and disabling TLS at the same time doesn't actually work properly and should be blocked by validation (see draft PR #490 for more details).

Tests added

Added tests for the new helper functions:

  • TestCreateMongodProcessesFromMongoDB - basic scenarios, scaling, custom domains, TLS, additional config
  • TestBuildFromMongoDBWithReplicas - integration test checking ReplicaSet structure and member options propagation
  • TestPublishAutomationConfigFirstRS - automation config publish logic with various TLS/auth scenarios


github-actions bot commented Oct 1, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on the current master branch)

MCK 1.5.0 Release Notes

New Features

  • Improve automation agent certificate rotation: the agent now restarts automatically when its certificate is renewed, allowing seamless certificate updates without manual Pod restarts.

Bug Fixes

  • MongoDBMultiCluster: fix resource stuck in Pending state if any clusterSpecList item has 0 members. After the fix, a value of 0 members is handled correctly, similarly to how it's done in the MongoDB resource.

@Julien-Ben Julien-Ben added the skip-changelog label Oct 7, 2025

Contributor

@m1kola m1kola left a comment
Early review: I had a quick-ish look at the controllers and couldn't spot anything obviously wrong with them. I didn't review the tests at all.

Looks good so far.

}
}

publishAutomationConfigFirst, err := r.publishAutomationConfigFirstMultiCluster(ctx, &mrs, log)
Collaborator Author

Brought it closer to where it's actually used

log.Debugw("Marked replica set members as non-voting", "replica set with members", rsMembers)
}

// TODO practice shows that automation agents can get stuck on setting db to "disabled" also it seems that this process
Collaborator Author

This comment was old (<2022)

@Julien-Ben Julien-Ben marked this pull request as ready for review October 7, 2025 16:27
@Julien-Ben Julien-Ben requested a review from a team as a code owner October 7, 2025 16:27
// Reconcile reads that state of the cluster for a MongoDbReplicaSet object and makes changes based on the state read
// and what is in the MongoDbReplicaSet.Spec
func (r *ReconcileMongoDbReplicaSet) Reconcile(ctx context.Context, request reconcile.Request) (res reconcile.Result, e error) {
// === 1. Initial Checks and setup
Collaborator Author

I added these as helpers for myself at the beginning of the refactoring, to keep track of where I was in the hundreds of lines of the reconcile loop.

I'm okay to remove them now if they feel like noise more than useful comments.

@Julien-Ben Julien-Ben changed the title from [DO NOT REVIEW] WIP to CLOUDP-347497: Single cluster Replica Set Controller Refactoring Oct 8, 2025
Contributor

@m1kola m1kola left a comment

Looks good to me: I didn't spot any issues.

Left a few comments/suggestions, but I do not consider them blocking.

defaultNamespace = "test-namespace"
)

func TestCreateMongodProcessesFromMongoDB(t *testing.T) {
Contributor

This test and TestCreateMongodProcessesFromMongoDB_AdditionalConfig seem to be testing om.NewMongodProcess (which already has its own unit tests) and dns.GetDNSNames (which doesn't).

I think it is better to unit-test om.NewMongodProcess and dns.GetDNSNames separately. Once you have unit tests for these building blocks, you wouldn't need to test the integration so thoroughly.

This would also (indirectly) cover WaitForRsAgentsToRegisterByResource and other places where these functions are used.
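
For example, something table-driven along these lines (the GetDNSNames signature here is hypothetical, just to illustrate the idea):

func TestGetDNSNames(t *testing.T) {
	// Assumes GetDNSNames(stsName, svcName, namespace, clusterDomain string, replicas int, externalDomain *string).
	hostnames, names := dns.GetDNSNames("my-rs", "my-rs-svc", "my-ns", "cluster.local", 2, nil)
	assert.Equal(t, []string{"my-rs-0", "my-rs-1"}, names)
	assert.Equal(t, []string{
		"my-rs-0.my-rs-svc.my-ns.svc.cluster.local",
		"my-rs-1.my-rs-svc.my-ns.svc.cluster.local",
	}, hostnames)
}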

agentCertPath: agentCertPath,
agentCertHash: agentCertHash,
currentAgentAuthMode: currentAgentAuthMode,
}
Contributor

Nit: I think a long list of arguments is better than a struct. It is easier to pass a struct around, but arguments are more explicit: you only pass what you need in a given function. With plain arguments it is easier to spot unused arguments as well.

It also avoids questions like: why is some cert-related state in the struct, while tlsCertPath and internalClusterCertPath are not?

I would leave the arguments as is.

Contributor

I agree with Mikalai here, it is a bit confusing

Collaborator Author

That's a valid concern. I hesitated, and did it mostly to match what we do in the sharded cluster controller:

type deploymentOptions struct {
	podEnvVars           *env.PodEnvVars
	currentAgentAuthMode string
	caFilePath           string
	agentCertPath        string
	agentCertHash        string
	certTLSType          map[string]bool
	finalizing           bool
	processNames         []string
	prometheusCertHash   string
}

If we decide to keep the struct, I will definitely move more variables into it while working on the unification, for consistency.

I'm happy to hear the team's opinion about this

Contributor

@lucian-tosa lucian-tosa left a comment

LGTM, just a couple of questions


return r.updateStatus(ctx, rs, workflow.OK(), log, mdbstatus.NewBaseUrlOption(deployment.Link(conn.BaseURL(), conn.GroupID())), mdbstatus.MembersOption(rs), mdbstatus.NewPVCsStatusOptionEmptyStatus())
}

func publishAutomationConfigFirstRS(ctx context.Context, getter kubernetesClient.Client, mdb mdbv1.MongoDB, lastSpec *mdbv1.MongoDbSpec, currentAgentAuthMode string, sslMMSCAConfigMap string, log *zap.SugaredLogger) bool {
Contributor

Why define a new function here? There is a very similar one in the common controller.

Comment on lines -501 to +630
if shouldMirrorKeyfileForMongot {
if shouldMirrorKeyfile {
Contributor

nit: Why did you rename this variable? It might be a bit too vague now

Contributor

I would say that even this is too long; something like applyOverrides would be sufficient.
If you feel that this would not make the purpose of the variable clear enough, it is because the function is too big and a lot is going on here.
It should ideally be broken into smaller chunks called by an orchestrator function, so that the flow of logic is easier to follow.

I have posted a comment above about this.

https://go.dev/wiki/CodeReviewComments#variable-names

Contributor

@anandsyncs anandsyncs left a comment

Left some thoughts on further refactoring and a couple of nits.

prometheusCertHash, err := certs.EnsureTLSCertsForPrometheus(ctx, r.SecretClient, rs.GetNamespace(), rs.GetPrometheus(), certs.Database, log)
if err != nil {
log.Infof("Could not generate certificates for Prometheus: %s", err)
return r.updateStatus(ctx, rs, workflow.Failed(err), log)
Contributor

Suggested change
return r.updateStatus(ctx, rs, workflow.Failed(err), log)
return r.updateStatus(ctx, rs, workflow.Failed(xerrors.Errorf("could not generate certificates for Prometheus: %w", err)), log)

Other errors are more descriptive; is it better to clarify this one too?

databaseContainer := container.GetByName(util.DatabaseContainerName, currentSts.Spec.Template.Spec.Containers)
volumeMounts := databaseContainer.VolumeMounts

if !mdb.Spec.Security.IsTLSEnabled() && wasTLSSecretMounted(ctx, getter, currentSts, mdb, log) {
Contributor

We probably need a nil check on .Security
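
e.g. a guard along these lines (just a sketch):

	security := mdb.Spec.Security
	if (security == nil || !security.IsTLSEnabled()) && wasTLSSecretMounted(ctx, getter, currentSts, mdb, log) {
		// TLS is disabled (or was never configured) but the secret volume is still mounted.
	}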

// updateOmDeploymentRs performs OM registration operation for the replicaset. So the changes will be finally propagated
// to automation agents in containers
func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(ctx context.Context, conn om.Connection, membersNumberBefore int, rs *mdbv1.MongoDB, set appsv1.StatefulSet, log *zap.SugaredLogger, agentCertPath, caFilePath, tlsCertPath, internalClusterCertPath string, prometheusCertHash string, isRecovering bool, shouldMirrorKeyfileForMongot bool) workflow.Status {
func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(ctx context.Context, conn om.Connection, membersNumberBefore int, rs *mdbv1.MongoDB, log *zap.SugaredLogger, tlsCertPath, internalClusterCertPath string, deploymentOptionsRS deploymentOptionsRS, shouldMirrorKeyfile bool, isRecovering bool) workflow.Status {
Contributor

updateOmDeploymentRs is carrying a lot of responsibilities: waiting for agents, TLS disable coordination, replica-set building, authentication sync, automation-config reconciliation, and cleanup.
 

func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(
	ctx context.Context,
	conn om.Connection,
	membersBefore int,
	rs *mdbv1.MongoDB,
	log *zap.SugaredLogger,
	tlsCertPath, internalClusterCertPath string,
	opts deploymentOptionsRS,
	shouldMirrorKeyfile, isRecovering bool,
) workflow.Status {
	inputs, status := r.prepareOmDeploymentInputs(ctx, conn, membersBefore, rs, tlsCertPath, opts, shouldMirrorKeyfile, isRecovering, log)
	if !status.IsOK() {
		return status
	}

	if status := r.applyReplicaSetAutomation(ctx, conn, rs, inputs, internalClusterCertPath, log); !status.IsOK() {
		return status
	}

	return r.finalizeOmDeployment(ctx, conn, rs, membersBefore, inputs, internalClusterCertPath, isRecovering, log)
}

 
Each helper focuses on one concern:
 

type omDeploymentInputs struct {
	replicasTarget            int
	replicaSet                replicaset.ReplicaSet
	processNames              []string
	prometheusConfiguration   PrometheusConfiguration
	additionalReconcileNeeded bool
}

func (r *ReconcileMongoDbReplicaSet) prepareOmDeploymentInputs(...) (omDeploymentInputs, workflow.Status) { /* wait for agents, handle TLS disable, build replicaSet */ }

func (r *ReconcileMongoDbReplicaSet) applyReplicaSetAutomation(...) workflow.Status { /* auth reconciliation + ReadUpdateDeployment */ }

func (r *ReconcileMongoDbReplicaSet) finalizeOmDeployment(...) workflow.Status { /* wait for ready, log rotate, host diff, backup */ }

Contributor

You can also use parameters instead of omDeploymentInputs
