CLOUDP-347497: Single cluster Replica Set Controller Refactoring #486
Conversation
This reverts commit 073032b.
Builder functions Test works
Early review: I had a quick-ish look at the controllers and couldn't spot anything obviously wrong with it. I didn't review the tests at all.
Looks good so far.
}
}

publishAutomationConfigFirst, err := r.publishAutomationConfigFirstMultiCluster(ctx, &mrs, log)
Brought it closer to where it's actually used
log.Debugw("Marked replica set members as non-voting", "replica set with members", rsMembers)
}

// TODO practice shows that automation agents can get stuck on setting db to "disabled" also it seems that this process
This comment was old (<2022)
// Reconcile reads that state of the cluster for a MongoDbReplicaSet object and makes changes based on the state read
// and what is in the MongoDbReplicaSet.Spec
func (r *ReconcileMongoDbReplicaSet) Reconcile(ctx context.Context, request reconcile.Request) (res reconcile.Result, e error) {
	// === 1. Initial Checks and setup
I added these as helpers for myself at the beginning of the refactoring, to keep track of where I was in the hundreds of lines of the reconcile loop
I'm okay to remove them now if they feel like noise more than useful comments
Looks good to me: I didn't spot any issues.
Left a few comments/suggestions, but I do not consider them blocking.
defaultNamespace = "test-namespace"
)

func TestCreateMongodProcessesFromMongoDB(t *testing.T) {
This test and TestCreateMongodProcessesFromMongoDB_AdditionalConfig seem to be testing om.NewMongodProcess (which already has its own unit tests) and dns.GetDNSNames (which doesn't have its own unit tests).
I think it is better to unit-test om.NewMongodProcess and dns.GetDNSNames separately. Once you have unit tests for these building blocks you wouldn't need to test the integration so thoroughly.
This would also (indirectly) cover WaitForRsAgentsToRegisterByResource and other places where these functions are used.
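For illustration, a minimal table-driven unit test of that kind; buildDNSNames below is a hypothetical stand-in, not the actual signature of dns.GetDNSNames:

package dns_test

import (
	"fmt"
	"testing"
)

// buildDNSNames is a hypothetical stand-in for dns.GetDNSNames: it returns the
// per-member pod FQDNs behind a replica set's headless service.
func buildDNSNames(name, service, namespace, clusterDomain string, members int) []string {
	names := make([]string, 0, members)
	for i := 0; i < members; i++ {
		names = append(names, fmt.Sprintf("%s-%d.%s.%s.svc.%s", name, i, service, namespace, clusterDomain))
	}
	return names
}

func TestBuildDNSNames(t *testing.T) {
	got := buildDNSNames("my-rs", "my-rs-svc", "test-namespace", "cluster.local", 2)
	want := []string{
		"my-rs-0.my-rs-svc.test-namespace.svc.cluster.local",
		"my-rs-1.my-rs-svc.test-namespace.svc.cluster.local",
	}
	if len(got) != len(want) {
		t.Fatalf("expected %d names, got %d", len(want), len(got))
	}
	for i := range want {
		if got[i] != want[i] {
			t.Errorf("name %d: expected %q, got %q", i, want[i], got[i])
		}
	}
}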
agentCertPath: agentCertPath,
agentCertHash: agentCertHash,
currentAgentAuthMode: currentAgentAuthMode,
}
Nit: I think a long list of arguments is better than a struct. It is easier to pass a struct around, but arguments are more explicit: you only pass what you need in a given function. With plain arguments it is easier to spot unused arguments as well.
There is also the open question of why some cert-related values are in the struct, but tlsCertPath and internalClusterCertPath are not.
I would leave the arguments as they are.
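For illustration, a hypothetical sketch of the two shapes being compared here (the names are made up, not the controller's actual API):

// Explicit arguments: each caller states exactly what it passes, and an unused
// parameter is immediately visible in the signature.
func updateDeploymentWithArgs(agentCertPath, agentCertHash, currentAgentAuthMode string) error {
	// ... use the three values directly
	return nil
}

// Options struct: easier to thread through several layers, but it is less obvious
// which fields a given function actually reads, and unused fields can linger.
type deploymentArgs struct {
	AgentCertPath        string
	AgentCertHash        string
	CurrentAgentAuthMode string
}

func updateDeploymentWithStruct(opts deploymentArgs) error {
	// ... read only the fields this function needs
	return nil
}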
I agree with Mikalai here, it is a bit confusing
That's a valid concern, I hesitated and did it mostly to match what we do in the sharded cluster controller:
type deploymentOptions struct {
podEnvVars *env.PodEnvVars
currentAgentAuthMode string
caFilePath string
agentCertPath string
agentCertHash string
certTLSType map[string]bool
finalizing bool
processNames []string
prometheusCertHash string
}
If we decide to keep the struct, I will put more variables into it for sure while working on the unification, for more consistency.
I'm happy to hear the team's opinion about this
LGTM, just a couple of questions
return r.updateStatus(ctx, rs, workflow.OK(), log, mdbstatus.NewBaseUrlOption(deployment.Link(conn.BaseURL(), conn.GroupID())), mdbstatus.MembersOption(rs), mdbstatus.NewPVCsStatusOptionEmptyStatus())
}

func publishAutomationConfigFirstRS(ctx context.Context, getter kubernetesClient.Client, mdb mdbv1.MongoDB, lastSpec *mdbv1.MongoDbSpec, currentAgentAuthMode string, sslMMSCAConfigMap string, log *zap.SugaredLogger) bool {
Why define a new function here? There is a very similar one in the common controller.
-if shouldMirrorKeyfileForMongot {
+if shouldMirrorKeyfile {
nit: Why did you rename this variable? It might be a bit too vague now
I would say that even this is too long; something like applyOverrides would be sufficient.
If you feel that this would not make the purpose of the variable clear enough, it is because the function is too big and a lot is going on here. It should ideally be broken into smaller chunks called by an orchestrator function so that the flow of logic is easier to follow. I have posted a comment above about this.
Left some thoughts on further refactoring and a couple of nits.
prometheusCertHash, err := certs.EnsureTLSCertsForPrometheus(ctx, r.SecretClient, rs.GetNamespace(), rs.GetPrometheus(), certs.Database, log)
if err != nil {
	log.Infof("Could not generate certificates for Prometheus: %s", err)
	return r.updateStatus(ctx, rs, workflow.Failed(err), log)
Suggested change:
-return r.updateStatus(ctx, rs, workflow.Failed(err), log)
+return r.updateStatus(ctx, rs, workflow.Failed(xerrors.Errorf("could not generate certificates for Prometheus: %w", err)), log)
Other errors are more descriptive; would it be better to clarify this one too?
databaseContainer := container.GetByName(util.DatabaseContainerName, currentSts.Spec.Template.Spec.Containers)
volumeMounts := databaseContainer.VolumeMounts

if !mdb.Spec.Security.IsTLSEnabled() && wasTLSSecretMounted(ctx, getter, currentSts, mdb, log) {
We probably need a nil check on .Security
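Something along these lines, assuming Security is a nillable pointer on the spec (hypothetical sketch, reusing the identifiers from the hunk above):

// Guard against a nil Security block before checking TLS so the reconcile
// does not panic when the field is omitted from the spec.
tlsEnabled := mdb.Spec.Security != nil && mdb.Spec.Security.IsTLSEnabled()
if !tlsEnabled && wasTLSSecretMounted(ctx, getter, currentSts, mdb, log) {
	// ... handle the TLS-to-non-TLS transition as before
}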
// updateOmDeploymentRs performs OM registration operation for the replicaset. So the changes will be finally propagated
// to automation agents in containers
-func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(ctx context.Context, conn om.Connection, membersNumberBefore int, rs *mdbv1.MongoDB, set appsv1.StatefulSet, log *zap.SugaredLogger, agentCertPath, caFilePath, tlsCertPath, internalClusterCertPath string, prometheusCertHash string, isRecovering bool, shouldMirrorKeyfileForMongot bool) workflow.Status {
+func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(ctx context.Context, conn om.Connection, membersNumberBefore int, rs *mdbv1.MongoDB, log *zap.SugaredLogger, tlsCertPath, internalClusterCertPath string, deploymentOptionsRS deploymentOptionsRS, shouldMirrorKeyfile bool, isRecovering bool) workflow.Status {
updateOmDeploymentRs is carrying a lot of responsibilities: waiting for agents, TLS disable coordination, replica-set building, authentication sync, automation-config reconciliation, and cleanup.
func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(
ctx context.Context,
conn om.Connection,
membersBefore int,
rs *mdbv1.MongoDB,
log *zap.SugaredLogger,
tlsCertPath, internalClusterCertPath string,
opts deploymentOptionsRS,
shouldMirrorKeyfile, isRecovering bool,
) workflow.Status {
inputs, status := r.prepareOmDeploymentInputs(ctx, conn, membersBefore, rs, tlsCertPath, opts, shouldMirrorKeyfile, isRecovering, log)
if !status.IsOK() {
return status
}
if status := r.applyReplicaSetAutomation(ctx, conn, rs, inputs, internalClusterCertPath, log); !status.IsOK() {
return status
}
return r.finalizeOmDeployment(ctx, conn, rs, membersBefore, inputs, internalClusterCertPath, isRecovering, log)
}
Each helper focuses on one concern:
type omDeploymentInputs struct {
replicasTarget int
replicaSet replicaset.ReplicaSet
processNames []string
prometheusConfiguration PrometheusConfiguration
additionalReconcileNeeded bool
}
func (r *ReconcileMongoDbReplicaSet) prepareOmDeploymentInputs(...) (omDeploymentInputs, workflow.Status) { /* wait for agents, handle TLS disable, build replicaSet */ }
func (r *ReconcileMongoDbReplicaSet) applyReplicaSetAutomation(...) workflow.Status { /* auth reconciliation + ReadUpdateDeployment */ }
func (r *ReconcileMongoDbReplicaSet) finalizeOmDeployment(...) workflow.Status { /* wait for ready, log rotate, host diff, backup */ }
You can also use parameters instead of omDeploymentInputs
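For example, the prepare step could return plain values instead of the struct (same hypothetical helpers and types as in the sketch above):

func (r *ReconcileMongoDbReplicaSet) prepareOmDeploymentInputs(
	// ... same parameters as in the sketch above
) (replicaSet replicaset.ReplicaSet, processNames []string, prometheusConfiguration PrometheusConfiguration, additionalReconcileNeeded bool, status workflow.Status) {
	// wait for agents, handle the TLS-disable case, build the replica set, then
	// return the individual values instead of bundling them into omDeploymentInputs
	return
}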
CLOUDP-347497 - Single cluster Replica Set Controller Refactoring
Why this refactoring
The single-cluster RS controller was mixing two concerns:
This worked fine for single-cluster, but it's a problem when you think about multi-cluster:
So we need to separate these layers properly.
Main changes
1. Broke down the huge Reconcile() method
Before: ~300 lines of inline logic in Reconcile()
Now: the flow is split into focused helper functions (e.g. reconcileMemberResources() and updateOmDeploymentRs()), with numbered section comments marking each phase of the loop.
This makes it way easier to understand what's happening and matches the multi-cluster controller structure.
2. Removed StatefulSet dependency from OM operations
Created new helper functions that work directly with MongoDB resources instead of StatefulSets:
CreateMongodProcessesFromMongoDB() - was using a StatefulSet before
BuildFromMongoDBWithReplicas() - same
WaitForRsAgentsToRegisterByResource() - same
These mirror the existing ...FromStatefulSet functions but take MongoDB resources instead.
Why it matters: The OM layer now only cares about the MongoDB resource definition, not how it's deployed in K8s. This makes the code work the same way for both single-cluster and multi-cluster.
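As a rough sketch of the difference (hypothetical signatures, not the exact ones in the PR): the StatefulSet-based shape needs a live Kubernetes object, while the MongoDB-based shape derives the same information from the resource spec alone, which is what makes it usable for both topologies.

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// Hypothetical StatefulSet-based helper: hostnames come from the live object.
func processHostnamesFromStatefulSet(sts appsv1.StatefulSet, clusterDomain string) []string {
	hostnames := make([]string, 0, int(*sts.Spec.Replicas))
	for i := int32(0); i < *sts.Spec.Replicas; i++ {
		hostnames = append(hostnames, fmt.Sprintf("%s-%d.%s.%s.svc.%s",
			sts.Name, i, sts.Spec.ServiceName, sts.Namespace, clusterDomain))
	}
	return hostnames
}

// Hypothetical MongoDB-based helper: the same hostnames are derived purely from
// the resource's name, namespace and member count.
func processHostnamesFromMongoDB(name, namespace, clusterDomain string, members int) []string {
	hostnames := make([]string, 0, members)
	for i := 0; i < members; i++ {
		hostnames = append(hostnames, fmt.Sprintf("%s-%d.%s-svc.%s.svc.%s",
			name, i, name, namespace, clusterDomain))
	}
	return hostnames
}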
3. Added publishAutomationConfigFirstRS checks
Dedicated function for RS instead of using the shared one; it does not rely on a StatefulSet object.
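Usage ends up being a plain boolean decision early in the OM update, roughly like this (simplified sketch; only the function signature quoted in the diff above comes from the PR, the surrounding field and variable names are assumptions):

publishFirst := publishAutomationConfigFirstRS(ctx, r.client, *rs, lastSpec, currentAgentAuthMode, sslMMSCAConfigMap, log)
if publishFirst {
	// push the automation config to Ops Manager before updating the StatefulSet
} else {
	// default ordering: update the Kubernetes resources first, then the automation config
}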
Important for review
The ideal way to review this PR is to compare the new structure to mongodbmultireplicaset_controller.go. The aim of the refactoring is to get the single-cluster controller closer to it.
Look at:
reconcileMemberResources() in both controllers - similar structure now
updateOmDeploymentRs() - no more StatefulSet dependency
om/process and om/replicaset - mirror existing patterns
Bug found along the way
The logic to handle scale up + disable TLS at the same time doesn't actually work properly and should be blocked by validation (see draft PR #490 for more details).
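A validation of that kind could look roughly like this (a hypothetical sketch; the actual check is in draft PR #490):

// Hypothetical sketch: reject a spec update that both scales the replica set up
// and disables TLS, since the two transitions cannot be coordinated safely in one step.
func validateScaleUpWithTLSDisable(oldMembers, newMembers int, oldTLSEnabled, newTLSEnabled bool) error {
	if newMembers > oldMembers && oldTLSEnabled && !newTLSEnabled {
		return fmt.Errorf("cannot scale up from %d to %d members and disable TLS in the same change; apply the two changes separately", oldMembers, newMembers)
	}
	return nil
}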
Tests added
Added tests for the new helper functions:
TestCreateMongodProcessesFromMongoDB - basic scenarios, scaling, custom domains, TLS, additional config
TestBuildFromMongoDBWithReplicas - integration test checking ReplicaSet structure and member options propagation
TestPublishAutomationConfigFirstRS - automation config publish logic with various TLS/auth scenarios