Multi-rack support #192

Merged
34 commits merged into instaclustr:superdupertopsecretrewrite on Aug 21, 2019

Conversation

@alourie (Contributor) commented Jul 5, 2019

First MVP.

Seems to be working OK for creating clusters; should be OK for scaling up, but probably not for scaling down.

* Change project structure to follow the same pattern as the CDC/Backup controllers
* Refactor code for better naming and convenience
* Add StatefulSet handling
* Add stub backup code
* Add a backupType field to BackupOperation on the Java/Golang sidecar

Signed-off-by: Alex Lourie <[email protected]>
* Support multi-racks
* Probably doesn't support scaling down properly

Signed-off-by: Alex Lourie <[email protected]>
@alourie (author) commented Jul 5, 2019

I'll squash this when merging, don't worry

@alourie (author) commented Jul 25, 2019

Should be ok now for deploying and scaling, as well as some basic e2e testing.

@zegelin (Contributor) commented Jul 26, 2019

Some general comments:

  • Instead of a ConfigMap per rack (yuck!), let's implement a custom Snitch class that can somehow infer the rack and DC from the environment.

  • Some places deal with Rack structs, others with map[string]int32, and others with plain string. Let's switch everything to Rack for consistency (see the sketch after this list).

  • Maybe I missed it somewhere, but if n-racks = n-statefulsets, there should be a loop somewhere that iterates over each rack and tries to reconcile its statefulset, no?
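
On the second bullet, a minimal sketch of what a unified Rack type could look like. The package placement and field names (Name, Replicas, NodeLabels) are assumptions for illustration, not the types actually introduced in this PR:

```go
package cassandradatacenter

// Rack is a sketch of a single type that could replace the mix of
// map[string]int32 and plain-string usages. Field names are illustrative.
type Rack struct {
	Name       string            // rack name, e.g. "rack1"
	Replicas   int32             // desired number of Cassandra pods in this rack
	NodeLabels map[string]string // node labels pinning the rack to a failure domain
}

// Racks is a helper collection so callers pass []Rack around instead of maps.
type Racks []Rack

// Find returns the rack with the given name, if present.
func (rs Racks) Find(name string) (Rack, bool) {
	for _, r := range rs {
		if r.Name == name {
			return r, true
		}
	}
	return Rack{}, false
}
```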

pkg/controller/cassandradatacenter/configmap.go (outdated review thread, resolved)
pkg/controller/cassandradatacenter/reconciler.go (outdated review thread, resolved)
}

// fine, so we have a mismatch. Will seek to reconcile.
rackSpec, err := getRackSpec(rctx, request)

@zegelin (Contributor):
Looking into this more -- we only reconcile one Rack per reconcile? What triggers the additional reconciles?

@alourie (author):
Yes, we now only reconcile one rack per reconcile. You are correct that at the moment it doesn't support config changes/upgrades, but scaling up/down will work because k8s will keep reconciling until the total number of pods matches the spec.
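
To make that concrete, here is a small, self-contained sketch of picking the single rack to reconcile on each pass; Kubernetes re-queues the request, so later reconciles pick up the remaining racks. This is an illustration of the idea, not the PR's actual code, and the Rack type and helper name are assumptions:

```go
package main

import "fmt"

// Rack describes the desired state of one rack (illustrative, not the PR's type).
type Rack struct {
	Name     string
	Replicas int32
}

// pickRackToReconcile returns the first rack whose currently observed
// StatefulSet replica count differs from the desired count, or false when
// every rack already matches. The controller would then create or resize only
// that rack's StatefulSet and rely on the next reconcile for the rest.
func pickRackToReconcile(desired []Rack, observed map[string]int32) (Rack, bool) {
	for _, r := range desired {
		if observed[r.Name] != r.Replicas {
			return r, true
		}
	}
	return Rack{}, false
}

func main() {
	desired := []Rack{{Name: "rack1", Replicas: 3}, {Name: "rack2", Replicas: 3}}
	observed := map[string]int32{"rack1": 3, "rack2": 2}
	if r, ok := pickRackToReconcile(desired, observed); ok {
		fmt.Printf("reconcile %s: scale to %d replicas\n", r.Name, r.Replicas)
	}
}
```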

@alourie (author) commented Jul 27, 2019

Thanks @zegelin.

  1. I will look into dropping the rack/DC data into the pod/C* container and configuring it either via script or snitch. This will eliminate the need for multiple ConfigMaps and will get us a bit closer to doing it all with one StatefulSet (not yet really possible).
  2. Yes, Rack is better; I only recently introduced it as a convenience and haven't gotten around to fixing all the places yet. Will do so.
  3. We don't need to loop over racks to reconcile because we get that for free from k8s, which will keep calling the Reconciler until all the racks (StatefulSets) are reconciled. So we only need to find any one rack that needs reconciling and handle it.

Also, at the moment none of this would support upgrades or config changes (such as a different image URL/version, hardware changes, or whatever): if there is no change in replica numbers, nothing will be done. I will remove that check and see how much work is needed to take care of that (considering that for some of those changes we would need rolling restarts of the C* containers). Should we support it for the "first cut"?
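
On that last point, a rough sketch of the broader check being described: comparing more than just replica counts (for example the container image) when deciding whether a rack needs reconciling. The type and field names here are illustrative assumptions, not the operator's real API:

```go
package main

import "fmt"

// rackState is an illustrative snapshot of one rack's StatefulSet; a real
// controller would read these values from the StatefulSet's pod template.
type rackState struct {
	Replicas int32
	Image    string
}

// needsReconcile reports whether a rack differs from the desired state in
// replica count *or* image, so image/config changes aren't silently skipped
// just because the replica numbers happen to match.
func needsReconcile(desired, observed rackState) bool {
	return desired.Replicas != observed.Replicas || desired.Image != observed.Image
}

func main() {
	desired := rackState{Replicas: 3, Image: "cassandra:3.11.4"}
	observed := rackState{Replicas: 3, Image: "cassandra:3.11.3"}
	fmt.Println("needs reconcile:", needsReconcile(desired, observed)) // true
}
```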

Signed-off-by: Alex Lourie <[email protected]>
@@ -12,4 +12,7 @@ do
done
)

# Update the rack settings from env
sed -i'' "s/rack=.*$/rack=${CASSANDRA_RACK}/g" /etc/cassandra/cassandra-rackdc.properties

@alourie (author):
This is just one possible solution, to see whether it works. It does.

@@ -46,6 +46,9 @@ spec:
type: integer
prometheusSupport:
type: boolean
racks:

@benbromhead (Collaborator):
How do we know which racks the C* cluster is using? We need to make sure we can assign to the correct user-defined fault domains in Kubernetes. See https://kubernetes.io/docs/setup/best-practices/multiple-zones/

Each StatefulSet should apply labels via the pod template to assign which failure domain the StatefulSet is applied to. See https://kubernetes.io/docs/reference/kubernetes-api/labels-annotations-taints/#failure-domainbetakubernetesiozone

In terms of the CRD definition, the user should at a minimum be able to specify which failure domains are available to the cluster. That way we can create the right number of StatefulSets and ensure the correct labels are applied to them.
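
As a hedged illustration of that suggestion (using the k8s.io/api types; the label key follows the docs linked above, while the builder function and cluster/rack labels are assumptions rather than the operator's real code), each rack's StatefulSet could pin its pods to a failure domain via a node selector:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// rackStatefulSet sketches how a per-rack StatefulSet could carry rack labels
// and a node selector on the zone label so its pods land in one failure domain.
// Containers, volumes, service name, etc. are omitted for brevity.
func rackStatefulSet(cluster, rack, zone string, replicas int32) *appsv1.StatefulSet {
	labels := map[string]string{
		"cassandra-cluster": cluster,
		"cassandra-rack":    rack,
	}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: cluster + "-" + rack, Labels: labels},
		Spec: appsv1.StatefulSetSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Pin this rack's pods to the requested failure domain.
					NodeSelector: map[string]string{
						"failure-domain.beta.kubernetes.io/zone": zone,
					},
				},
			},
		},
	}
}

func main() {
	sts := rackStatefulSet("my-cluster", "rack1", "us-east-1a", 3)
	fmt.Println(sts.Name, "->", sts.Spec.Template.Spec.NodeSelector)
}
```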

Signed-off-by: Alex Lourie <[email protected]>
@alourie (author) commented Aug 9, 2019

@z @benbromhead the latest version seems to be a working model with one ConfigMap and StatefulSet per rack. Rack selection is performed based on user input and used as label selectors on the nodes; the replica distribution is calculated automatically and applied separately. Rack scale-up/decommission is performed with a round-robin algorithm.
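
A minimal, self-contained sketch of that round-robin distribution idea, spreading N Cassandra nodes across the user-supplied racks so counts never differ by more than one. This illustrates the approach described above, not the PR's actual code:

```go
package main

import "fmt"

// distributeReplicas spreads `total` Cassandra nodes across racks round-robin,
// so no rack ends up with more than one node above any other. Scaling down
// would walk the same order in reverse to pick the rack to decommission from.
func distributeReplicas(racks []string, total int32) map[string]int32 {
	counts := make(map[string]int32, len(racks))
	if len(racks) == 0 {
		return counts
	}
	for i := int32(0); i < total; i++ {
		counts[racks[int(i)%len(racks)]]++
	}
	return counts
}

func main() {
	fmt.Println(distributeReplicas([]string{"rack1", "rack2", "rack3"}, 7))
	// prints: map[rack1:3 rack2:2 rack3:2]
}
```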

@alourie (author) commented Aug 9, 2019

Merge conflict resolved; ready to merge.

@alourie (author) commented Aug 20, 2019

Merge conflicts resolved again; this can be squashed and merged.

@benbromhead (Collaborator):
@alourie can you rebase and I'll merge

@benbromhead (Collaborator):
Due to the PR for secrets, etc.

@benbromhead merged commit 1d075a2 into instaclustr:superdupertopsecretrewrite on Aug 21, 2019
@alourie mentioned this pull request on Aug 22, 2019