Statefulsets for cinder: allow multi-AZ deployments, spread pods across zones #44798
Conversation
Hi @zetaab. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `@k8s-bot ok to test`. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Actually I was wrong, the cloud provider nowadays adds the labels for zones: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go#L284 So in the GetAllZones() function, should I maybe filter through the nodes and check those labels instead of calling the OpenStack API? If so, which library should I use and how?
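For illustration, a minimal sketch of the node-label approach asked about above: it assumes a node list obtained elsewhere (for example via client-go or an informer) and the well-known `failure-domain.beta.kubernetes.io/zone` label that the cloud provider stamps onto nodes; the function name `zonesFromNodes` is hypothetical and not code from this PR.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// zonesFromNodes collects the set of availability zones the cluster spans by
// reading the zone label attached to each registered node.
func zonesFromNodes(nodes []v1.Node) sets.String {
	zones := sets.NewString()
	for _, node := range nodes {
		if zone, ok := node.Labels["failure-domain.beta.kubernetes.io/zone"]; ok && zone != "" {
			zones.Insert(zone)
		}
	}
	return zones
}

func main() {
	// Hypothetical node list; in a real controller this would come from the API server.
	nodes := []v1.Node{}
	fmt.Println("zones:", zonesFromNodes(nodes).List())
}
```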
@liggitt could you take a look, it looks like rootfs has been away for a while.
I'm not the right person to review this... I'd suggest some intersection of @kubernetes/sig-storage-pr-reviews and @kubernetes/sig-openstack-pr-reviews
And @kubernetes/sig-apps-api-reviews since this affects statefulsets
Leaving the default zone to nova is a problem. AWS/GCE use a heuristic that hashes the PVC name to a zone. Cinder could use the same idea for this issue.
also see #41598
Hi @rootfs, this PR actually uses the same heuristic that is used in AWS/GCE. So if you do not specify an AZ in the cinder storageclass, it uses the same algorithm to choose the correct zone. By using that algorithm and labeling the PVs we can spread pods across multiple zones.
@zetaab I see it now, cinder_util.go is modified twice in 1st and 3rd commits. Can you merge these two commits? Please also use camel case for consistency. @kubernetes/sig-openstack-pr-reviews please review gophercloud update.
@justinsb for a drive-by on AZ spreading of volumes
@rootfs @justinsb It uses the same function, ChooseZoneForVolume, used by the GCE and AWS cloud providers, to implement the round-robin balancing. It will have, for better or worse, the same benefits and limitations.
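For context, a simplified sketch of the hash-plus-ordinal spreading idea mentioned above: StatefulSet PVCs share a base name and differ only in a trailing ordinal, so hashing the base and offsetting by the ordinal walks the sorted zone list round-robin. The real ChooseZoneForVolume helper is more involved; `chooseZoneSketch` below is an illustration only, not the in-tree implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// chooseZoneSketch picks a zone for a PVC by hashing the PVC's base name and
// adding its trailing ordinal, so consecutive StatefulSet ordinals land in
// consecutive zones.
func chooseZoneSketch(zones []string, pvcName string) string {
	if len(zones) == 0 {
		return "" // callers are expected to pass at least one zone
	}
	sort.Strings(zones)

	// Split "storage-mysql-2" into base "storage-mysql" and ordinal 2.
	base, ordinal := pvcName, 0
	if i := strings.LastIndex(pvcName, "-"); i != -1 {
		if n, err := strconv.Atoi(pvcName[i+1:]); err == nil && n >= 0 {
			base, ordinal = pvcName[:i], n
		}
	}

	h := fnv.New32a()
	h.Write([]byte(base))
	idx := (int(h.Sum32()%uint32(len(zones))) + ordinal) % len(zones)
	return zones[idx]
}

func main() {
	zones := []string{"zone-a", "zone-b", "zone-c"}
	for i := 0; i < 3; i++ {
		pvc := fmt.Sprintf("storage-mysql-%d", i)
		fmt.Println(pvc, "->", chooseZoneSketch(zones, pvc))
	}
}
```

With three zones, storage-mysql-0/1/2 map to three different zones, which is the spreading behaviour being discussed.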
@rootfs 1st and 3rd commits are now merged. I did not fully get where I am missing camelCase?
@kow3ns yes, the volumes are zonal resources and that is why I added labels to the PVs. Once a PV has a zone label, the pod is automatically scheduled in the same zone (I tested it by shutting down all nodes in a zone, and after that the pod could not be scheduled because there were no available nodes with those labels). The bad thing about the current solution is that dynamically provisioned cinder PVs do not get labels. In our case that means a PV can be provisioned in, for example, zone x while the pod itself is started in zone y. Because our cinder is a zonal resource, that does not work at all. Cinder can be configured to allow it, so in some use cases people can use cinder resources from another zone; in my opinion that is a bad way of doing things, because it creates unnecessary network traffic between zones. If people really want to provision the PV in zone x and run the pod in zone y, they need to remove the PV labels, but as I see it that is really bad practice. What do you think? Should we have a way to configure whether the code adds PV labels or not?
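To make the labeling point concrete, a minimal hypothetical sketch of stamping the zone label onto a PV object; `labelPVWithZone` is not the provisioner's actual code path, it only shows the well-known label that the scheduler's volume zone predicate uses to keep a pod on nodes in the volume's zone.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// labelPVWithZone adds the zone label to a PersistentVolume so that pods
// using it are scheduled only onto nodes carrying the same zone label.
func labelPVWithZone(pv *v1.PersistentVolume, zone string) {
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	pv.Labels["failure-domain.beta.kubernetes.io/zone"] = zone
}

func main() {
	pv := &v1.PersistentVolume{ObjectMeta: metav1.ObjectMeta{Name: "pv-cinder-example"}}
	labelPVWithZone(pv, "zone-a")
	fmt.Println(pv.Name, pv.Labels)
}
```

Removing that label is what would re-enable cross-zone attachment, which is the trade-off debated in this thread.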
pkg/volume/cinder/cinder_util.go (outdated)
glog.V(2).Infof("Successfully created cinder volume %s", name)
return name, volSizeGB, nil

volume_id, volume_az, errr := cloud.CreateVolume(name, volSizeGB, vtype, availability, c.options.CloudTags)
nit: volumeId, volumeAZ
@k8s-bot ok to test
Cinder fix lgtm. Anybody from @kubernetes/sig-openstack-pr-reviews want to check the gophercloud change?
@pospispa I can update it; it should be something like: availability: Availability Zone. If not specified, the zone is chosen by an algorithm from the zones where the Kubernetes cluster has a node. zetaab/kubernetes.github.io@b5b384d I will open the docs PR after this one is merged.
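As a concrete illustration of the availability parameter being documented here, a small sketch constructing the two storage-class variants with the storage/v1 Go types (in practice these would be written as YAML, like the example in the PR description below; the object names are made up):

```go
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Explicit zone: every volume from this class is provisioned in "nova".
	pinned := storagev1.StorageClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "cinder-nova"},
		Provisioner: "kubernetes.io/cinder",
		Parameters:  map[string]string{"availability": "nova"},
	}

	// No availability parameter: the zone is chosen by the spreading
	// algorithm from the zones where the cluster has nodes.
	spread := storagev1.StorageClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "all"},
		Provisioner: "kubernetes.io/cinder",
	}

	fmt.Println(pinned.Name, pinned.Parameters)
	fmt.Println(spread.Name, spread.Parameters)
}
```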
update aws+gce availability texts, the algorithm is not random
/lgtm
/assign @jsafrane @mikedanese
/approve
@jsafrane could you please approve
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dims, jsafrane, mikedanese, pospispa, zetaab
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing `/approve` in a comment.
Automatic merge from submit-queue
@zetaab will you please create the corresponding documentation PR for this PR?
@pospispa it's already done: kubernetes/website#3693
…in Storage Class is not Configured Backport of Kubernetes PR #44798 (kubernetes/kubernetes#44798). In case the availability parameter is not configured in a cinder Storage Class, the cinder volume is always provisioned in the nova availability zone. That is incorrect. Now, the cinder volume is provisioned in a zone that is chosen by an algorithm from the set of zones available in the cluster. Positive side effect: cinder volumes for individual pods in a StatefulSet are provisioned in unique zones. This increases StatefulSet resilience.
Automatic merge from submit-queue

Statefulsets for cinder: allow multi-AZ deployments, spread pods across zones

**What this PR does / why we need it**: Currently, if we do not specify an availability zone in the cinder storageclass, the volume is provisioned to a zone called nova. However, as mentioned in the issue, we want to spread a StatefulSet across 3 different zones. Currently this is not possible with StatefulSets and the cinder storageclass. With this new solution, if the zone is left empty, the algorithm chooses the zone for the cinder volume in a similar style to the AWS and GCE storageclass solutions.

**Which issue this PR fixes**: fixes kubernetes#44735

**Special notes for your reviewer**:

example:

```
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: all
provisioner: kubernetes.io/cinder
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  name: galera
  labels:
    app: mysql
spec:
  ports:
  - port: 3306
    name: mysql
  clusterIP: None
  selector:
    app: mysql
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: "galera"
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
    spec:
      containers:
      - name: mysql
        image: adfinissygroup/k8s-mariadb-galera-centos:v002
        imagePullPolicy: Always
        ports:
        - containerPort: 3306
          name: mysql
        - containerPort: 4444
          name: sst
        - containerPort: 4567
          name: replication
        - containerPort: 4568
          name: ist
        volumeMounts:
        - name: storage
          mountPath: /data
        readinessProbe:
          exec:
            command:
            - /usr/share/container-scripts/mysql/readiness-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
  volumeClaimTemplates:
  - metadata:
      name: storage
      annotations:
        volume.beta.kubernetes.io/storage-class: all
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 12Gi
```

If this example is deployed, it will automatically create one replica per AZ. This helps us a lot in making HA databases.

The current storageclass for cinder is not perfect in the case of StatefulSets. Assume the cinder storageclass is defined to be in the zone called nova; because labels are not added to the PV, pods can be started in any zone. The problem is that, at least in our OpenStack, it is not possible to use a cinder volume located in zone x from zone y. However, should we offer the possibility of cross-zone cinder mounts at all? In my opinion mounting a volume from a different zone than the pod's is not a good way of doing things (it means more network traffic between zones). What do you think? The new solution does not allow it anymore (should we allow it? that would mean removing the labels from the PV).

There might be some things that still need to be fixed in this release, and I need help with that. Some parts of the code are not perfect.

Issues I am thinking about (I need some help with these):

1) Can everybody in OpenStack see which AZ their servers are in? Can there be an access policy that hides it? If the AZ is not found in the server specs, I have no idea how the code behaves.
2) In the GetAllZones() function, is it really needed to create a new service client using openstack.NewComputeV2, or could I somehow use the existing one?
3) This fetches all servers from an OpenStack tenant (project). However, in some cases Kubernetes may be deployed only to a specific zone. If the kube servers are located, for instance, in zone 1 and there are other servers in the same tenant in zone 2, a cinder volume might be provisioned in zone 2 but the pod cannot start, because Kubernetes does not have any nodes in zone 2. Could we have a better way to fetch the zones of the Kubernetes nodes? Currently that information is not added to the Kubernetes node labels automatically in OpenStack (which I think it should be); I have added those labels to the nodes manually. If that zone information is not added to the nodes, the new solution does not start stateful pods at all, because it cannot target them.

cc @rootfs @anguslees @jsafrane

```release-note
Default behaviour in the cinder storageclass is changed. If availability is not specified, the zone is chosen by an algorithm. This makes it possible to spread stateful pods across many zones.
```