Skip to content

Comments

Add validation webhook deployment#62

Merged
openshift-merge-robot merged 4 commits intoopenshift:masterfrom
jsafrane:add-webhook
Dec 4, 2020
Merged

Add validation webhook deployment#62
openshift-merge-robot merged 4 commits intoopenshift:masterfrom
jsafrane:add-webhook

Conversation

@jsafrane
Copy link
Contributor

@jsafrane jsafrane commented Nov 9, 2020

Add deployment of upstream validation webhook as a separate Controller that handles webhook Deployment.

  • Move OperatorClient to a standalone package/
  • Since there are two controllers running in parallel now, make sure that whichever controller finishes the first sync, it sets also Unknown condition of the other controller. Otherwise the whole operator could be Available: true too early.

ValidatingWebhookConfiguration is deployed by CVO, which is technically wrong - it will overwrite caBundle injected by another operator. There is https://github.com/openshift/library-go/pull/836/files in progress that should fix this issue.

WIP:

  • get downstream image and update related-images We have the images now.

@openshift/storage

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 9, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2020
@jsafrane
Copy link
Contributor Author

jsafrane commented Nov 9, 2020

/retest

@jsafrane jsafrane force-pushed the add-webhook branch 5 times, most recently from 2b3e04a to 58d787f Compare November 30, 2020 14:10
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 30, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 30, 2020
@jsafrane
Copy link
Contributor Author

jsafrane commented Dec 2, 2020

/retest

@jsafrane jsafrane changed the title WIP: Add validation webhook deployment \Add validation webhook deployment Dec 2, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2020
@jsafrane jsafrane changed the title \Add validation webhook deployment Add validation webhook deployment Dec 2, 2020
@jsafrane
Copy link
Contributor Author

jsafrane commented Dec 2, 2020

/hold
we need ART images built, should be tomorrow.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 2, 2020
labels:
app: csi-snapshot-webhook
annotations:
service.beta.openshift.io/inject-cabundle: "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to specify the self-managed-high-availability profile? Same question for the Service manifest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added


queue workqueue.RateLimitingInterface

stopCh <-chan struct{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary since we're using library-go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

opSpec, opStatus, _, err := c.client.GetOperatorState()
if err != nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's a good practice to check for NotFound errors here (even though we don't check in the other controller)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return factory.New().WithSync(c.sync).WithSyncDegradedOnError(client).WithInformers(
client.Informer(),
deployInformer.Informer(),
).ToController(WebhookControllerName, eventRecorder)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be helpful to add a suffix to this recorder in order to distinguish this controller (eventrecorder.WithComponentSuffix(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

return err
}
lastGeneration := resourcemerge.ExpectedDeploymentGeneration(deployment, opStatus.Generations)
deployment, _, err = resourceapply.ApplyDeployment(c.kubeClient.AppsV1(), c.eventRecorder, deployment, lastGeneration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use syncContext.Recorder() to get the right event recover (and also get rid of the one specified in csiSnapshotWebhookController

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

deploymentAvailable.Status = operatorapi.ConditionTrue
} else {
deploymentAvailable.Status = operatorapi.ConditionFalse
deploymentAvailable.Reason = "WaitDeployment"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In library-go the reason was set to Deploying: bertinatto/library-go@fa1bce1

Do we want to keep the Reason consistent with CSI driver operators?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used Deploying everywhere

}
if deployment.Status.ObservedGeneration != deployment.Generation {
deploymentProgressing.Status = operatorapi.ConditionTrue
deploymentProgressing.Reason = "NewGeneration"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about the Reason

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used Deploying everywhere

if deployment.Status.UpdatedReplicas == *deployment.Spec.Replicas {
deploymentProgressing.Status = operatorapi.ConditionFalse
// All replicas were updated, set the version
c.versionGetter.SetVersion(webhookVersionName, c.operandVersion)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory Available could still be false here. In that case, setting the operand version here would be incorrect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should only report operator version only after all operands have been rolled out:

https://github.com/openshift/api/blob/master/config/v1/types_cluster_operator.go#L45

However, currently we report the operator version in the other controller, which has no idea if the webhook operand has been rolled out yet.

How can we solve this? Another controller only to set the operator version? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sets its own version, webhookVersionName - versions is an array. So there are two distinct version, which can should converge to the same value. Honestly, I don't know what CVO does with this array :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used Deploying everywhere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I'm concerned about 2 things:

  1. Setting webhookVersionName should be OK here, but I think we need to make sure we do so only when Progressing=false and Available=true. Currently it's setting when Progressing=false, which is too soon.

  2. Since we're introducing a new operand (the webhook), we need to change this part of the code to only set the operator when both operands (webhook and the snapshot-controller) have rolled out (currently it only accounts for the snapshot-controller):

https://github.com/openshift/cluster-csi-snapshot-controller-operator/blob/master/pkg/operator/sync.go#L170

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely removed the version, I need to think about it more.

}

updateGenerationFn := func(newStatus *operatorapi.OperatorStatus) error {
if deployment != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it have panicked above if deployment were nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would :-)
Removed the check... What could possibly go wrong, right? :-D

); err != nil {
return err
}
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just return err

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

tolerations:
- key: "node-role.kubernetes.io/master"
operator: Exists
effect: NoSchedule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CriticalAddonsOnly toleration?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the other operand deployment has a few other tolerations. I suppose it makes sense they are consistent?

- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 120
- key: "node.kubernetes.io/not-ready"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 120
- key: node-role.kubernetes.io/master
operator: Exists
effect: "NoSchedule"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied tolerations from the controller

resources:
requests:
cpu: 10m
serviceAccountName: csi-snapshot-controller-operator
Copy link
Member

@bertinatto bertinatto Dec 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what we want (csi-snapshot-controller-operator)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The webhook does not talk to Kubernetes API at all. Should it have a separate service account?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this webhook have a SA at all?

The multus webhook doesn't need one: https://github.com/openshift/multus-cni/blob/master/deployment/webhook/deployment.yaml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the service account, it will get the default one from the namespace.

@jsafrane
Copy link
Contributor Author

jsafrane commented Dec 3, 2020

/retest

@jsafrane
Copy link
Contributor Author

jsafrane commented Dec 3, 2020

/hold cancel
Brew says it was built for rhaos-4.7-rhel-8-candidate.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
@jsafrane jsafrane force-pushed the add-webhook branch 2 times, most recently from f28293e to d5f6409 Compare December 3, 2020 14:13
func TestSync(t *testing.T) {
const replica0 = 0
const replica1 = 1
const replica2 = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're not using these consts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Copy link
Member

@bertinatto bertinatto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2020
@bertinatto
Copy link
Member

/retest

Copy link
Member

@bertinatto bertinatto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold

To check the secret issue.

volumes:
- name: certs
secret:
secretName: csi-snapshot-webhook-secret
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this being created?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From service.beta.openshift.io/inject-cabundle annotation,

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm Indicates that a PR is ready to be merged. labels Dec 3, 2020
Don't use the controller image for webhook Deployment.
@bertinatto
Copy link
Member

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
@jsafrane
Copy link
Contributor Author

jsafrane commented Dec 3, 2020

/test e2e-aws

@jsafrane
Copy link
Contributor Author

jsafrane commented Dec 3, 2020

The webhook pod fails to start:

panic: listen tcp :443: bind: permission denied

Use port 8443 to listen for the webhook.
Copy link
Member

@bertinatto bertinatto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2020
@openshift-merge-robot openshift-merge-robot merged commit 332a311 into openshift:master Dec 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants