Skip to content

Conversation

@ravisantoshgudimetla
Copy link
Contributor

@ravisantoshgudimetla ravisantoshgudimetla commented Jul 8, 2019

- What I did
Added logic in node controller to watch for scheduler CR changes and update the master nodes to make them schedulable and unschedulable

- How to verify it

I did not test it. I want to check if the approach seems reasonable before I proceed further. Please review the c82786d

The e2e test verifies if the worker label has been added and master taint has been removed for all the masters when scheduler CR's mastersSchedulable field has been set or unset. That field was added to Scheduler CRD here openshift/api#366

- Description for the changelog

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 8, 2019
node.Spec.Taints = newTaints
})
if err != nil {
glog.Errorf("error making node %v schedulable with error %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general rule, rather than logging and continuing, please return the error and let the higher level logic retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, it might be different - we cannot return error when it fails for a single node, perhaps we need to capture slice of errors for individual nodes and return. I don't have a strong preference but I see your point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot return error when it fails for a single node,

Why not?

perhaps we need to capture slice of errors for individual nodes and return.

Yes, that's a better pattern indeed, but I think "fast fail for single item" is still better than "log and continue".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have done slice collection and then returned because in the next iteration it could cause problems with other node. Let me know, if you prefer it other way

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2019
@ravisantoshgudimetla ravisantoshgudimetla force-pushed the add-scheduler-watcher branch 2 times, most recently from 545902f to d23ccf7 Compare July 9, 2019 18:53
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2019
@@ -14,5 +14,5 @@ rules:
resources: ["configmaps", "secrets"]
verbs: ["*"]
- apiGroups: ["config.openshift.io"]
resources: ["images", "clusterversions", "featuregates"]
resources: ["images", "clusterversions", "featuregates", "schedulers"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be changed futher to just list, instead of giving *

@runcom
Copy link
Member

runcom commented Jul 10, 2019

@ravisantoshgudimetla this patch looks good - is there some testing I can perform manually?

@runcom
Copy link
Member

runcom commented Jul 10, 2019

--- FAIL: TestMastersSchedulable (1.15s)
    mco_test.go:52: Error while updating scheduler CR

@ravisantoshgudimetla
Copy link
Contributor Author

/test e2e-aws-upgrade

1 similar comment
@ravisantoshgudimetla
Copy link
Contributor Author

/test e2e-aws-upgrade

@ravisantoshgudimetla ravisantoshgudimetla force-pushed the add-scheduler-watcher branch 2 times, most recently from 364cabb to 2cc6a1e Compare July 10, 2019 21:47
@ravisantoshgudimetla
Copy link
Contributor Author

/retest

@ravisantoshgudimetla
Copy link
Contributor Author

/retest

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/deployments.go:167]: Unexpected error:

fail [k8s.io/kubernetes/test/e2e/framework/framework.go:338]: Jul 11 01:26:53.041: Couldn't delete ns: "svcaccounts-4488": namespace svcaccounts-4488 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace svcaccounts-4488 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})

@ravisantoshgudimetla
Copy link
Contributor Author

Jul 11 01:55:14.869 E clusterversion/version changed Failing to True: UpdatePayloadFailed: Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (359 of 401)

@runcom
Copy link
Member

runcom commented Jul 11, 2019

/test e2e-aws

verify needs some care

@ravisantoshgudimetla ravisantoshgudimetla force-pushed the add-scheduler-watcher branch 2 times, most recently from 57995a9 to f6b7b5f Compare July 11, 2019 12:29
@cgwalters
Copy link
Member

cgwalters commented Jul 11, 2019

Overall this looks good, thanks for working on it! Can you please squash it into one commit - since this is really one logical change. And also now the last commit has intermixed fixups from review. And please write a commit message, something like:

Watch scheduler CR and make masters schedulable accordingly

By default, the MCO's kubelet configuration injects a taint to disable scheduling. This adds support in the MCO for watching the API recently added to allow the masters to be schedulable. openshift/api#366

Use cases here are for 3 node bare metal clusters (which have significant resources on the masters), as well as CodeReady Containers which wants to make a single node OpenShift as a VM.

Closes: #763

@ravisantoshgudimetla
Copy link
Contributor Author

Thanks for the review @runcom @cgwalters @kikisdeliveryservice. I made the changes suggested. PTAL

@ravisantoshgudimetla
Copy link
Contributor Author

/retest

@kikisdeliveryservice
Copy link
Contributor

this lgtm

will let @cgwalters / @runcom give the final approval

@runcom
Copy link
Member

runcom commented Jul 12, 2019

needs a rebase :(

@cgwalters
Copy link
Member

needs a rebase :(

The bright side is it's an opportunity to use the 🏄‍♂️ emoji

@ravisantoshgudimetla
Copy link
Contributor Author

Rebased 🙈

@kikisdeliveryservice
Copy link
Contributor

ci didn't like that last commit :(

@kikisdeliveryservice
Copy link
Contributor

uhoh hitting limits now:

./tmp/openshift-install-050178053/vpc/master-elb.tf line 1, in resource \"aws_lb\" \"api_internal\":"

@kikisdeliveryservice
Copy link
Contributor

reported but ci has been hitting this for the last few hours.

@ravisantoshgudimetla
Copy link
Contributor Author

/retest

@cgwalters
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2019
By default, the MCO's kubelet configuration injects a taint to disable scheduling. This adds
support in the MCO for watching the API recently added to allow the masters
to be schedulable. openshift/api#366

Use cases here are for 3 node bare metal clusters (which have significant resources on the masters),
as well as CodeReady Containers which wants to make a single node OpenShift as a VM.

Closes: openshift#763
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2019
@cgwalters
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, ravisantoshgudimetla, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kikisdeliveryservice
Copy link
Contributor

kikisdeliveryservice commented Jul 12, 2019

Still seeing ci problems:

level=error
level=error msg="  on ../tmp/openshift-install-001008030/vpc/master-elb.tf line 19, in resource \"aws_lb\" \"api_external\":"
level=error msg="  19: resource \"aws_lb\" \"api_external\" {" 

/retest

@ravisantoshgudimetla
Copy link
Contributor Author

/retest

@openshift-merge-robot openshift-merge-robot merged commit 8367a76 into openshift:master Jul 13, 2019
russellb added a commit to russellb/machine-config-operator that referenced this pull request Oct 16, 2019
The purpose of this change was to make masters able to run workloads
by default.  This is needed to complete a successful deployment of a
3-node bare metal install.  This particular approach was only short
term, while better interfaces were developed to control this behavior.

The scheduler configuration resource now includes a
"mastersSchedulable" boolean, enabled here:

openshift#937

This installer PR made it the default behavior if no workers were
defined at install time:

openshift/installer#2004

With these changes in place, the custom kubelet config for the
baremetal platform is no longer necessary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants