Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
85 changes: 84 additions & 1 deletion CONVENTIONS.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ components that make up the project, but exceptions are allowed when necessary.

The conventions are intended to help with consistency across the project. Users
of the platform expect consistency in the experience and operation of the cluster.
Developers of the platform expect consistency in the code to quickly identify issuses
Developers of the platform expect consistency in the code to quickly identify issues
across codebases. Consistency enables shared understanding, simplifies explanations,
and reduces accidental complexity.

Expand Down Expand Up @@ -85,3 +85,86 @@ metal3-io](https://github.com/metal3-io/metal3-docs/blob/master/design/bare-meta
### API

OpenShift APIs follow the [Kubernetes API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md).

### Operators

The OpenShift project is an early adopter of, and makes extensive use of, [the
operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/),
and so it is incumbent on us to establish some conventions around operators.

#### Taints and Tolerations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjenning You worked a lot on tolerations as I recall. ptal


An operator deployed by the CVO should run on master nodes and therefore should
tolerate the following taint:

* `node-role.kubernetes.io/master`

For example:

```yaml
spec:
template:
spec:
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
```

Tolerating this taint should suffice for the vast majority of core OpenShift
operators. In exceptional cases, an operator may tolerate one or more of the
following taints if doing so is necessary to form a functional Kubernetes node:

* `node.kubernetes.io/disk-pressure`
* `node.kubernetes.io/memory-pressure`
* `node.kubernetes.io/network-unavailable`
* `node.kubernetes.io/not-ready`
* `node.kubernetes.io/pid-pressure`
* `node.kubernetes.io/unreachable`

Operators should not specify tolerations in their manifests for any of the taints in
the above list without an explicit and credible justification.

When an operator configures its operand, the operator likewise may specify
tolerations for the aforementioned taints but should do so only as necessary and only
with explicit justification.

Note that the DefaultTolerationSeconds and PodTolerationRestriction admission plugins
may add time-bound tolerations to an operator or operand in addition to the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember @sjenning getting worked up over this, but I have no memory of what the details were

tolerations that the operator has specified.

If appropriate, a CRD that corresponds to an operand may provide an API to allow
users to specify a custom list of tolerations for that operand. For examples, see
the
[imagepruners.imageregistry.operator.openshift.io/v1](https://github.com/openshift/api/blob/34f54f12813aaed8822bb5bc56e97cbbfa92171d/imageregistry/v1/types_imagepruner.go#L67-L69),
[configs.imageregistry.operator.openshift.io/v1](https://github.com/openshift/api/blob/34f54f12813aaed8822bb5bc56e97cbbfa92171d/imageregistry/v1/types.go#L82-L84),
[builds.config.openshift.io/v1](https://github.com/openshift/api/blob/34f54f12813aaed8822bb5bc56e97cbbfa92171d/config/v1/types_build.go#L96-L99),
and
[ingresscontrollers.operator.openshift.io/v1](https://github.com/openshift/api/blob/34f54f12813aaed8822bb5bc56e97cbbfa92171d/operator/v1/types_ingress.go#L183-L191)
APIs.

In exceptional cases, an operand may tolerate all taints:

* if the operand is required to form a functional Kubernetes node, or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for being late on reviewing this PR.

To me this doesn't warrant operand to have blanket tolerations. This means we should be explicit on the tolerations that need to matched for the pod to keep running.

Some of the side effects that we noticed with this blanket toleration are

  • node controller wouldn't evict the pods when the node is marked as not ready. We faced this in the past. So allowing allowing blanket tolerations might cause problems in future.
  • In future, we'd like to add Windows nodes to OpenShift clusters. The current way to ensure that the Windows workloads lands on Windows nodes and linux workloads lands on linux nodes is via nodeSelector and taints&tolerations. In a way, with this we are allowing operand or operator to tolerate Windows Specific taints and let them land on the Windows nodes which will cause problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @ravisantoshgudimetla. Having an operand tolerate all taints will definitely cause issues on clusters that have enabled Windows workloads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this doesn't warrant operand to have blanket tolerations. This means we should be explicit on the tolerations that need to matched for the pod to keep running.

Users can set arbitrary taints, so prohibiting operands from specifying blanket tolerations would require either (a) requiring users to specify operands' tolerations too or (b) restricting the user to some set of predefined taints.

Some of the side effects that we noticed with this blanket toleration are

  • node controller wouldn't evict the pods when the node is marked as not ready. We faced this in the past. So allowing allowing blanket tolerations might cause problems in future.

That Bugzilla report seems to be about operators, not about operands that need to run on each node. For example, cluster-dns-operator only needs to run on some node somewhere, but the dns-node-resolver container in its daemonset operand must run any given node in order for that node to be fully functional; does specifying a blanket toleration on that operand cause a problem?

  • In future, we'd like to add Windows nodes to OpenShift clusters. The current way to ensure that the Windows workloads lands on Windows nodes and linux workloads lands on linux nodes is via nodeSelector and taints&tolerations. In a way, with this we are allowing operand or operator to tolerate Windows Specific taints and let them land on the Windows nodes which will cause problems.

Why do you need both a node selector and a taint? Why isn't a node selector sufficient? For example, cluster-dns-operator creates a daemonset that tolerates all taints but also specifies a node selector: kubernetes.io/os: linux. Would it help to add a section on node selectors with a statement that an operator or operand that requires a specific operating system must specify an appropriate node selector?

* if the operand is required to support workloads sourced from an internal or external registry that core components depend upon,

then the operand should tolerate all taints:

```yaml
spec:
template:
spec:
tolerations:
- operator: Exists
```

An example of an operand that matches the first case is kube-proxy, which is required
for services to work. An example of an operand that matches the second case is the
DNS node resolver, which adds an entry to the `/etc/hosts` file on all node hosts so
that the container runtime is able to resolve the name of the cluster image registry;
absent this entry in `/etc/hosts`, upgrades could fail to pull images of core
components.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think following what's written here, we need to include the MCO as it's responsible for the overall node lifecycle and thus, the node cannot be properly configured otherwise (kubelet among others) https://bugzilla.redhat.com/show_bug.cgi?id=1780318


If an operand meets neither of the two conditions listed above, it must not tolerate
all taints. This constraint is enforced by [a CI test
job](https://github.com/openshift/origin/blob/7d07adcf518a846b898cd0958b85f2daf624476a/test/extended/operators/tolerations.go).