@@ -698,133 +698,29 @@ The deprecation strategy is described in the OVN-Kubernetes
698698
699699## Upgrade / Downgrade Strategy
700700
701- If applicable, how will the component be upgraded and downgraded? Make sure this
702- is in the test plan.
703-
704- Consider the following in developing an upgrade/downgrade strategy for this
705- enhancement:
706- - What changes (in invocations, configurations, API use, etc.) is an existing
707- cluster required to make on upgrade in order to keep previous behavior?
708- - What changes (in invocations, configurations, API use, etc.) is an existing
709- cluster required to make on upgrade in order to make use of the enhancement?
710-
711- Upgrade expectations:
712- - Each component should remain available for user requests and
713- workloads during upgrades. Ensure the components leverage best practices in handling [voluntary
714- disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
715- this should be identified and discussed here.
716- - Micro version upgrades - users should be able to skip forward versions within a
717- minor release stream without being required to pass through intermediate
718- versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
719- as an intermediate step.
720- - Minor version upgrades - you only need to support `x.N->x.N+1` upgrade
721- steps. So, for example, it is acceptable to require a user running 4.3 to
722- upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
723- - While an upgrade is in progress, new component versions should
724- continue to operate correctly in concert with older component
725- versions (aka "version skew"). For example, if a node is down, and
726- an operator is rolling out a daemonset, the old and new daemonset
727- pods must continue to work correctly even while the cluster remains
728- in this partially upgraded state for some time.
729-
730- Downgrade expectations:
731- - If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is
732- misbehaving, it should be possible for the user to rollback to `N`. It is
733- acceptable to require some documented manual steps in order to fully restore
734- the downgraded cluster to its previous state. Examples of acceptable steps
735- include:
736- - Deleting any CVO-managed resources added by the new version. The
737- CVO does not currently delete resources that no longer exist in
738- the target version.
701+ N/A
739702
740703## Version Skew Strategy
741704
742705N/A
743706
744707## Operational Aspects of API Extensions
745708
746- Describe the impact of API extensions (mentioned in the proposal section, i.e. CRDs,
747- admission and conversion webhooks, aggregated API servers, finalizers) here in detail,
748- especially how they impact the OCP system architecture and operational aspects.
749-
750- - For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level
751- Indicators) an administrator or support can use to determine the health of the API extensions
752-
753- Examples (metrics, alerts, operator conditions)
754- - authentication-operator condition `APIServerDegraded=False`
755- - authentication-operator condition `APIServerAvailable=True`
756- - openshift-authentication/oauth-apiserver deployment and pods health
757-
758- - What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput,
759- API availability)
709+ The proposed `IPPool` CRD must be provisioned by the admin (or the source
710+ cluster introspection tool) before the VMs are migrated into OpenShift Virt;
711+ otherwise, they will lose the IP addresses they had on the source cluster.
760712
761- Examples:
762- - Adds 1s to every pod update in the system, slowing down pod scheduling by 5s on average.
763- - Fails creation of ConfigMap in the system when the webhook is not available.
764- - Adds a dependency on the SDN service network for all resources, risking API availability in case
765- of SDN issues.
766- - Expected use-cases require less than 1000 instances of the CRD, not impacting
767- general API throughput.
713+ The gateway for the network must be configured in the cluster UDN CR at
714+ creation time, like any other cluster UDN parameter.
768715
769- - How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or
770- automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review
771- this enhancement)
772-
773- - Describe the possible failure modes of the API extensions.
774- - Describe how a failure or behaviour of the extension will impact the overall cluster health
775- (e.g. which kube-controller-manager functionality will stop working), especially regarding
776- stability, availability, performance and security.
777- - Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes
778- and add them as reviewers to this enhancement.
716+ Hence, some planning and preparation are required from the admin before the
717+ VM owner starts importing VMs into the OpenShift Virt cluster via MTV.
779718
780719## Support Procedures
781720
782- Describe how to
783- - detect the failure modes in a support situation, describe possible symptoms (events, metrics,
784- alerts, which log output in which component)
785-
786- Examples:
787- - If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
788- - Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed".
789- - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
790- will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.
791-
792- - disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)
793-
794- - What consequences does it have on the cluster health?
795-
796- Examples:
797- - Garbage collection in kube-controller-manager will stop working.
798- - Quota will be wrongly computed.
799- - Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data.
800- Disabling the conversion webhook will break garbage collection.
801-
802- - What consequences does it have on existing, running workloads?
803-
804- Examples:
805- - New namespaces won't get the finalizer "xyz" and hence might leak resource X
806- when deleted.
807- - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
808- communication after some minutes.
809-
810- - What consequences does it have for newly created workloads?
811-
812- Examples:
813- - New pods in namespace with Istio support will not get sidecars injected, breaking
814- their networking.
815-
816- - Does functionality fail gracefully and will work resume when re-enabled without risking
817- consistency?
818-
819- Examples:
820- - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
821- will not block the creation or updates on objects when it fails. When the
822- webhook comes back online, there is a controller reconciling all objects, applying
823- labels that were not applied during admission webhook downtime.
824- - Namespaces deletion will not delete all objects in etcd, leading to zombie
825- objects when another namespace with the same name is created.
721+ TODO
826722
827724## Infrastructure Needed [optional]
828724
829- Use this section if you need things from the project. Examples include a new
830- subproject, repos requested, github details, and/or testing infrastructure.
725+ We'll need a virt-aware lane with CNV (and MTV) installed so we can e2e test
726+ the features.