Add proposal for Azure Service Operator #3113
|
Skipping CI for Draft Pull Request. |
mboersma left a comment
First draft is looking good! You thought of all the gotchas that I can think of.
|
|
||
| - Leverage existing e2e tests | ||
| - Add unit tests for new ASO integration | ||
| - Run one-off tests against large clusters to catch performance regressions |
We should talk about telemetry somewhere. Currently we have traces and metrics for every SDK call made in CAPZ (https://capz.sigs.k8s.io/developers/development.html#viewing-telemetry). If we move to ASO, we will lose that. @mattchr does ASO currently emit traces/metrics for SDK calls?
It looks like ASO exposes azure_successful_requests_total, azure_failed_requests_total, and azure_requests_time_seconds Prometheus metrics, but I don't see any OpenTelemetry integration.
We don't have any OpenTelemetry integration currently. We have Prometheus metrics for every SDK call made, but not traces. As I mentioned in my other comment, this is something we'd be open to improving, although I'm not sure how we'd get distributed tracing to work through CRs (so that you could have a top-level trace that spanned N ASO resource creations, for example).
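For reference, a rough sketch of how the three Prometheus metrics named above could be consumed without tracing: it sums the success and failure counters from a scraped text exposition and derives a failure ratio. The sample payload and the `resource` label are made up for illustration, and the parser is deliberately simplified (it assumes no `}` inside label values), so treat this as an assumption-laden example rather than how ASO or CAPZ actually processes metrics.

```python
import re

# One sample line of the Prometheus text exposition format:
# metric name, optional {label="value",...} set, then the value.
# Simplification: assumes label values never contain "}".
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def sum_metric(exposition: str, metric: str) -> float:
    """Sum every sample of `metric` across all label combinations."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == metric:
            total += float(match.group(2))
    return total


def failure_ratio(exposition: str) -> float:
    """Fraction of Azure requests that failed, or 0.0 if none were made."""
    ok = sum_metric(exposition, "azure_successful_requests_total")
    failed = sum_metric(exposition, "azure_failed_requests_total")
    total = ok + failed
    return failed / total if total else 0.0


# Hypothetical scrape output; label names and values are invented.
SAMPLE = """\
# TYPE azure_successful_requests_total counter
azure_successful_requests_total{resource="a"} 90
azure_successful_requests_total{resource="b"} 6
# TYPE azure_failed_requests_total counter
azure_failed_requests_total{resource="a"} 4
"""

if __name__ == "__main__":
    print(failure_ratio(SAMPLE))
```

This only gives aggregate numbers per scrape. It does not recover the per-reconciliation breakdown that CAPZ's traces provide today.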
|
First draft looks great to me as well, thank you for putting it together! |
nojnhuh left a comment
Thanks everyone for the feedback so far! I've addressed that for now in the form of bullet points and will start filling those sections out more.
|
|
||
| ### Graduation Criteria | ||
|
|
||
| ASO integration will not be kept behind a feature flag or matriculate through the usual alpha, beta, and stable phases. Instead, the transition will be made one Azure service interface at a time so as to distribute potential impact over time. |
| Azure or Kubernetes API limits with fewer or smaller workload clusters being managed. | ||
| - Management cluster will have to manage many more Kubernetes resources per | ||
| workload cluster | ||
| - Because ASO has not yet been proven as a mission-critical interface to Azure |
Well-phrased. I agree with this as a risk.
I think it makes a good bit of sense to make a shared bet. As you called out, ASO is solving the "2. Interfacing with the Azure platform to manage creating, updating, and deleting that infrastructure" problem, so it should end up reducing the work CAPZ has to do on that stuff, but this is a risk as obviously the Azure Go SDK has much broader adoption and is more mature (GA) than ASO is currently.
| used instead of the API or SDK directly | ||
| - Conflicting user installations of ASO or ASO resources | ||
| - Future breaking changes in ASO | ||
| - Lower-fidelity telemetry compared to what CAPZ tracks currently |
This is something we'd love to work with you guys on I think. We have some basic telemetry exposed already: https://azure.github.io/azure-service-operator/introduction/metrics/ - if you gave us a list of what exactly you wanted (or were losing in this migration) we could work to expose that data.
Or is the issue here more that you had integrations into the Azure SDK to track aggregate metrics such as "time it takes to fully provision a cluster" that you'd be losing?
For the little bit I've used CAPZ's tracing, I've found it helpful to have a breakdown of how long each step in a single CAPZ reconciliation takes. Since that includes Azure API calls currently, I think my main concern was losing that kind of association between a CAPZ reconciliation and Azure API calls. I updated this section to mention that I don't think that would really matter though since Azure API calls would be happening in ASO completely out-of-band with CAPZ reconciliations. Or at least recreating that mapping seems like it would be unnecessarily difficult.
There have been discussions about tracking resource lifecycle and some related KEP work: https://groups.google.com/g/kubebuilder/c/tNI6ZpQ2loM/m/8rSX6HKVDgAJ. Correlation is going to be difficult. However, we might be able to trace with observed generation and namespace/name to get something close enough.
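To make the "observed generation and namespace/name" idea concrete, here is a minimal sketch that derives a deterministic correlation ID from those three fields, so that two controllers looking at the same version of the same resource would compute the same ID independently. The function name and the key layout are hypothetical, and nothing in CAPZ or ASO is known to do this. The 32-hex-character width is chosen only because it matches the size of a W3C Trace Context trace ID.

```python
import hashlib


def correlation_id(namespace: str, name: str, generation: int) -> str:
    """Derive a stable 128-bit hex ID from a resource's identity and generation.

    Hypothetical helper: the "namespace/name@generation" key layout is an
    assumption made for this example only.
    """
    key = f"{namespace}/{name}@{generation}"
    # Truncate the SHA-256 digest to 32 hex characters (16 bytes).
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


if __name__ == "__main__":
    # The same resource at the same generation always maps to the same ID,
    # while a spec change (new generation) yields a different one.
    print(correlation_id("default", "my-vnet", 3))
    print(correlation_id("default", "my-vnet", 4))
```

This gets "close enough" correlation at best: it ties activity to a resource version, not to a specific CAPZ reconciliation.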
|
|
||
| CAPZ interacts with some Azure services that do not represent infrastructure, and thus cannot be represented in ASO. Resource Health, for example, is "reconciled" by CAPZ currently by getting a resource's health status and reflecting that in the corresponding CAPZ resource, but does not create or update any distinct Azure resources. The new SDK could be used to implement this existing functionality without affecting other service interfaces' use of ASO. Implementing Resource Health in ASO is being tracked in https://github.com/Azure/azure-service-operator/issues/2762. | ||
|
|
||
| Also, use of the `clusterctl move` command will require extra manual steps to move ASO resources as documented here: https://azure.github.io/azure-service-operator/introduction/frequently-asked-questions/#what-is-the-best-practice-for-transferring-aso-resources-from-one-cluster-to-another. Specifically, before `clusterctl move` is run, each ASO resource under the ownership hierarchy of a Cluster must have its `serviceoperator.azure.com/reconcile-policy` annotation set to `skip`. The necessary ASO resources can be enumerated by invoking `clusterctl move --dry-run -v 1`. `clusterctl move` will automatically detect and move the ASO resources. Then after `clusterctl move` is complete, the annotation should be changed back to its previous state. |
"Specifically, before `clusterctl move` is run, each ASO resource under the ownership hierarchy of a Cluster must have its `serviceoperator.azure.com/reconcile-policy` annotation set to `skip`"
that's not a great experience for users. They shouldn't have to care or even know about ASO as it's an implementation detail of CAPZ and not something they opt into. I think it's okay for a user to have to apply the annotation in the context where they are directly using ASO, but in the case where the CAPZ controller is the one "using" ASO to provision resources, the CAPZ controller should be the one applying these annotations. This might be tricky and might require some changes to clusterctl move but we should really try to avoid manual intervention from the user.
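As a sketch of what automating that step might involve, the snippet below sets the `serviceoperator.azure.com/reconcile-policy` annotation to `skip` on a set of resources, remembers each previous value, and restores it afterwards. It works on plain in-memory dictionaries shaped like Kubernetes manifests. The function names are hypothetical, the example resources and the prior value `manage` are illustrative, and a real controller would have to patch the objects through the Kubernetes API around the move.

```python
RECONCILE_POLICY = "serviceoperator.azure.com/reconcile-policy"


def _key(resource: dict) -> tuple:
    """Identify a resource by kind, namespace, and name."""
    meta = resource["metadata"]
    return (resource["kind"], meta.get("namespace", ""), meta["name"])


def pause_reconciliation(resources: list) -> dict:
    """Set reconcile-policy to "skip" and return the previous values.

    A previous value of None means the annotation was absent.
    """
    previous = {}
    for resource in resources:
        annotations = resource["metadata"].setdefault("annotations", {})
        previous[_key(resource)] = annotations.get(RECONCILE_POLICY)
        annotations[RECONCILE_POLICY] = "skip"
    return previous


def restore_reconciliation(resources: list, previous: dict) -> None:
    """Put each resource's reconcile-policy back to its recorded state."""
    for resource in resources:
        key = _key(resource)
        if key not in previous:
            continue  # not paused by us; leave untouched
        annotations = resource["metadata"].setdefault("annotations", {})
        if previous[key] is None:
            annotations.pop(RECONCILE_POLICY, None)
        else:
            annotations[RECONCILE_POLICY] = previous[key]


if __name__ == "__main__":
    # Illustrative resources only; names and namespaces are made up.
    resources = [
        {"kind": "ResourceGroup",
         "metadata": {"name": "rg", "namespace": "default"}},
        {"kind": "VirtualNetwork",
         "metadata": {"name": "vnet", "namespace": "default",
                      "annotations": {RECONCILE_POLICY: "manage"}}},
    ]
    saved = pause_reconciliation(resources)
    # ... `clusterctl move` would run here ...
    restore_reconciliation(resources, saved)
    print(resources)
```

Recording the previous value matters because the proposal text says the annotation should be changed back to its previous state, which is not necessarily the default.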
Codecov Report
Patch coverage has no change and project coverage change:
@@             Coverage Diff             @@
##             main    #3113       +/-   ##
===========================================
+ Coverage   40.42%   51.50%   +11.07%
===========================================
  Files         241      182       -59
  Lines       29560    18054    -11506
===========================================
- Hits        11951     9298     -2653
+ Misses      16700     8229     -8471
+ Partials      909      527      -382
see 109 files with indirect coverage changes |
|
I just pushed a couple small changes adding updates on cc @dtzar |
|
LGTM label has been added. Git tree hash: 0e3340eac8bb9c27333df20f45e2318541b27837 |
|
Officially starting lazy consensus on this, ending EOD 14 April (end of next week). |
LGTM.
I think this summarizes the pros/cons of using ASO quite well.
I will leave the actual decision of whether the pros outweigh the cons to you experts, as I don't have great visibility into the costs/benefits for CAPZ as a project when comparing ASO to something like the track2 SDKs.
|
/lgtm I can’t add any more than what many others have said before me in these PR threads. Great work @nojnhuh! |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: CecileRobertMichon |
|
Great work! Kudos @nojnhuh! 🚀 |
|
Time for slash hold cancel? 🤠 |
|
/pony |
What type of PR is this?
/kind design
What this PR does / why we need it: This PR adds a proposal suggesting the adoption of Azure Service Operator in CAPZ to manage infrastructure in Azure instead of the Azure SDK.