sig-api-machinery: Add scale targets to CRDs to GA KEP #1015
Conversation
cc @wojtek-t @kubernetes/sig-scalability-misc
The feedback from @wojtek-t was helpful. I think the approach should be:
I'd still like to see real-world feedback inform those max instance size, count-per-namespace, and count-per-cluster numbers. In the absence of that, setting up initial scale targets along the lines of O(64KB), O(1000) per namespace, O(10000) per cluster seems reasonable in a hand-wavy way, given other count targets in https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
I've rewritten this section to complement https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md and https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md. The actual thresholds and max resource size are not finalized, but I've penciled in "O(10KB), O(2000) per namespace, O(10000) per cluster" and we can substitute in better values once we've nailed them down. I agree real-world feedback would be valuable; maybe there is a group we could survey? I'll ask around.
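For a rough sense of what those penciled-in numbers imply (back-of-envelope only; the limits above are not final), the worst-case payload of a full LIST at the suggested maximums works out to:

```latex
% Back-of-envelope arithmetic using the penciled-in (non-final) targets above:
% O(10KB) per object, O(2000) per namespace, O(10000) per cluster.
\begin{align*}
  \text{namespace-scoped LIST:} &\quad 2000  \times 10\,\mathrm{KB} \approx 20\,\mathrm{MB} \\
  \text{cluster-scoped LIST:}   &\quad 10000 \times 10\,\mathrm{KB} \approx 100\,\mathrm{MB}
\end{align*}
```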
force-pushed from 0620d46 to edb4659
Some known CRD scale factors:
That's a good dimension (number of distinct custom resource types) we forgot to represent. The way the custom resource server is structured, each custom resource handler is independent, so this is no problem for the server to support, but it is good to explicitly create/exercise O(100s) of CRDs in parallel in a test; see the sketch below. The main impact I see for that would actually be on clients that pull down discovery docs for all group/version/resources, but that's not really specific to CRDs.
Yeah, anything that is O(N) in the number of CRDs should be caught quickly, as should anything with a large constant factor per CRD (like openapi).
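A minimal sketch (not part of the KEP or its test plan) of what "create/exercise O(100s) of CRDs in parallel" could look like using the apiextensions client; the group name, kinds, and kubeconfig loading are illustrative assumptions:

```go
// Sketch only: creates many CRDs concurrently to exercise per-CRD costs
// (handlers, discovery, openapi). Names and kubeconfig path are made up.
package main

import (
	"context"
	"fmt"
	"sync"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := apiextensionsclient.NewForConfigOrDie(cfg)

	const numCRDs = 200 // O(100s) of distinct custom resource types
	group := "scaletest.example.com"

	var wg sync.WaitGroup
	for i := 0; i < numCRDs; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			plural := fmt.Sprintf("widgets%d", i)
			crd := &apiextensionsv1.CustomResourceDefinition{
				ObjectMeta: metav1.ObjectMeta{Name: plural + "." + group},
				Spec: apiextensionsv1.CustomResourceDefinitionSpec{
					Group: group,
					Scope: apiextensionsv1.NamespaceScoped,
					Names: apiextensionsv1.CustomResourceDefinitionNames{
						Plural:   plural,
						Singular: fmt.Sprintf("widget%d", i),
						Kind:     fmt.Sprintf("Widget%d", i),
						ListKind: fmt.Sprintf("Widget%dList", i),
					},
					Versions: []apiextensionsv1.CustomResourceDefinitionVersion{{
						Name:    "v1",
						Served:  true,
						Storage: true,
						// A minimal structural schema so the v1 API accepts the CRD.
						Schema: &apiextensionsv1.CustomResourceValidation{
							OpenAPIV3Schema: &apiextensionsv1.JSONSchemaProps{Type: "object"},
						},
					}},
				},
			}
			if _, err := client.ApiextensionsV1().CustomResourceDefinitions().Create(
				context.TODO(), crd, metav1.CreateOptions{}); err != nil {
				fmt.Printf("create %s: %v\n", crd.Name, err)
			}
		}(i)
	}
	wg.Wait()
}
```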
For this cluster (reasonably representative of high CRD use):
Most of those CRDs have openapi specs with field values for explain and validation. I think it could get higher given that some may not have all the docs they need defined yet.
force-pushed from edb4659 to e191d18
At a high level, it looks fine. I would like to think more about the details, but in the worst case I believe we could change them later too (I'm a bit overloaded now, and I don't want to block this).
force-pushed from 2ca9f99 to e618668
/lgtm
can be found by taking the (median) object size of the custom resource and finding
the matching row in this table:

| Object size | Suggested Maximum Limit: scope=namespace (5s p99 SLO) | Suggested Maximum Limit: scope=cluster (30s p99 SLO) |
Why does the cluster scope have a longer SLO?
This is to be consistent with how the SLOs for native types are defined (https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md#api-call-latency-slisslos-details). @wojtek-t, do you know the background?
It's connected with the number of objects - listing objects is basically some constant time + something proportional to the number of objects processed. And within a single namespace (scope=namespace) we officially allow a much smaller number of objects.
force-pushed from e618668 to eda69e7
The above object size and suggested maximum limits in the Custom Resources per
Definition table apply to these conversion webhook SLOs. For example, for a
list request for 1500 custom resource objects that are 10k in size, the resource
Is the table trying to say "return a single object in 50ms, up to 1500 10k objects in 1s, 10000 10k objects in 6 seconds"?
If so, that mathematically doesn't make much sense; the latter implies that a single object must be processed in ~0.6ms?
I've inlined the object count details in the webhook SLOs to make it easier to understand. The 50ms for a single object is higher to account for a request serving constant factor. I've added a note.
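To make that "constant factor" point concrete (illustrative arithmetic only, using the figures quoted in the question above rather than final KEP values), model list latency as a fixed per-request cost plus a per-object cost, and the three data points become roughly consistent:

```latex
% Illustrative affine model; 50ms / 1s / 6s are the figures quoted above, not final SLOs.
\begin{align*}
  \text{latency}(N) &\approx C + N \cdot t \\
  C &\approx 50\,\mathrm{ms} \quad (\text{single-object request, dominated by the constant factor}) \\
  t &\approx \frac{1\,\mathrm{s} - 50\,\mathrm{ms}}{1500} \approx 0.63\,\mathrm{ms/object} \\
  \text{latency}(10000) &\approx 50\,\mathrm{ms} + 10000 \times 0.63\,\mathrm{ms/object} \approx 6.4\,\mathrm{s}
\end{align*}
```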
force-pushed from eda69e7 to 0582af5
force-pushed from 0582af5 to 762a118
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, lavalamp, liggitt

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
LGTM
This proposes that our scale targets for CRDs going to GA be based on the API call latency SLIs/SLOs and Kubernetes thresholds.
For details on scalability data gathered, see: https://docs.google.com/document/d/1tEstPQvzGvaRnN-WwGUWx1H9xHPRCy_fFcGlgTkB3f8/edit#