KEP-4136: Admission Fair Sharing #4252
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,255 @@ | ||||||||||||||
| # KEP-4136: Admission Fair Sharing | ||||||||||||||
|
|
||||||||||||||
| <!-- toc --> | ||||||||||||||
| - [Summary](#summary) | ||||||||||||||
| - [Motivation](#motivation) | ||||||||||||||
| - [Goals](#goals) | ||||||||||||||
| - [Non-Goals](#non-goals) | ||||||||||||||
| - [Proposal](#proposal) | ||||||||||||||
| - [User Stories (Optional)](#user-stories-optional) | ||||||||||||||
| - [Story 1](#story-1) | ||||||||||||||
| - [Story 2](#story-2) | ||||||||||||||
| - [Risks and Mitigations](#risks-and-mitigations) | ||||||||||||||
| - [Design Details](#design-details) | ||||||||||||||
| - [Test Plan](#test-plan) | ||||||||||||||
| - [Prerequisite testing updates](#prerequisite-testing-updates) | ||||||||||||||
| - [Unit Tests](#unit-tests) | ||||||||||||||
| - [Integration tests](#integration-tests) | ||||||||||||||
| - [Graduation Criteria](#graduation-criteria) | ||||||||||||||
| - [Drawbacks](#drawbacks) | ||||||||||||||
| - [Alternatives](#alternatives) | ||||||||||||||
| <!-- /toc --> | ||||||||||||||
|
|
||||||||||||||
| ## Summary | ||||||||||||||
|
|
||||||||||||||
| This KEP describes a mechanism for fair admission of workloads coming from a group of | ||||||||||||||
| different sources (like Cluster and Local Queues) based on each source's usage of shared resources. | ||||||||||||||
| Workloads from sources that use less are admitted before workloads from sources that use more. | ||||||||||||||
|
|
||||||||||||||
| ## Motivation | ||||||||||||||
|
|
||||||||||||||
| Currently Kueue has a Fair Sharing mechanism that enforces fair-sharing of unused resources | ||||||||||||||
| via preemption. If one Cluster Queue is using much more of the resources than the other one | ||||||||||||||
| that is in need, some workloads from the first one may be preempted to allow a more “fair” distribution. | ||||||||||||||
|
|
||||||||||||||
| This model has multiple assumptions: | ||||||||||||||
|
|
||||||||||||||
| * Workloads can be preempted. | ||||||||||||||
| * Users get a proper quota to do their regular business. | ||||||||||||||
| * Users come from multiple teams/organizations and they would rather have a strict but | ||||||||||||||
| fair policy than show understanding toward those consuming all shared resources. | ||||||||||||||
|
|
||||||||||||||
| However, these assumptions are not universal. Sometimes: | ||||||||||||||
|
|
||||||||||||||
| * Workloads should not be preempted. | ||||||||||||||
| * Users operate on a shared but bigger quota. | ||||||||||||||
| * Fairness should not win over getting the workloads eventually completed. | ||||||||||||||
|
|
||||||||||||||
| In those cases the existing model doesn’t work and a different one needs to be employed. | ||||||||||||||
|
|
||||||||||||||
| ### Goals | ||||||||||||||
|
|
||||||||||||||
| * Establish a method for how shared resource usage is calculated and recorded and how users can fine tune the mechanism. | ||||||||||||||
| * Allow specifying a fair admission scope at either the individual Cluster Queue or the Cohort level. | ||||||||||||||
| * Allow specifying the relative importance of LocalQueues targeting the same ClusterQueue. | ||||||||||||||
| * Amend the admission mechanism to work on admission scopes instead of only on ClusterQueues. | ||||||||||||||
| * Select the appropriate admission candidates for each of the admission scopes and admit them according to the selected queueing policy. | ||||||||||||||
| * Make the new mechanism complementary to the existing preemption-based fair sharing | ||||||||||||||
|
||||||||||||||
|
|
||||||||||||||
| ### Non-Goals | ||||||||||||||
|
|
||||||||||||||
| * Store time series data inside K8S. | ||||||||||||||
| * Provide precise shared resource usage accounting or billing. | ||||||||||||||
|
|
||||||||||||||
| ## Proposal | ||||||||||||||
|
||||||||||||||
|
|
||||||||||||||
| * Modify CQ’s FairSharing struct with | ||||||||||||||
|
|
||||||||||||||
| ```go | ||||||||||||||
| type FairSharing struct { | ||||||||||||||
| // Weight denotes how important the given queue is when competing against other queues | ||||||||||||||
| // for unused shared resources. The exact impact of the weight in fair share calculations | ||||||||||||||
| // depends on the fair share algorithm used. Default = 1. | ||||||||||||||
| Weight *resource.Quantity `json:"weight,omitempty"` | ||||||||||||||
| } | ||||||||||||||
| ``` | ||||||||||||||
|
Comment on lines +66 to +75
Member: IIUC, we already have this field kueue/apis/kueue/v1beta1/clusterqueue_types.go Lines 128 to 132 in 636d57a
Contributor (Author): Yes, there is. I change the meaning a bit to be more generic.
Member: I see. We will change only the meaning, and not change the API's looking. Thanks. |
||||||||||||||
|
|
||||||||||||||
| * Expand LQ spec with the same FairSharing struct (Cohort will be expanded with FairSharing | ||||||||||||||
| as a part of the Preemptive FS implementation in hierarchical structure). | ||||||||||||||
|
|
||||||||||||||
| * Modify FairSharingStatus struct with | ||||||||||||||
|
|
||||||||||||||
| ```go | ||||||||||||||
| type FairSharingStatus struct { | ||||||||||||||
| // WeightedShare represents the usage above nominal quota, with the weight applied. | ||||||||||||||
| // The bigger the value is, the more shared resources have been allocated and the less | ||||||||||||||
| // entitled the queue is to more shared resources. | ||||||||||||||
| // The exact details and the interpretation of the value depend on | ||||||||||||||
| // the fair sharing algorithm used. | ||||||||||||||
| WeightedShare int64 `json:"weightedShare"` | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| // ConsumedResources represents the aggregated usage of resources over time, | ||||||||||||||
| // with a decaying function applied. | ||||||||||||||
| // The value is populated if usage consumption functionality is enabled in Kueue config. | ||||||||||||||
| ConsumedResources map[corev1.ResourceName]resource.Quantity `json:"consumedResources,omitempty"` | ||||||||||||||
|
||||||||||||||
|
|
||||||||||||||
|
||||||||||||||
| // LastUpdate is the time when share and consumed resources were updated. | ||||||||||||||
| LastUpdate metav1.Time `json:"lastUpdate,omitempty"` | ||||||||||||||
|
Comment on lines +97 to +98
Member: Why not use conditions on each resource instead of this dedicated
Contributor (Author): Conditions have last transition, not last update. Here there is no transition, just updates.
Member: I see. You mean this is the time when kueue takes a snapshot of consumed usage. WDYT?
Contributor (Author): Personally I see |
||||||||||||||
| } | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| * Add FairSharingStatus to LocalQueue (and Cohort). | ||||||||||||||
|
|
||||||||||||||
| * Create a new struct AdmissionScope and make it an optional field for CQ and Cohort Spec. If | ||||||||||||||
|
||||||||||||||
| not provided, the CQ or Cohort is not considered an AdmissionScope and is not subject to the new | ||||||||||||||
| admission logic. If there are two AdmissionScopes on the path from CQ/Cohort to the top of | ||||||||||||||
|
Contributor: Why? Let's make this invalid state, as the lower scope is doing nothing?
Contributor (Author): Because someone may be changing the scope. Scope change is not atomic and we cannot block the entire hierarchy in the meantime. |
||||||||||||||
| the hierarchy tree, the higher one is used. | ||||||||||||||
|
|
||||||||||||||
| ```go | ||||||||||||||
| // AdmissionMode selects the admission ordering used within an AdmissionScope. | ||||||||||||||
| type AdmissionMode string | ||||||||||||||
|
|
||||||||||||||
| const ( | ||||||||||||||
| // FairSharing based on usage, with QueuingStrategy as defined in CQ. | ||||||||||||||
| UsageBasedAdmissionFairSharing AdmissionMode = "UsageBasedFairSharing" | ||||||||||||||
|
|
||||||||||||||
| // No admission-time fair sharing is applied. | ||||||||||||||
| NoAdmissionFairSharing AdmissionMode = "NoAdmissionFairSharing" | ||||||||||||||
| ) | ||||||||||||||
|
|
||||||||||||||
| type AdmissionScope struct { | ||||||||||||||
| AdmissionMode AdmissionMode | ||||||||||||||
| } | ||||||||||||||
| ``` | ||||||||||||||
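The rule that the higher of two AdmissionScopes on a path wins could be sketched roughly as follows. This is only an illustration: `Node` and `EffectiveScopeNode` are hypothetical names that do not exist in Kueue, and the sketch assumes that any node with a non-nil scope counts as an AdmissionScope regardless of its mode.

```go
package main

import "fmt"

// AdmissionMode mirrors the mode values listed in the KEP.
type AdmissionMode string

const (
	UsageBasedAdmissionFairSharing AdmissionMode = "UsageBasedFairSharing"
	NoAdmissionFairSharing         AdmissionMode = "NoAdmissionFairSharing"
)

// AdmissionScope is optional on a node; a nil pointer means "not a scope".
type AdmissionScope struct {
	AdmissionMode AdmissionMode
}

// Node is a hypothetical stand-in for a ClusterQueue or Cohort in the hierarchy.
type Node struct {
	Name   string
	Parent *Node
	Scope  *AdmissionScope
}

// EffectiveScopeNode walks from n up to the root and returns the highest node
// that carries an AdmissionScope, or nil if no node on the path has one.
func EffectiveScopeNode(n *Node) *Node {
	var highest *Node
	for cur := n; cur != nil; cur = cur.Parent {
		if cur.Scope != nil {
			highest = cur
		}
	}
	return highest
}

func main() {
	root := &Node{Name: "root-cohort", Scope: &AdmissionScope{AdmissionMode: UsageBasedAdmissionFairSharing}}
	cq := &Node{Name: "cq-a", Parent: root, Scope: &AdmissionScope{AdmissionMode: UsageBasedAdmissionFairSharing}}
	// Both nodes declare a scope; the one closer to the root is used.
	fmt.Println(EffectiveScopeNode(cq).Name)
}
```

Resolving the scope on every lookup, rather than rejecting nested scopes, matches the author's point in the review thread that scope changes are not atomic.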
|
|
||||||||||||||
| * When selecting candidates for admission, group all workloads from LQ up to the CQ or the | ||||||||||||||
| topmost Cohort that is marked with AdmissionScope. Then sort them using the following criteria: | ||||||||||||||
|
|
||||||||||||||
| - Usage vector built from ConsumedResources from TopCohort to LQ | ||||||||||||||
|
||||||||||||||
| - Priority | ||||||||||||||
| - Timestamp | ||||||||||||||
|
|
||||||||||||||
|
||||||||||||||
| The usage vector will be the sum of the `ConsumedResources`, weighted according to `resourceWeights` | ||||||||||||||
| (described later in this KEP). | ||||||||||||||
|
|
||||||||||||||
| Let’s look at a couple of scenarios: | ||||||||||||||
|
|
||||||||||||||
| 1. AdmissionScope at CQ, CQ queueing policy is FIFO, 3 LQs pointing to CQ. Kueue considers | ||||||||||||||
| all CQ resources and potentially borrowed resources as “shared” resources and fair sharing | ||||||||||||||
| is applied to all workloads. | ||||||||||||||
| Kueue sorts the workloads by their LQ usage (if mode is `UsageBasedFairSharing`), priority and | ||||||||||||||
| timestamp and tries to admit the first one from the list. Other workloads are not attempted | ||||||||||||||
| until the first one is admitted. | ||||||||||||||
|
|
||||||||||||||
| 2. AdmissionScope at CQ, CQ queueing policy is BestEffort, 3 LQ pointing to CQ. | ||||||||||||||
| Kueue considers all CQ resources and potentially borrowed resources as “shared” resources and | ||||||||||||||
| fair sharing is applied to all workloads. | ||||||||||||||
|
|
||||||||||||||
| Kueue sorts the workloads by their LQ usage (if mode is `UsageBasedFairSharing`), priority and | ||||||||||||||
| timestamp and tries to admit the first one from the list. If that fails and the second, third | ||||||||||||||
| or a later workload can be admitted, then that workload is admitted, on the condition that it might get preempted. | ||||||||||||||
|
||||||||||||||
|
|
||||||||||||||
| 3. AdmissionScope at Cohort level - Kueue operates in a mixed mode. Inside CQ workloads are | ||||||||||||||
| selected according to their AdmissionMode (if specified). If a workload fits entirely into | ||||||||||||||
|
Contributor: Only highest
Contributor (Author): The candidates inside individual CQs are selected based on the specified logic and bubbled up. |
||||||||||||||
| nominal quota, then it is admitted immediately; if not, it goes into cohort-level fair sharing. | ||||||||||||||
| For the Cohort we select all the “sticking out” workloads (workload + current_usage > resources) | ||||||||||||||
| and sort them by their CQ usage, priority and timestamp. Kueue attempts to admit the first | ||||||||||||||
| workload from that list, just as if it were one big strict FIFO queue. | ||||||||||||||
| For a multi-level hierarchy under one AdmissionScope we would treat underlying Cohorts as FIFO CQs. | ||||||||||||||
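The difference between scenarios 1 and 2 comes down to how far down the sorted list Kueue is willing to look. The sketch below illustrates this with a single scalar resource. `Workload`, `PickStrictFIFO`, and `PickBestEffort` are hypothetical names, and the input list is assumed to be already sorted by usage, priority and timestamp.

```go
package main

import "fmt"

// Workload is a hypothetical pending workload with a single-resource request.
type Workload struct {
	Name    string
	Request int64
}

// PickStrictFIFO only ever considers the head of the sorted list: if the head
// does not fit, nothing is admitted in this cycle.
func PickStrictFIFO(sorted []Workload, free int64) (Workload, bool) {
	if len(sorted) > 0 && sorted[0].Request <= free {
		return sorted[0], true
	}
	return Workload{}, false
}

// PickBestEffort returns the first workload in sorted order that fits,
// skipping over earlier ones that do not.
func PickBestEffort(sorted []Workload, free int64) (Workload, bool) {
	for _, w := range sorted {
		if w.Request <= free {
			return w, true
		}
	}
	return Workload{}, false
}

func main() {
	sorted := []Workload{{Name: "big", Request: 8}, {Name: "small", Request: 2}}
	free := int64(4)

	_, ok := PickStrictFIFO(sorted, free)
	fmt.Println("strict FIFO admits something:", ok)

	w, ok := PickBestEffort(sorted, free)
	fmt.Println("best effort admits:", w.Name, ok)
}
```

With 4 free units, the strict variant admits nothing because `big` blocks the head, while the best-effort variant admits `small`.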
|
|
||||||||||||||
|
|
||||||||||||||
| * Additionally, there will be a resource usage calculation loop. The frequency of the | ||||||||||||||
| calculation will be controlled globally in Kueue’s config file. Accounting would be done | ||||||||||||||
| using something like an exponentially weighted moving average: | ||||||||||||||
|
|
||||||||||||||
| usage_sum = (1-A) * previous_usage_sum + A * current_usage. | ||||||||||||||
|
|
||||||||||||||
| The value will be stored in FairSharingStatus for all LQs, CQs, and Cohorts. The value will not be zeroed | ||||||||||||||
| after a Kueue restart or a brief period of downtime. However, if the downtime is longer, | ||||||||||||||
| the value should be automatically zeroed. | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| The user will be able to configure the decaying factor A in Kueue’s config file by specifying | ||||||||||||||
| the half-life decay time, i.e. the time after which the current shared usage decays to half of its original value. | ||||||||||||||
|
|
||||||||||||||
| A = 1 - 0.5 ^ (sampling/half_life_decay) | ||||||||||||||
|
|
||||||||||||||
| * Configuration will sit in the FairSharing struct in Kueue config. The following fields will be added: | ||||||||||||||
|
|
||||||||||||||
|
||||||||||||||
| - usageHalfLifeDecayTime - half-life decay time of usage, as described above. | ||||||||||||||
| - usageSamplingInterval - how often usage is calculated. | ||||||||||||||
| - resourceWeights - how much the consumption of each individual resource matters when comparing usage. | ||||||||||||||
| - resetInactivityPeriod - if Kueue has not updated the value for this period then the value should be zeroed. | ||||||||||||||
|
|
||||||||||||||
| If the user doesn't want any preemptions while fair sharing, preemptionStrategies should be left empty. | ||||||||||||||
|
||||||||||||||
|
|
||||||||||||||
| * If preemptionStrategies is non-empty, Kueue attempts to combine the two fair sharing mechanisms. | ||||||||||||||
| For each of the Admission Scopes Kueue selects one workload to be attempted. These pre-selected | ||||||||||||||
| workloads are then sorted based on their preemption-based fair share value. If some of them don't fit, | ||||||||||||||
| fair sharing preemption may be executed. So admission-based fair sharing only reshuffles workloads | ||||||||||||||
| within an AdmissionScope and then other mechanisms are applied as usual. | ||||||||||||||
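The second stage of the combined mode, ordering the one workload pre-selected per Admission Scope, might look like the sketch below. `Head` and `OrderHeads` are hypothetical names, and sorting lower share values first is an assumption based on the `WeightedShare` comment that a bigger value means less entitlement.

```go
package main

import (
	"fmt"
	"sort"
)

// Head is the single workload pre-selected by one admission scope.
type Head struct {
	Scope    string
	Workload string
	// ShareValue stands in for the preemption-based fair share value.
	ShareValue int64
}

// OrderHeads returns a copy of heads sorted by ascending share value, so the
// scope that currently holds the least shared capacity is attempted first.
func OrderHeads(heads []Head) []Head {
	out := append([]Head(nil), heads...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].ShareValue < out[j].ShareValue
	})
	return out
}

func main() {
	heads := []Head{
		{Scope: "cq-a", Workload: "wl-1", ShareValue: 300},
		{Scope: "cq-b", Workload: "wl-7", ShareValue: 50},
	}
	for _, h := range OrderHeads(heads) {
		fmt.Println(h.Scope, h.Workload, h.ShareValue)
	}
}
```

Admission-time fair sharing decides which workload each scope puts forward; this ordering step and any resulting preemption are left to the existing mechanism.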
|
|
||||||||||||||
| ### User Stories (Optional) | ||||||||||||||
| #### Story 1 | ||||||||||||||
|
|
||||||||||||||
| I have multiple users using the same ClusterQueue. Each has its own namespace and LocalQueue | ||||||||||||||
| through which they submit workloads. I want to fairly admit their workloads so that one active | ||||||||||||||
| user doesn’t block the cluster too much. | ||||||||||||||
|
|
||||||||||||||
| #### Story 2 | ||||||||||||||
|
|
||||||||||||||
| I have multiple teams that may be sending workloads of various sizes. I want to give each team | ||||||||||||||
| some guaranteed capacity and at the same time, allow them to fairly share some bigger pool of resources. | ||||||||||||||
|
|
||||||||||||||
| ### Risks and Mitigations | ||||||||||||||
|
|
||||||||||||||
| * Having two fair sharing mechanisms may cause confusion between preemption-based fair sharing and admission-time fair sharing. | ||||||||||||||
|
|
||||||||||||||
| * Increased complexity of the project. | ||||||||||||||
|
|
||||||||||||||
| ## Design Details | ||||||||||||||
|
|
||||||||||||||
| Covered in Proposal. | ||||||||||||||
|
|
||||||||||||||
| ### Test Plan | ||||||||||||||
| [x] I/we understand the owners of the involved components may require updates to | ||||||||||||||
| existing tests to make this code solid enough prior to committing the changes necessary | ||||||||||||||
| to implement this enhancement. | ||||||||||||||
| #### Prerequisite testing updates | ||||||||||||||
|
|
||||||||||||||
| #### Unit Tests | ||||||||||||||
| The code will be thoroughly covered with unit tests. | ||||||||||||||
|
|
||||||||||||||
| In particular: | ||||||||||||||
| * CQ level fair sharing between LQ for both strict FIFO and best effort. | ||||||||||||||
| * Cohort with owned resources, and CQ with guaranteed quota. | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| #### Integration tests | ||||||||||||||
|
|
||||||||||||||
| They will mainly focus on larger scope scheduling (involving multiple cohorts/cqs) and | ||||||||||||||
| interactions with preemptive fair sharing outside admission scope. | ||||||||||||||
|
|
||||||||||||||
| ### Graduation Criteria | ||||||||||||||
|
|
||||||||||||||
| The implementation will be split into 2 subfeatures: | ||||||||||||||
|
|
||||||||||||||
| * CQ + LQ level support | ||||||||||||||
| * Multi-level Cohort+CQ+LQ support | ||||||||||||||
|
|
||||||||||||||
| Obviously, the second depends on the first to some extent. The first however may reach | ||||||||||||||
| Beta/GA without starting the second. | ||||||||||||||
|
|
||||||||||||||
| The graduation criteria are quite standard: | ||||||||||||||
|
|
||||||||||||||
| * Beta - positive feedback from Alpha, api seems reasonable. | ||||||||||||||
| * GA - positive feedback, no bugs, no api changes needed. | ||||||||||||||
|
|
||||||||||||||
| We hope to have CQ+LQ in alpha for the next Kueue release (0.12). | ||||||||||||||
|
|
||||||||||||||
| ## Drawbacks | ||||||||||||||
| * Adds additional complexity to the system. | ||||||||||||||
| * Creates yet another fair sharing mechanism. | ||||||||||||||
|
|
||||||||||||||
| ## Alternatives | ||||||||||||||
|
|
||||||||||||||
| * Not having the feature. | ||||||||||||||
| * Modifying/replacing the existing preemptive fair sharing algorithm. | ||||||||||||||
|
|
||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| title: Admission fair sharing | ||
| kep-number: 4136 | ||
| authors: | ||
| - "@mwielgus" | ||
| status: draft | ||
| creation-date: 2025-02-03 | ||
| reviewers: | ||
| - "@mimowo" | ||
|
||
| - "@gabesaba" | ||
| - "@pbundyra" | ||
| approvers: | ||
| - "@mimowo" | ||
| - "@tenzen-y" | ||
| see-also: | ||
| - "KEP-1714" | ||
|
|
||
| # The target maturity stage in the current dev cycle for this KEP. | ||
| stage: alpha | ||
|
|
||
| # The most recent milestone for which work toward delivery of this KEP has been | ||
| # done. This can be the current (upcoming) milestone, if it is being actively | ||
| # worked on. | ||
| latest-milestone: "v0.12" | ||
|
|
||
| # The milestone at which this feature was, or is targeted to be, at each stage. | ||
| milestone: | ||
| alpha: "v0.12" | ||
|
|
||
| # The following PRR answers are required at alpha release | ||
| # List the feature gate name and the components for which it must be enabled | ||
| disable-supported: true | ||
|
|
||
| # The following PRR answers are required at beta release | ||
| # metrics: | ||
| # - my_feature_metric | ||
|
|
||