Commit b57fb8a

Merge remote-tracking branch 'upstream/main'
2 parents d5530dc + 4c319af commit b57fb8a

File tree

6 files changed: +227 additions, −70 deletions


docs/proposals/0845-scheduler-architecture-proposal/README.md

Lines changed: 34 additions & 18 deletions
```diff
@@ -9,18 +9,27 @@ The Scheduling Subsystem is a framework used to implement scheduling algorithms.
 
 ## Design Principles
 - The scheduler framework should act as an independent library; there should be no dependency on EPP packages defined outside of the scheduler
-- The *framework* should be agnostic to web protocols (such as HTTP), endpoint types (such as model servers), and K8s concepts.
+- The *framework* should be agnostic to endpoint types (such as model servers) and K8s concepts.
 - Opinions should be held by the plugins, not the framework
 - The entry & exit points should be defined by the framework, acting as the API surface of the system
 - Multiple scheduling 'profiles' should be able to be run for a single request.
   - They can be conditionally dependent on previous runs, or run in parallel
-- Plugin state is managed by the plugin itself
+- State management
+  - State per request: managed by what we are calling CycleState, whose lifecycle is tied to the request.
+    CycleState is created internally by the Scheduler per request and its pointer is passed as an argument.
+  - State managed by the plugin struct itself: the lifecycle of this state is tied to the plugin, and since plugins are instantiated once,
+    it is state that plugins can use across requests (like a prefix-cache index).
+  - State managed by the data layer: each endpoint is associated with state (currently metrics) that a data layer plugin can add to it.
+    A data layer plugin could, for example, be one that scrapes v1/models from the endpoint.
 
 ## Definitions
 - **Scheduling Framework** - The system created to allow for a pluggable scheduling algorithm.
-- **Scheduling Profile** - A named, specific set of Filter(s), Scorer(s), & Picker used to select endpoints.
-- **Scheduler** - An extensible implementation of a scheduling algorithm. Including logic to select Scheduling Profiles, the Scheduling Profiles themselves, & logic to interpret the result.
-- **Scheduling Cycle** - A single run of a Scheduler through the Scheduling Framework.
+- **Scheduler Profile** - A named, specific set of Filter(s), Scorer(s), & Picker used to select endpoints.
+- **Scheduler Profile Run** - A one-time run of the Scheduler Profile's filters, scorers, and picker for a given request.
+- **Scheduler** - An extensible implementation of a scheduling algorithm, including logic to select Scheduler Profiles iteratively,
+  the Scheduler Profiles themselves, & logic to interpret the result.
+- **Scheduling Cycle** - A single run of a Scheduler through the Scheduling Framework. A scheduling cycle includes one or
+  more Scheduler Profile runs (at least one).
 - **Plugin** - Implementation of framework-defined interface(s) to add or extend logic across the framework.
 
 ## Proposal
```
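The three state scopes added above can be sketched in Go. This is an illustrative sketch only: `cycleState` and `prefixCachePlugin` are simplified stand-ins, not the proposal's actual CycleState or plugin types.

```go
package main

import "fmt"

// cycleState is a per-request scratch space; its lifetime is one scheduling
// cycle, so no locking is needed (simplified stand-in for CycleState).
type cycleState struct {
	storage map[string]any
}

// prefixCachePlugin keeps state in the plugin struct itself: the plugin is
// instantiated once, so this index survives across requests.
type prefixCachePlugin struct {
	index map[string]int // hypothetical prompt -> times seen
}

// Score returns 1.0 for prompts seen before (likely prefix-cached), else 0.0.
func (p *prefixCachePlugin) Score(state *cycleState, prompt string) float64 {
	p.index[prompt]++                              // cross-request plugin state
	state.storage["prefix-hits"] = p.index[prompt] // per-request cycle state
	if p.index[prompt] > 1 {
		return 1.0
	}
	return 0.0
}

func main() {
	plugin := &prefixCachePlugin{index: map[string]int{}}
	// Each request gets a fresh cycle state; the plugin's index persists.
	s1 := &cycleState{storage: map[string]any{}}
	fmt.Println(plugin.Score(s1, "hello")) // first time seen: 0
	s2 := &cycleState{storage: map[string]any{}}
	fmt.Println(plugin.Score(s2, "hello")) // repeated prompt: 1
}
```

The point of the split is that the framework can discard `cycleState` when the request completes, while the plugin's own index keeps accumulating.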
```diff
@@ -33,23 +42,32 @@ The Scheduling System can loosely be defined into 3 sections:
 - A *configuration API* to define the Scheduler, Profile(s), & the plugins used within those profiles
 
 A sketch of the System, with extension points, is here:
-<img src="./images/scheduler_subsystem.svg" alt="Scheduling Algorithm" width="1000" />
+<img src="./images/scheduler_cycle.png" alt="Scheduling Algorithm" width="1000" />
 
 Describing the interface extension points & flow is the simplest way to convey the intent of what the framework should enable:
 
-### PreSchedule
+### ProfileHandler
 
-PreSchedule is the entry point into the scheduling cycle (called by the framework). PreSchedule selects profiles conditionally based on:
+ProfileHandler is a scheduler plugin with two extension points: ProfilePick and ProcessProfilesResults.
+Below is a detailed explanation of these extension points.
+Only a single ProfileHandler plugin may be defined per scheduler.
+
+### ProfilePick
+
+ProfilePick is the entry point into the scheduling cycle (called by the framework).
+It selects profiles conditionally based on:
 
 - Request data
-- Results
+- Results of previously executed SchedulerProfiles
 - Cycle State
 
-PreSchedule will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single PreSchedule function may be defined per scheduler.
+ProfilePick will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call.
+The ProfilePick extension point is configured as part of a ProfileHandler plugin.
+Since there is only a single ProfileHandler plugin, there is only a single ProfilePick function.
 
-### Profile Cycle
+### Scheduler Profile Run
 
-The profile cycle consists of 3 defined functions `Filter`, `Score`, & `Pick`
+The SchedulerProfile run consists of 3 defined phases: `Filter`, `Score`, & `Pick`
 
 *Profile Constraints*
 - A profile can have any number of `Filter` plugins registered (including zero)
@@ -61,17 +79,15 @@ The SchedulerProfile run consists of 3 defined phases `Filter`, `Score`, & `Pick`
 Filter runs before any scoring, and removes endpoints that are not fit for selection. The framework will return an error to the client if the endpoints are filtered to zero.
 
 #### Score
-Score applies a score to each remaining endpoint provided. Scorers SHOULD keep their score values in a normalized range: [0-1]. Any weighting should be added at the SchedulingProfile configuration level.
+Score applies a score to each remaining endpoint provided. Scorers SHOULD keep their score values in a normalized range: [0-1]. Any weighting should be added at the SchedulerProfile configuration level.
 
 #### Pick
 Picker selects the endpoint(s) from the provided list of scored endpoints. Picker MUST return one endpoint at minimum.
 
-### PostSchedule
-PostSchedule receives the output of the result(s) of the scheduling cycle(s) and makes sense of the data to be consumed by the calling system.
-
-### PostResponse
-PostResponse is a special case extension that can optionally be implemented by a plugin that needs to augment its state based on response or request data. This should only be implemented for plugins that need to update state outside of the scheduling cycle. PostResponse is run at the time of processing a response.
+### ProcessProfilesResults
+ProcessProfilesResults receives the result(s) of the scheduler profile run(s) and makes sense of the data to be consumed by the calling system.
+Since there is only a single ProfileHandler plugin, there is only a single ProcessProfilesResults function.
 
 ## ConfigurationAPI
 TODO
```
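The Filter → Score → Pick flow of a single profile run can be sketched as follows. Everything here is an illustrative assumption: the endpoint fields, the queue-length filter, and the thresholds are mine, not part of the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the proposal's types.
type endpoint struct {
	name     string
	queueLen int
}

type scoredEndpoint struct {
	endpoint
	score float64
}

// Filter phase: drop endpoints unfit for selection (here: overloaded ones).
// If this empties the list, the framework would return an error to the client.
func filterOverloaded(eps []endpoint, maxQueue int) []endpoint {
	out := []endpoint{}
	for _, e := range eps {
		if e.queueLen <= maxQueue {
			out = append(out, e)
		}
	}
	return out
}

// Score phase: normalized [0-1] score; a shorter queue scores higher.
func scoreByQueue(eps []endpoint, maxQueue int) []scoredEndpoint {
	scored := make([]scoredEndpoint, 0, len(eps))
	for _, e := range eps {
		scored = append(scored, scoredEndpoint{e, 1.0 - float64(e.queueLen)/float64(maxQueue)})
	}
	return scored
}

// Pick phase: MUST return at least one endpoint; here, the top scorer.
func pickBest(scored []scoredEndpoint) scoredEndpoint {
	sort.Slice(scored, func(i, j int) bool { return scored[i].score > scored[j].score })
	return scored[0]
}

// runProfile chains the three phases of one Scheduler Profile run.
func runProfile(eps []endpoint) scoredEndpoint {
	const maxQueue = 50
	return pickBest(scoreByQueue(filterOverloaded(eps, maxQueue), maxQueue))
}

func main() {
	eps := []endpoint{{"pod-a", 40}, {"pod-b", 5}, {"pod-c", 90}} // pod-c is filtered out
	fmt.Println(runProfile(eps).name)                             // pod-b has the shortest queue
}
```

Weighting, per the proposal, would live at the SchedulerProfile configuration level rather than inside the scorer itself.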

docs/proposals/0845-scheduler-architecture-proposal/interfaces/interface.go

Lines changed: 88 additions & 49 deletions
```diff
@@ -22,82 +22,121 @@ import (
 	scheduling "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
 )
 
-// READER NOTE: Currently CycleState is assumed to have appropriate request data rather than making a new object.
-
-// Plugin is the parent type for all the scheduling framework plugins.
-type Plugin interface {
-	Name() string
-}
-
 type Endpoint struct {
 	State EndpointState
-	Score float64
 }
 
 type EndpointState struct {
 	// storage is per Scheduling Cycle, and so has no thread-safe concerns.
 	storage map[string]any //nolint:unused
 }
 
-type SchedulingResult struct {
-	results map[string][]Endpoint //nolint:unused
+// Request is a structured representation of the fields we parse out of the request body.
+type Request struct {
+	// RequestId is the Envoy-generated ID for the request being processed.
+	RequestId string
+	// TargetModel is the final target model after traffic split.
+	TargetModel string
+	// Prompt is the prompt that was sent in the request body.
+	Prompt string
+	// Headers is a map of the request headers.
+	Headers map[string]string
 }
 
-// Scheduler is the implementation of a... scheduler.
-// The scheduler object is created at startup using the provided configuration.
-type Scheduler interface {
-	// PreSchedule selects scheduling profiles through the implemented
-	// logic, and returns:
-	// - profiles - A subset of the registered scheduling profiles to be run
-	PreSchedule(request map[string]any, data scheduling.CycleState, results map[string][]Endpoint) map[string]SchedulingProfile
-
-	// PostSchedule receives the output of the result(s) of the scheduling cycle(s)
-	// and makes sense of the data to be consumed by the calling system.
-	// For example: suppose you have 2 profiles, a ShadowBoxing Profile & a Production Profile.
-	// PostSchedule would know to simply log the result of the ShadowBoxing
-	// profile, and do nothing else with it.
-	PostSchedule(profileResults map[string][]Endpoint) SchedulingResult
+// ScoredEndpoint encapsulates an Endpoint with its Score.
+// The lifecycle of an endpoint is typically different from the lifecycle of a request.
+// This is intended to be used only internally by Scheduler logic and/or scheduler plugins within the lifecycle of the request.
+// When returning the selected Endpoint(s) out of the Scheduler, an Endpoint is returned without the score.
+type ScoredEndpoint struct {
+	Endpoint
+	Score float64
+}
+
+type Scheduler struct {
+	SchedulerConfig
 }
 
-// SchedulingProfile is used to describe a profile that will
+// SchedulerConfig is the struct that maps to the configuration file, which should be further discussed.
+// The configuration file should include the ProfileHandler plugin as well as the profiles with their plugins.
+type SchedulerConfig struct {
+	// Exactly one ProfileHandler instance is required.
+	profileHandler ProfileHandler //nolint:unused
+	// Map from profile name to its set of plugins.
+	profiles map[string]*SchedulerProfile //nolint:unused
+}
+
+// SchedulerProfile is used to describe a profile that will
 // run for a given scheduling cycle.
-type SchedulingProfile struct {
-	// Name of the profile.
-	Name string
-	// Filters lists all Filter plugins associated with this Profile. Filters
-	// are optional.
-	Filters []Filter
-	// Scorers lists all Score plugins associated with this Profile. Scorers
-	// are optional.
-	Scorers map[Scorer]int
+type SchedulerProfile struct {
+	// filters lists all Filter plugins associated with this Profile.
+	// Filters are optional.
+	filters []Filter //nolint:unused
+	// scorers lists all Score plugins associated with this Profile.
+	// Scorers are optional.
+	scorers []*WeightedScorer //nolint:unused
 	// Picker returns the function that picks the endpoint(s). Picker is required.
-	Picker Picker
+	picker Picker //nolint:unused
 }
 
-// Filter runs before any scoring, and removes endpoints that are not fit for
-// selection. The framework will return an error to the client if the endpoints
-// are filtered to zero.
+type SchedulingResult struct {
+	ProfileResults     map[string][]*Endpoint // a map from profile name to its scheduling result
+	PrimaryProfileName string                 // key of the primary profile; its selected endpoints are used by default as the destination
+}
+
+// Plugin is the parent type for all the scheduling framework plugins.
+type Plugin interface {
+	Name() string
+}
+
+// ProfileHandler defines the interface for handling multiple SchedulerProfile instances.
+// More specifically, this interface defines two extension points: 'Pick',
+// which runs iteratively, and 'ProcessResults', which runs after all profile runs complete
+// and processes the results of all profiles.
+type ProfileHandler interface {
+	Plugin
+	// Pick picks the SchedulerProfile objects to run from a list of candidate profiles,
+	// while taking into consideration the request properties
+	// and the previously executed SchedulerProfile runs along with their results.
+	// Returns:
+	// - profiles - A subset of the registered scheduling profiles to be run in the next iteration.
+	Pick(request *Request, profiles map[string]*SchedulerProfile, executionResults map[string][]*ScoredEndpoint) map[string]*SchedulerProfile
+
+	// ProcessResults handles the outcome of each profile run.
+	// It may aggregate results, log test profile outputs, or apply custom logic. It specifies in the SchedulingResult the
+	// key of the primary profile that should be used to get the request's selected destination.
+	// Example: suppose you have 2 profiles, a ShadowBoxing Profile & a Production Profile.
+	// ProcessResults would know to simply log the result of the ShadowBoxing
+	// profile, and do nothing else with it.
+	ProcessResults(request *Request, profileResults map[string][]*ScoredEndpoint) *SchedulingResult
+}
+
+// Filter runs before any scoring, and removes endpoints that are not fit for selection.
+// The framework will return an error to the client if the endpoints are filtered to zero.
 type Filter interface {
 	Plugin
-	Filter(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+	Filter(ctx context.Context, request *Request, state *scheduling.CycleState, endpoints []*Endpoint) []*Endpoint
 }
 
-// Scorer applies a score to each remaining endpoint provided. Scorers SHOULD
-// keep their score values in a normalized range: [0-1]. Any weighting should
-// be added at the SchedulingProfile configuration level.
+// Scorer applies a score to each remaining endpoint provided.
+// Scorers SHOULD keep their score values in a normalized range: [0-1].
+// Any weighting should be added at the SchedulerProfile configuration level.
 type Scorer interface {
 	Plugin
-	Score(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+	Score(ctx context.Context, request *Request, state *scheduling.CycleState, endpoints []*Endpoint) []*ScoredEndpoint
+}
+
+// WeightedScorer encapsulates a scorer with its weight.
+// We need this struct in order to keep scorers in a profile as a slice instead of a map.
+// This is very useful for having a generic AddPlugin function that registers a plugin to all its extension points;
+// using a map is much less convenient for this purpose.
+type WeightedScorer struct {
+	Scorer
+	weight int //nolint:unused
 }
 
 // Picker selects the endpoint(s) from the provided list of scored endpoints.
 // Picker MUST return one endpoint at minimum.
 type Picker interface {
 	Plugin
-	Pick(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
-}
-
-type PostResponse interface {
-	Plugin
-	PostResponse(ctx context.Context, request map[string]any, response map[string]any)
+	Pick(ctx context.Context, state *scheduling.CycleState, endpoints []*ScoredEndpoint) []*ScoredEndpoint
 }
```
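To make the proposed interfaces concrete, here is a sketch of a ProfileHandler along the lines of the ShadowBoxing/Production example in the comments. The types are re-declared locally (and simplified, without CycleState or context) so the snippet is self-contained; the profile names and handler logic are illustrative, not part of the proposal.

```go
package main

import "fmt"

// Local, simplified re-declarations of the proposal's types.
type Endpoint struct{ Name string }
type ScoredEndpoint struct {
	Endpoint
	Score float64
}
type SchedulerProfile struct{}
type Request struct{ TargetModel string }
type SchedulingResult struct {
	ProfileResults     map[string][]*Endpoint
	PrimaryProfileName string
}

// shadowHandler runs a "production" profile alongside a "shadow" profile,
// then keeps only the production result as the request destination.
type shadowHandler struct{}

func (h *shadowHandler) Name() string { return "shadow-handler" }

// Pick returns all registered profiles on the first call and nothing afterwards,
// ending the scheduling cycle after one iteration.
func (h *shadowHandler) Pick(req *Request, profiles map[string]*SchedulerProfile, executed map[string][]*ScoredEndpoint) map[string]*SchedulerProfile {
	if len(executed) > 0 {
		return nil // all profiles already ran
	}
	return profiles
}

// ProcessResults logs the shadow result, strips scores, and marks "production" as primary.
func (h *shadowHandler) ProcessResults(req *Request, results map[string][]*ScoredEndpoint) *SchedulingResult {
	out := &SchedulingResult{ProfileResults: map[string][]*Endpoint{}, PrimaryProfileName: "production"}
	for name, scored := range results {
		if name == "shadow" {
			fmt.Printf("shadow profile picked %d endpoint(s); ignoring\n", len(scored))
		}
		for _, se := range scored {
			ep := se.Endpoint // endpoints leave the scheduler without their score
			out.ProfileResults[name] = append(out.ProfileResults[name], &ep)
		}
	}
	return out
}

func main() {
	h := &shadowHandler{}
	results := map[string][]*ScoredEndpoint{
		"production": {{Endpoint{"pod-a"}, 0.9}},
		"shadow":     {{Endpoint{"pod-b"}, 0.7}},
	}
	res := h.ProcessResults(&Request{TargetModel: "llama"}, results)
	fmt.Println(res.ProfileResults[res.PrimaryProfileName][0].Name) // the production pick
}
```

Note how `ProcessResults` converts `ScoredEndpoint` back to `Endpoint`, matching the interface comment that scores are internal to the request's lifecycle.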

site-src/contributing/index.md

Lines changed: 1 addition & 2 deletions
```diff
@@ -45,5 +45,4 @@ doc. Feel free to add topics for discussion at an upcoming meeting.
 
 All meetings are recorded and automatically uploaded to the [WG-Serving meetings
 YouTube
-playlist][https://www.youtube.com/playlist?list=PL69nYSiGNLP30qNanabU75ayPK7OPNAAS].
-
+playlist](https://www.youtube.com/playlist?list=PL69nYSiGNLP30qNanabU75ayPK7OPNAAS).
```

site-src/guides/metrics.md

Lines changed: 66 additions & 1 deletion
````diff
@@ -93,4 +93,69 @@ TOKEN=$(kubectl -n default get secret inference-gateway-sa-metrics-reader-secret
 kubectl -n default port-forward inference-gateway-ext-proc-pod-name 9090
 
 curl -H "Authorization: Bearer $TOKEN" localhost:9090/metrics
 ```
+
+## Prometheus Alerts
+
+This section describes how to configure Prometheus alerts using the collected metrics.
+
+### Configure alerts
+
+You can follow this [blog post](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/) for instructions on setting up alerts in your monitoring stack with Prometheus.
+
+A template alert rule is available at [alert.yaml](../../tools/alerts/alert.yaml). You can modify and append these rules to your existing Prometheus deployment.
+
+#### High Inference Request Latency P99
+
+```
+alert: HighInferenceRequestLatencyP99
+expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0 # Adjust threshold as needed (e.g., 10.0 seconds)
+for: 5m
+annotations:
+  title: 'High latency (P99) for model {{ $labels.model_name }}'
+  description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
+labels:
+  severity: 'warning'
+```
+
+#### High Inference Error Rate
+
+```
+alert: HighInferenceErrorRate
+expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05 # Adjust threshold as needed (e.g., 5% error rate)
+for: 5m
+annotations:
+  title: 'High error rate for model {{ $labels.model_name }}'
+  description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
+labels:
+  severity: 'critical'
+  impact: 'availability'
+```
+
+#### High Inference Pool Queue Average Size
+
+```
+alert: HighInferencePoolAvgQueueSize
+expr: inference_pool_average_queue_size > 50 # Adjust threshold based on expected queue size
+for: 5m
+annotations:
+  title: 'High average queue size for inference pool {{ $labels.name }}'
+  description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
+labels:
+  severity: 'critical'
+  impact: 'performance'
+```
+
+#### High Inference Pool Average KV Cache
+
+```
+alert: HighInferencePoolAvgKVCacheUtilization
+expr: inference_pool_average_kv_cache_utilization > 0.9 # 90% utilization
+for: 5m
+annotations:
+  title: 'High KV cache utilization for inference pool {{ $labels.name }}'
+  description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
+labels:
+  severity: 'critical'
+  impact: 'resource_exhaustion'
+```
````
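The HighInferenceErrorRate rule divides the per-model error rate by the total request rate and fires above a 5% threshold. The same arithmetic, sketched in Go for clarity; the function name and the zero-traffic guard are my additions, not part of the metrics API (Prometheus itself simply yields no sample when the denominator is absent).

```go
package main

import "fmt"

// highErrorRate mirrors the HighInferenceErrorRate expression:
// rate(errors) / rate(requests) > threshold.
func highErrorRate(errorsPerSec, requestsPerSec, threshold float64) bool {
	if requestsPerSec == 0 {
		return false // no traffic, no alert
	}
	return errorsPerSec/requestsPerSec > threshold
}

func main() {
	fmt.Println(highErrorRate(0.4, 10, 0.05)) // 4% error rate: below threshold
	fmt.Println(highErrorRate(1.2, 10, 0.05)) // 12% error rate: alert fires
}
```

The `for: 5m` clause in the rule additionally requires the condition to hold continuously for five minutes before the alert fires.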

tools/alerts/alert.yaml

Lines changed: 38 additions & 0 deletions
```diff
@@ -0,0 +1,38 @@
+groups:
+- name: gateway-api-inference-extension
+  rules:
+  - alert: HighInferenceRequestLatencyP99
+    annotations:
+      title: 'High latency (P99) for model {{ $labels.model_name }}'
+      description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
+    expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0
+    for: 5m
+    labels:
+      severity: 'warning'
+  - alert: HighInferenceErrorRate
+    annotations:
+      title: 'High error rate for model {{ $labels.model_name }}'
+      description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
+    expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05
+    for: 5m
+    labels:
+      severity: 'critical'
+      impact: 'availability'
+  - alert: HighInferencePoolAvgQueueSize
+    annotations:
+      title: 'High average queue size for inference pool {{ $labels.name }}'
+      description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
+    expr: inference_pool_average_queue_size > 50
+    for: 5m
+    labels:
+      severity: 'critical'
+      impact: 'performance'
+  - alert: HighInferencePoolAvgKVCacheUtilization
+    annotations:
+      title: 'High KV cache utilization for inference pool {{ $labels.name }}'
+      description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
+    expr: inference_pool_average_kv_cache_utilization > 0.9
+    for: 5m
+    labels:
+      severity: 'critical'
+      impact: 'resource_exhaustion'
```
