24 changes: 15 additions & 9 deletions docs/proposals/0845-scheduler-architecture-proposal/README.md
@@ -9,7 +9,7 @@ The Scheduling Subsystem is a framework used to implement scheduling algorithms.

## Design Principles
- The scheduler framework should act as an independent library, there should be no dependency on EPP packages defined outside of the scheduler
- The *framework* should be agnostic to web protocols(such as HTTP), endpoint types (such as model servers), and K8s concepts.
- The *framework* should be agnostic to endpoint types (such as model servers) and K8s concepts.
- Opinions should be held by the plugins, not the framework
- The entry & exit points should be defined by the framework, acting as the API surface of the system
- Multiple scheduling 'profiles' should be able to be run for a single request.
@@ -46,20 +46,28 @@ A sketch of the System, with extension points is here:

Describing the interface extension points & flow is the simplest way to convey the intent of what the framework should enable:

### ProfileSelect (or ProfilePick)
### ProfileHandler

ProfileHandler is a scheduler plugin with two extension points - ProfilePick and ProcessProfilesResults.
Each extension point is described in detail below.
Only a single ProfileHandler plugin may be defined per scheduler.

### ProfilePick

ProfilePick is the entry point into the scheduling cycle (called by the framework).
ProfileSelect, selects profiles conditionally based on:
It selects profiles conditionally based on:

- Request data
- Results of previously executed SchedulerProfiles
- Cycle State

ProfileSelect will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single ProfileSelect function may be defined per scheduler.
ProfilePick will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call.
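The call-until-empty contract can be sketched as the loop the framework might run. This is a hypothetical, trimmed sketch: the types and the toy `pick` function are stand-ins for illustration, not the proposal's exact signatures.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the proposal's types.
type Request struct{}
type SchedulerProfile struct{ name string }
type ScoredEndpoint struct{}

// runCycle sketches how the framework could drive ProfilePick iteratively:
// it keeps calling pick until no profiles are returned.
func runCycle(
	req *Request,
	profiles map[string]*SchedulerProfile,
	pick func(*Request, map[string]*SchedulerProfile, map[string][]*ScoredEndpoint) map[string]*SchedulerProfile,
) map[string][]*ScoredEndpoint {
	results := map[string][]*ScoredEndpoint{}
	for {
		selected := pick(req, profiles, results)
		if len(selected) == 0 {
			return results // an empty result ends the scheduling cycle
		}
		for name := range selected {
			// run the profile's Filter/Score/Pick phases (elided) and record results
			results[name] = []*ScoredEndpoint{{}}
		}
	}
}

func main() {
	profiles := map[string]*SchedulerProfile{"decode": {name: "decode"}, "prefill": {name: "prefill"}}
	calls := 0
	pick := func(_ *Request, p map[string]*SchedulerProfile, done map[string][]*ScoredEndpoint) map[string]*SchedulerProfile {
		calls++
		if _, ran := done["decode"]; !ran {
			return map[string]*SchedulerProfile{"decode": p["decode"]}
		}
		if _, ran := done["prefill"]; !ran {
			return map[string]*SchedulerProfile{"prefill": p["prefill"]}
		}
		return nil // no more profiles to run
	}
	results := runCycle(&Request{}, profiles, pick)
	fmt.Println(len(results), calls) // 2 profiles ran; pick was called 3 times
}
```

Note that the final iteration always returns an empty map; this is the implicit "no more profiles" convention the thread below debates against an explicit boolean.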
**Contributor:**
I thought we wanted to return a boolean

**Contributor Author:**
ideally it would be good.
the point is that the next ProfilePick iteration depends on the current profile's execution result, so that's not always possible. for example in P/D (let's assume it's enabled):

  • on the first iteration we run D and have the results.
  • on the second iteration we inspect the results of D, do some prefix-cache check, and only then decide if P should run or not.

this means we cannot decide upfront on a boolean of whether an additional iteration should be called or not.
this is the original reason I haven't added bool to this part.
see code example from llm-d here.
this code is going to change to use SchedulerProfiles instead of two schedulers, so this check should be embedded in the profile selection logic.

**Contributor (@ahg-g, Jun 9, 2025):**
Without a boolean, you always need an extra call to ProfilePick. With boolean, in some cases you will get to execute the right number of iterations, in others you will need an extra call that returns false with nil profile to execute.

**Collaborator:**

Are we allowing multiple ProfileSelect functions to be defined per scheduler?

Additionally, the boolean makes it much more clear to the implementer how multiple selection is handled. Otherwise a user may skim the documentation and miss the line that specifies how multiple runs should be handled

**Contributor:**

IIRC, we wanted a boolean "call again" as an explicit return, whereas an empty slice is more of an implicit convention.
Both options are equally expressive:
call again == (len(profiles) != 0)
The ProfilePick always gets called a second time in the example. The difference is only between the second call returning an empty slice or a boolean to indicate a third call is not needed.
At least that's the way I understand it...

**Collaborator:**

They provide the same functionality, completely agreed there.

I'm just suggesting that a return type of ...) map[string]*SchedulerProfile vs ...) (selectedProfiles map[string]*SchedulerProfile, rerun bool) allows an implementer to more easily intuit the functionality of the system, or ask themselves the question 'what does the rerun bool do?' and then go to the documentation and learn. Vs the implicit method, it requires a user to be aware of a somewhat 'secret' functionality.

Sure we will document it, but I don't think all users read documentation exhaustively, and with the added benefit of no forced extra call to ProfilePick it seems slightly better to have a bool imo.

**Contributor Author (@nirrozenbaum, Jun 10, 2025):**

I personally think a bool return value here has inconsistent behavior.
In some scenarios, we know that no extra call is needed.
In other scenarios, like the one I described above, we will do an extra call and get back an empty profiles map.
IMO it's better to define behavior that is consistent in all cases, rather than an inconsistent bool that may confuse more than help.

btw, I do expect whoever wants to implement a plugin to read its documentation.
the function godoc is only two lines, not an exhaustive description.

> Are we allowing multiple ProfileSelect functions to be defined per scheduler?

only one ProfileHandler plugin is allowed, which means only one ProfileSelect is allowed.
this is still not validated in code but is documented (should be also validated in next PRs).

ProfilePick extension point will be configured as part of a ProfileHandler plugin.
Since there is only a single ProfileHandler plugin, that means there is only a single ProfilePick function.

### Scheduler Profile Run

The SchedulerPprofile run consists of 3 defined phases `Filter`, `Score`, & `Pick`
The SchedulerProfile run consists of 3 defined phases `Filter`, `Score`, & `Pick`

*Profile Constraints*
- A profile can have any number of `Filter` plugins registered (including zero)
@@ -78,10 +86,8 @@ Picker selects the endpoint(s) from the provided list of scored endpoints. Picke


### ProcessProfilesResults
ProcessProfilesResults recieves the output of the result(s) of the scheduler profile(s) and makes sense of the data to be consumed by the calling system.

### PostResponse (Out of Scheduler and mentioned here for completeness only)
PostResponse is a special case extension that can optionally be implemented by a plugin that needs to augment its state based on response or request data. This should only be implemented for plugins that need to update state outside of the scheduling cycle. PostResponse is ran at the time of processing a response.
ProcessProfilesResults receives the output of the result(s) of the scheduler profile(s) and makes sense of the data to be consumed by the calling system.
Since there is only a single ProfileHandler plugin, that means there is only a single ProcessProfilesResults function.
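Putting the pieces above together, a single profile run (the Filter, Score, and Pick phases) could look roughly like this sketch. The types are trimmed stand-ins for the proposal's (the real interfaces thread a context and CycleState through), and the top-1 picker is just an example:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical, trimmed stand-ins for the proposal's types.
type Endpoint struct{ Name string }
type ScoredEndpoint struct {
	Endpoint
	Score float64
}
type Filter func([]*Endpoint) []*Endpoint
type Scorer struct {
	Weight int
	Score  func(*Endpoint) float64
}

// runProfile sketches the three phases of a SchedulerProfile run:
// filters drop unfit endpoints, weighted scorers rank the rest, and a
// picker (here: take the single best) selects the result.
func runProfile(endpoints []*Endpoint, filters []Filter, scorers []Scorer) []*ScoredEndpoint {
	for _, f := range filters {
		endpoints = f(endpoints)
	}
	scored := make([]*ScoredEndpoint, 0, len(endpoints))
	for _, ep := range endpoints {
		s := &ScoredEndpoint{Endpoint: *ep}
		for _, sc := range scorers {
			s.Score += float64(sc.Weight) * sc.Score(ep)
		}
		scored = append(scored, s)
	}
	sort.Slice(scored, func(i, j int) bool { return scored[i].Score > scored[j].Score })
	if len(scored) == 0 {
		return nil
	}
	return scored[:1] // a trivial "pick the top endpoint" picker
}

func main() {
	eps := []*Endpoint{{Name: "a"}, {Name: "b"}, {Name: "c"}}
	dropC := func(in []*Endpoint) []*Endpoint { // example Filter
		var out []*Endpoint
		for _, e := range in {
			if e.Name != "c" {
				out = append(out, e)
			}
		}
		return out
	}
	preferB := Scorer{Weight: 2, Score: func(e *Endpoint) float64 { // example weighted Scorer
		if e.Name == "b" {
			return 1
		}
		return 0
	}}
	picked := runProfile(eps, []Filter{dropC}, []Scorer{preferB})
	fmt.Println(picked[0].Name) // b
}
```

This also illustrates the profile constraints: zero or more filters, zero or more weighted scorers, and exactly one picker per profile.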

## ConfigurationAPI
TODO
@@ -28,8 +28,7 @@ type Endpoint struct {

type EndpointState struct {
// storage is per Scheduling Cycle, and so has no thread-safe concerns.
// TODO should think if the above is true or should we use sync map for thread safety.
storage map[string]any
storage map[string]any //nolint:unused
}

// Request is a structured representation of the fields we parse out of the Request body.
@@ -58,49 +57,57 @@ type Scheduler struct {
}

// SchedulerConfig is the struct that maps to the configuration file that should be further discussed.
// the configuration file should include the multi profile plugin as well as the profiles with their plugins.
// TODO should update the configuration file example.yaml to discuss its structure.
// the configuration file should include the ProfileHandler plugin as well as the profiles with their plugins.
type SchedulerConfig struct {
// exactly one MultiProfilePlugin instance is required.
multiProfilePlugin MultiProfilePlugin
// exactly one ProfileHandler instance is required.
profileHandler ProfileHandler //nolint:unused
// map from profile name to its set of plugins.
profiles map[string]*SchedulerProfile
profiles map[string]*SchedulerProfile //nolint:unused
}

// SchedulerProfile is used to describe a profile that will
// run for a given scheduling cycle.
type SchedulerProfile struct {
// Filters lists all Filter plugins associated with this Profile.
// Filters are optional.
filters []Filter
filters []Filter //nolint:unused
// Scorers lists all Score plugins associated with this Profile.
// Scorers are optional.
scorers []*WeightedScorer
scorers []*WeightedScorer //nolint:unused
// Picker returns the function that picks the endpoint(s). Picker is required.
picker Picker
picker Picker //nolint:unused
}

type SchedulingResult struct {
ProfileResults map[string][]*Endpoint // a map from profile name to its scheduling result
PrimaryProfileName string // key of the primary profile, its selected endpoints will be used by default as the destination
}

// Plugin is the parent type for all the scheduling framework plugins.
type Plugin interface {
Name() string
}

// MultiProfilePlugin defines the interface for handling multi SchedulerProfile instances.
type MultiProfilePlugin interface {
// ProfileHandler defines the interface for handling multiple SchedulerProfile instances.
// More specifically, this interface defines two extension points: 'PickProfiles',
// which runs iteratively, and 'ProcessProfilesResults', which runs after all profile runs complete
// and processes the results of all profiles.
type ProfileHandler interface {
Plugin
// PickProfiles picks the SchedulingProfile objects to run from a list of candidate profiles,
// Pick picks the SchedulingProfile objects to run from a list of candidate profiles,
// while taking into consideration the request properties
// and the previously executed SchedulerProfile runs along with their results.
// returns:
// - profiles - A subset of the registered scheduling profiles to be run in the next iteration
PickProfiles(request *Request, profiles map[string]*SchedulerProfile, executionResults map[string][]*ScoredEndpoint) map[string]*SchedulerProfile
Pick(request *Request, profiles map[string]*SchedulerProfile, executionResults map[string][]*ScoredEndpoint) map[string]*SchedulerProfile

// ProcessProfileResults handles the outcome of each selected profile.
// It may aggregate results, log test profile outputs, or apply custom logic.
// For example: suppose you have 2 profiles ShadowBoxing Profile & Production Profile.
// ProcessResults handles the outcome of each profile run.
// It may aggregate results, log test profile outputs, or apply custom logic. It sets in the SchedulingResult the
// key of the primary profile, whose selected endpoints are used as the request's destination.
// Example: suppose you have 2 profiles ShadowBoxing Profile & Production Profile.
// ProcessProfileResults would know to simply log the result of ShadowBoxing
// profile, and do nothing else with it.
ProcessProfileResults(request *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint
ProcessResults(request *Request, profileResults map[string][]*ScoredEndpoint) *SchedulingResult
}
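As a hedged illustration of how the interface above might be implemented, here is a sketch of a ProfileHandler for the P/D scenario discussed in the comments (run decode first, run prefill only on a prefix-cache miss). The `prefixCacheHit` hook, the profile names, and the trimmed types are assumptions for illustration, not part of the proposal:

```go
package main

import "fmt"

// Hypothetical stand-ins for the proposal's types, trimmed for illustration.
type Request struct{ Prompt string }
type SchedulerProfile struct{}
type ScoredEndpoint struct{ Name string }
type Endpoint struct{ Name string }
type SchedulingResult struct {
	ProfileResults     map[string][]*Endpoint
	PrimaryProfileName string
}

// pdHandler sketches a ProfileHandler for the P/D case: the decision to run
// the prefill profile depends on the decode profile's result, which is why
// profile selection must be iterative.
type pdHandler struct {
	prefixCacheHit func(*Request) bool // assumed injected check, not part of the proposal
}

func (h *pdHandler) Name() string { return "pd-profile-handler" }

func (h *pdHandler) Pick(req *Request, profiles map[string]*SchedulerProfile,
	results map[string][]*ScoredEndpoint) map[string]*SchedulerProfile {
	if _, ran := results["decode"]; !ran {
		return map[string]*SchedulerProfile{"decode": profiles["decode"]}
	}
	if _, ran := results["prefill"]; !ran && !h.prefixCacheHit(req) {
		return map[string]*SchedulerProfile{"prefill": profiles["prefill"]}
	}
	return nil // done: empty result ends the scheduling cycle
}

func (h *pdHandler) ProcessResults(req *Request,
	results map[string][]*ScoredEndpoint) *SchedulingResult {
	out := &SchedulingResult{ProfileResults: map[string][]*Endpoint{}, PrimaryProfileName: "decode"}
	for name, eps := range results {
		for _, ep := range eps {
			out.ProfileResults[name] = append(out.ProfileResults[name], &Endpoint{Name: ep.Name})
		}
	}
	return out
}

func main() {
	h := &pdHandler{prefixCacheHit: func(*Request) bool { return true }}
	profiles := map[string]*SchedulerProfile{"decode": {}, "prefill": {}}
	first := h.Pick(&Request{}, profiles, map[string][]*ScoredEndpoint{})
	fmt.Println(len(first)) // first iteration picks the decode profile only
}
```

On a cache hit, the second `Pick` call returns an empty map and prefill never runs; on a miss, prefill is scheduled and a third call ends the cycle.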
**Contributor Author (@nirrozenbaum, Jun 5, 2025):**

a question I was thinking about-
since we agreed that Scheduler IS aware of http protocol (Request struct represents it), what are your thoughts about having ProcessResults get all profiles results but return only one?

in llm-d for example - since scheduler is aware of http, we can set the prefill header at this point and return only decode selection.
and in general I think it makes sense to return one Result, i.e., results of one profile (return []*Endpoint and optionally an array of backups)
@elevran @ahg-g

**Contributor:**

I was thinking that this will be done by the PostSchedule plugin. Here is my thinking of how we will compose with the default behavior:

This maintains the separation of concern and composes with the non p/d case nicely.

**Contributor Author (@nirrozenbaum, Jun 6, 2025):**

I followed everything except for the CycleState part, in which I probably didn't understand your intention.
in current Proposal ProcessResults signature is:

ProcessProfileResults(request *Request, profileResults map[string][]*ScoredEndpoint) map[string][]*Endpoint

looking at the return value here - we get back a map from profile-name -> its results (set of endpoints).

let's assume we have PostSchedule in director (or as @liu-cong named it, PreRequest, which I agree is more descriptive/accurate, plus having PreRequest and PostResponse sounds to me like a better terminology than PostSchedule and PostResponse).
using this plugin we should be able to set the headers of destination (decode) + prefill.

can you explain what you meant in the CycleState part?

ProcessResults returns the endpoints for the decoder (and so DestinationEndpoint plugin will set them as per the epp<-> proxy protocol), while the prefill endpoint is stored in CycleState.

CycleState scope is within the scheduler, so it's not returning back to requestcontrol layer.

**Contributor (@ahg-g, Jun 6, 2025):**

> looking at the return value here - we get back a map from profile-name -> its results (set of endpoints).

Oh, I thought we return a single final list that in the normal case a PreRequest plugin uses to set the destination endpoints.

> CycleState scope is within the scheduler, so it's not returning back to requestcontrol layer.

> can you explain what you meant in the CycleState part?

I assumed CycleState is scoped across all extension points executed for a request. If we do that, then what I was proposing is to use CycleState as the communication channel between the P/D specific ProcessResults and PreRequest plugins to communicate the list of prefill servers to be set in the headers. Again, this was assuming that ProcessResults returned a list of endpoints, not a map.

Another approach is to continue to have ProcessResult return a map, but the decode endpoints should have a key that the default PreRequest plugin understands for both the default and p/d case (in the p/d case, ProcessResult can set the key for the decode profile endpoints to whatever value expected by the default PreRequest plugin). And the addition PreRequest plugin operates on the prefill endpoints to set them as a header.

I am trying to avoid having the scheduling layer implement parts of the epp <-> proxy protocol.

In summary, we will have two PreRequest plugins: The first ships with IGE and implements the destination endpoint protocol, from the map[string][]*Endpoint it uses a well known key to lookup the list of endpoints it will use. The second ships with llm-d and implements the vllm protocol for setting the prefill endpoints as a header.
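The two-PreRequest-plugin split described above could be sketched as follows. The well-known key name and the header names here are assumptions for illustration only:

```go
package main

import "fmt"

// primaryKey is an assumed well-known key into the profile-results map;
// the actual key name is not specified in the discussion.
const primaryKey = "primary"

type Endpoint struct{ Address string }

// defaultPreRequest sketches the plugin that ships with the gateway and
// implements the destination-endpoint protocol via the well-known key.
func defaultPreRequest(results map[string][]*Endpoint, headers map[string]string) {
	if eps, ok := results[primaryKey]; ok && len(eps) > 0 {
		headers["x-destination-endpoint"] = eps[0].Address // assumed header name
	}
}

// pdPreRequest sketches the llm-d plugin that additionally exposes the
// prefill endpoint as a header; it ignores the primary key entirely.
func pdPreRequest(results map[string][]*Endpoint, headers map[string]string) {
	if eps, ok := results["prefill"]; ok && len(eps) > 0 {
		headers["x-prefill-endpoint"] = eps[0].Address // assumed header name
	}
}

func main() {
	results := map[string][]*Endpoint{
		primaryKey: {{Address: "10.0.0.2:8000"}}, // decode result stored under the well-known key
		"prefill":  {{Address: "10.0.0.3:8000"}},
	}
	headers := map[string]string{}
	defaultPreRequest(results, headers)
	pdPreRequest(results, headers)
	fmt.Println(headers["x-destination-endpoint"], headers["x-prefill-endpoint"])
}
```

In the non-P/D case the prefill entry is simply absent and the second plugin is a no-op, which is how the composition stays protocol-agnostic in the scheduling layer.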

**Contributor:**

I prefer the second approach, but I still think we should scope CycleState to be across all extension points executed on a request. We can define CycleState in the common pkg.

**Contributor Author (@nirrozenbaum, Jun 6, 2025):**

I’m also ok with cycle state per request, although I’d prefer avoiding it. I just think we need to pay special attention that the field is not abused, cause it becomes extremely easy to communicate information in a generic cycle state rather than well defined interfaces. I would have a cycle state per layer during the request lifecycle to verify the interfaces between the layers are well defined.

**Contributor:**

> and if we agree on this I think the proposal is ready for finalization, as we converged on everything

Yes, we have a few followups as well related to converting this into a generic extensibility proposal with subsections for various layers, defining a common pkg for types and interfaces, grouping the plugin implementations under one directory.

**Collaborator:**

> we should scope CycleState to be across all extension points executed on a request

Just calling this out. But if we do this. We are stating that the EPP architecture is no longer made up of independent subsystems that could be broken out into their own lib, and instead tying them all together.

I don't particularly love that. The hope was that the director/request control would be the mortar where we would handle the grey area. But not willing to die on this hill. More just calling it out.

**Contributor (@ahg-g, Jun 9, 2025):**

Discussed offline with Kellen, I think we need to strike a balance between our goal of maintaining a reusable scheduling library and a design that works well across the layers we define in EPP.

My recommendation is to define common types, like CycleState and Request that all layers depend on. This doesn't prevent the scheduling library from being reusable in a different context.

I don't think we want to prevent plugins from implementing extensions across subsystems if it makes sense to do so. Each plugin should declare its dependencies, and if someone wants to use the scheduling library in a different context, they can disable the plugins that do that if they are incompatible with their system, or implement in their system the behavior expected by those other extensions (e.g., adding state to CycleState that is expected by the scheduling parts of the plugin).

**Contributor Author:**

I agree with Kellen (at least with his initial comment) that it would be good NOT to have one CycleState per request, as it may cause some issues of bypassing interfaces that we define between the layers.
as I stated in previous comments:

> I just think we need to pay special attention that the field is not abused, cause it becomes extremely easy to communicate information in a generic cycle state rather than well defined interfaces. I would have a cycle state per layer during the request lifecycle to verify the interfaces between the layers are well defined.

I agree that we need to balance as @ahg-g mentioned.
let's discuss this more as we make progress with more extension points in more layers.


// Filter runs before any scoring, and remove endpoints that are not fit for selection.
@@ -124,7 +131,7 @@ type Scorer interface {
// Using a map is much less convenient for this purpose.
type WeightedScorer struct {
Scorer
weight int
weight int //nolint:unused
}

// Picker selects the endpoint(s) from the provided list of scored endpoints.
@@ -133,9 +140,3 @@ type Picker interface {
Plugin
Pick(ctx context.Context, state *scheduling.CycleState, endpoints []*ScoredEndpoint) []*ScoredEndpoint
}

// PostResponse is NOT part of the scheduler subsystem but is specified here for completeness only.
type PostResponse interface {
Plugin
PostResponse(ctx context.Context, request *Request, response map[string]any)
}