Skip to content

Conversation

@rahulgurnani
Copy link
Contributor

@rahulgurnani rahulgurnani commented Oct 31, 2025

Add PrepareData and AdmitRequest plugins based on recommendations in evolving datalayer changes

The prepare data plugins are executed sequentially in this PR.
Furthermore, the prefix cache match plugin is updated to implement the PrepareRequestData plugin by making minimal changes.

In a follow up PR, the prepare data plugins would be executed in order of dependency graph and validated on startup for cycles. Also, the prefix cache match plugin will be split into a separate scorer in future.

The PR also does some refactor of director.go code.

cc @BenjaminBraunDev

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR is needed to make it easier to implement plugins that produce data consumed by other plugins. For instance latency predictor, prefix match plugin etc.

Read the doc evolving datalayer changes
for more details

Which issue(s) this PR fixes:

Addresses #1743

Does this PR introduce a user-facing change?:

Yes, enables writing prepare data and admit request plugins for users of IGW.

Add prepare data and admit request plugins

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 31, 2025
@netlify
Copy link

netlify bot commented Oct 31, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 471c37e
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6914d639f4d2dc00082f2b93
😎 Deploy Preview https://deploy-preview-1796--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 31, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rahulgurnani
Once this PR has been reviewed and has the lgtm label, please assign nirrozenbaum for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and elevran October 31, 2025 16:22
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @rahulgurnani. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 31, 2025
@kfswain
Copy link
Collaborator

kfswain commented Oct 31, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 31, 2025
loggerDebug := log.FromContext(ctx).V(logutil.DEBUG)
for _, plugin := range d.requestControlPlugins.admitRequestPlugins {
loggerDebug.Info("Running AdmitRequest plugin", "plugin", plugin.TypedName())
if !plugin.Admit(ctx, request, pods) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should allow the plugin to return a string explaining the reason for rejection. We can then just treat the empty string as the allow mechanism (less opinionated on this part tho).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!


// Attributes provides a goroutine-safe implementation of AttributeMap.
type Attributes struct {
data sync.Map // key: attribute name (string), value: attribute value (opaque, Cloneable)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically I am all for using prebuild libs to handle this type of complexity.

But since writes to specific attributes will lock the entire data object, we may have high lock contention here. Did we explore having a lock per attribute key?

That would let locks be at the granularity of a specific endpoint & a specific attribute, which should hopefully reduce lock contention & let our system be more performant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. This attribute map is a per request copy where we take a snapshot of the attributes so that we can use them in the scheduling layer. Given the per request nature of the map, I think it won't have contention because it will take like t < microsecond to update the map. I think its reasonable to use sync map here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we scale test to ensure we don't have any regression? We can consider baseline metrics what we have here: #1458

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to do this in next PRs since we are not actually using the map in this change. Thanks!

@rahulgurnani rahulgurnani force-pushed the preparedata-changes branch 3 times, most recently from 3e16069 to e3021c7 Compare November 3, 2025 03:13
return reqCtx, errutil.Error{Code: errutil.ServiceUnavailable, Msg: "failed to find candidate pods for serving the request"}
}

// TODO(rahulgurnani/lukevandrie): Perhaps, refactor/implement Admit plugin for Admission control.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is important. what is the relation between admission controller Admit and the admitRequest plugins?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

This would be a break in contract of how flow control operates. This specific plugin is for request specific semantics. We have currently do not have Flow Control considering request specific semantics, and there hasnt been a proposal suggesting this change. I think we should remove this todo until we have strong reasoning to actually do this work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my apologize, but I don't understand the intention of this new Admission plugin.
I thought we want to have admission check pluggable, but it seems now that we have two types of admission checks, with two different interfaces.
this seems wrong.

Copy link
Contributor Author

@rahulgurnani rahulgurnani Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment. Thanks for the catch!

result, err := d.scheduler.Schedule(ctx, reqCtx.SchedulingRequest, d.toSchedulerPodMetrics(candidatePods))
// Prepare per request data
// TODO(rahulgurnani): Add retries and timeout in the preparedata step.
d.runPrepareDataPlugins(ctx, reqCtx.SchedulingRequest, snapshotOfCandidatePods)
Copy link
Contributor

@nirrozenbaum nirrozenbaum Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we create snapshot of candidate pods?
we should work with candidatePods and create a snapshot only when calling the scheduler.
this is true for both prep data and admit request.
helper functions in the director should not rely on the internal scheduler representation of the endpoints.

Copy link
Collaborator

@kfswain kfswain Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both prep data and admit request are request specific, and so if we add request specific data to the shared endpoints that could risk data corruption.

Snapshotting before these steps ensures that this data lifecycle is only in the context it is consumed in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intention of converting the endpoints representation to scheduler internal structure was only for the purpose of sending it to the scheduler.
PodMetrics has MetricsState behind atomic pointer and reading the metrics is an atomic operation (read all metrics in one operation).
I must be missing something although I've read the proposal doc.

return c
}

// WithPrepareDataPlugins sets the given plugins as the PrepareData plugins.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// WithPrepareDataPlugins sets the given plugins as the PrepareData plugins.
// WithDataProducers sets the given plugins as the DataProducer plugins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nirrozenbaum I reverted the change. The name for the plugin to be PrepareDataPlugin is apt becuase the plugin could produce as well as consume data. Please review the last commit.


result, err := d.scheduler.Schedule(ctx, reqCtx.SchedulingRequest, d.toSchedulerPodMetrics(candidatePods))
// Run admit request plugins
if !d.runAdmitRequestPlugins(ctx, reqCtx.SchedulingRequest, snapshotOfCandidatePods) {
Copy link
Contributor

@nirrozenbaum nirrozenbaum Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give an example of how AdmissionPlugin is using candidate pods?
IMO admission check should include only the request (including its metadata like headers) and the system state (like saturation or other system metrics).

@kfswain I really think we should think carefully on the interfaces and align on them.
it doesn't make sense to me to pass the candidate pods to admission check.

I would be happy to get an example for an admission check that is per request data and depends on the candidate pods (other than saturation that is checked separately).
I'm a bit concerned about over complication here, and unless there is a real use case for that we should probably wait with this addition (talking only about admission, data producer/consumer is clearly needed).

separately, I think there is a missing struct, like RequestState, that conceptually should be similar to PluginState we have today. RequestState should be created at the beginning of the request, and passed around for RequestDataProducers to be filled, and later on to ConsumerPlugins (probably instead of CycleState).

Copy link
Contributor Author

@rahulgurnani rahulgurnani Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am mostly following recommendation in https://docs.google.com/document/d/1EQwXL2pCuUyM1B917FUgP_8pFS3VF8F_bUfjy8IE7gM/edit?tab=t.vmaefhinvkl5#heading=h.5qzcvbilfn6d

I think we would be adding the first AdmitRequest plugin with latency predictor.

The AdmitRequest plugin would consume data produced by latency predictor (producer)

@kfswain please keep me honest here. Thanks!

@rahulgurnani rahulgurnani changed the title Add DataProducer PrepareData and Admission control plugins [WIP] Add DataProducer PrepareData and Admission control plugins Nov 11, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 11, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 11, 2025
@rahulgurnani rahulgurnani changed the title [WIP] Add DataProducer PrepareData and Admission control plugins Add DataProducer PrepareData and Admission control plugins Nov 11, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 11, 2025
@rahulgurnani rahulgurnani force-pushed the preparedata-changes branch 2 times, most recently from 9b145a6 to f9c580c Compare November 11, 2025 20:17
@rahulgurnani
Copy link
Contributor Author

Hi @kfswain , @nirrozenbaum

I updated the PR to not have the DAG validation and parallel execution changes. Instead, made some changes to migrate the prefix cache match plugin to use the new prepare data plugin. I will follow up with the DAG related changes in a follow up PR, they are currently staged at: rahulgurnani@18592b7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants