
[WIP] Update Gateway API Inference Extension to v1#906

Closed
KillianGolds wants to merge 4 commits into master from feature/update-gie-to-v1

@KillianGolds KillianGolds commented Sep 26, 2025

What this PR does / why we need it:
Migrates the Gateway API Inference Extension (GIE) from v1alpha2 to v1.0.0 while maintaining backward compatibility through a dual-pool strategy.
The controller creates both v1 and v1alpha2 InferencePool objects simultaneously, using v1 as the primary interface while allowing v1alpha2 to function as a fallback in environments without GIE v1 controller integration (e.g., current OpenShift).

Key changes:

  • Upgrade Kubernetes dependencies (v0.33.4 → v0.34.1), Gateway API (v1.3.0 → v1.4.0), and KEDA (v2.16.1 → v2.18.0)
  • Implement dual-pool creation strategy: typed client for v1 InferencePool, dynamic client for v1alpha2
  • Add HTTPRoute migration logic with traffic weight management (v1: 100%, v1alpha2: 0% initially)
  • Implement one-way migration: once v1 pool becomes ready, permanently switch to v1 (no fallback)
  • Add ToCorev1PodSpec() method for structured-merge-diff v6 compatibility
  • Restore critical config merge logic to prevent nil container fields
  • Update RBAC permissions for GIE v1 resources (inferencemodels, inferenceobjectives)
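
The weight handling and one-way switch described in the list above can be sketched roughly as follows. This is not the PR's code: `Weights` and `decideWeights` are hypothetical names, and the rule that traffic shifts to v1alpha2 only while v1 has never been ready is an assumption read from the description, not confirmed by the diff.

```go
package main

import "fmt"

// Weights holds the HTTPRoute backend weights for the two pools.
// Hypothetical type, for illustration only.
type Weights struct {
	V1       int32
	V1Alpha2 int32
}

// decideWeights returns the weights to apply and whether the route is
// (now) permanently committed to the v1 pool.
//
// migrated:      a previous reconcile already committed to v1
// v1Ready:       the v1 InferencePool currently reports ready
// v1alpha2Ready: the v1alpha2 InferencePool currently reports ready
func decideWeights(migrated, v1Ready, v1alpha2Ready bool) (Weights, bool) {
	// One-way migration: once v1 has been ready, never fall back.
	if migrated || v1Ready {
		return Weights{V1: 100, V1Alpha2: 0}, true
	}
	// Assumption: fallback applies only while v1 has never become ready.
	if v1alpha2Ready {
		return Weights{V1: 0, V1Alpha2: 100}, false
	}
	// Initial state from the PR description: v1 100%, v1alpha2 0%.
	return Weights{V1: 100, V1Alpha2: 0}, false
}

func main() {
	// Environment without a GIE v1 controller: only v1alpha2 becomes ready.
	w, committed := decideWeights(false, false, true)
	fmt.Println(w, committed)

	// v1 becomes ready later: the switch is permanent from here on.
	w, committed = decideWeights(committed, true, true)
	fmt.Println(w, committed)
}
```

Keeping the decision in a pure function like this would make the "no fallback after commit" rule easy to cover with table-driven tests, independent of the Kubernetes clients.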

Which issue(s) this PR fixes:
Fixes RHOAIENG-34472

Type of changes

  • New feature (non-breaking change which adds functionality)
  • Dependency upgrade (Kubernetes v0.34, Gateway API v1.4, KEDA v2.18)

Upstream Blockers Resolved:
This PR encountered several upstream dependency issues requiring coordination with multiple projects:

  1. GIE v1.0.0 Validation Bug (kubernetes-sigs/gateway-api-inference-extension#1679)

    • Invalid kubebuilder validation markers in GIE v1.0.0 CRDs
    • Used local fork workaround until upstream fix was merged and backported
    • Status: ✅ Resolved upstream
  2. Kubernetes v0.34 Upgrade Requirement

    • GIE v1 + Gateway API v1.4 dependency chain required upgrading from Kubernetes v0.33.4 to v0.34.1
    • Required refactoring code to accommodate API changes
    • Status: ✅ Completed
  3. Gateway API Validation Markers Bug (kubernetes-sigs/gateway-api#4172)

    • Same validation marker issues found in Gateway API v1.4.0
    • Created personal fork workaround for CI unblocking
    • Status: ⏳ Upstream PR in review, using temporary fork

Feature/Issue validation/testing:

  • Unit tests for llmisvc controller components
  • Config merge and template processing tests
  • Router discovery and URL generation tests
  • HTTPRoute migration and traffic weight tests
  • Dual-pool fallback strategy tests
  • Integration tests (38/38 passing)
  • E2E tests (in progress via manual CI triggering to manage quota)

Special notes for your reviewer:

  1. Dual-Pool Strategy: Creates both v1 and v1alpha2 InferencePools to support environments with/without GIE v1 controller
  2. One-Way Migration: HTTPRoute traffic weights implement permanent migration (v1alpha2 → v1) once v1 pool is ready
  3. Backward Compatibility: Existing v1alpha2 deployments continue to work unchanged
  4. Temporary Fork Workaround: Using personal Gateway API fork until upstream validation fix merges (does not affect production, only CI/build)
  5. OpenShift Compatibility: Fallback to v1alpha2 ensures functionality until OpenShift integrates GIE v1 controller
  6. CI Quota Management: Running E2E tests sequentially via draft PR to avoid overwhelming cluster quota

Known Limitations:

  • Awaiting upstream Gateway API validation fix merge before removing temporary fork dependency
  • E2E tests may fail in environments without GIE v1alpha2 controller (CI env has controller)

Please confirm that if this PR changes any image versions, then that’s the sole change this PR makes.

  • ❌ This PR includes dependency upgrades alongside feature implementation

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:

Migrate Gateway API Inference Extension from v1alpha2 to v1.0.0 with a backward-compatible
dual-pool fallback strategy. Upgrades Kubernetes to v0.34.1, Gateway API to v1.4.0, and
KEDA to v2.18.0.

Re-running failed tests:

  • /rerun-all - rerun all failed workflows
  • /rerun-workflow - rerun a specific failed workflow (one at a time)
  • /test e2e-llm-inference-service - manually trigger specific E2E test (use when managing CI quota)

Signed-off-by: Killian Golds <kgolds@redhat.com>

openshift-ci Bot commented Sep 26, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


coderabbitai Bot commented Sep 26, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


openshift-ci Bot commented Sep 26, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KillianGolds
Once this PR has been reviewed and has the lgtm label, please assign hdefazio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread charts/llmisvc-resources/templates/config-llm-router-route.yaml
Signed-off-by: Killian Golds <kgolds@redhat.com>
Signed-off-by: Killian Golds <kgolds@redhat.com>
var needsAnnotationUpdate bool

// Check if we've already committed to v1
if llmSvc.Annotations != nil && llmSvc.Annotations[AnnotationInferencePoolMigrated] == "v1" {

I would use the child objects to record the persistent failover to v1

(like HTTPRoute?)

a good pattern to follow is to never touch the spec and metadata of user managed objects
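
A minimal sketch of what the suggestion in this thread might look like: the "committed to v1" marker lives on a controller-owned child (for example the HTTPRoute) rather than on the user-managed object. `childRoute` is a hypothetical stand-in for the real child type, and the annotation key value is invented for illustration; the actual value of `AnnotationInferencePoolMigrated` is not shown in this PR.

```go
package main

import "fmt"

// Hypothetical key; the real constant's value is not shown in the PR.
const annotationInferencePoolMigrated = "example.invalid/inference-pool-migrated"

// childRoute stands in for a controller-owned child object such as an
// HTTPRoute. Only the annotations matter for this sketch.
type childRoute struct {
	Annotations map[string]string
}

// committedToV1 reports whether the child already records the switch.
// Reading from a nil map is safe in Go, so no nil check on Annotations
// is needed here.
func committedToV1(r *childRoute) bool {
	return r != nil && r.Annotations[annotationInferencePoolMigrated] == "v1"
}

// markCommittedToV1 records the switch on the child and reports whether
// the object changed, so the caller knows if an update is required.
func markCommittedToV1(r *childRoute) bool {
	if r == nil || committedToV1(r) {
		return false
	}
	if r.Annotations == nil {
		r.Annotations = make(map[string]string)
	}
	r.Annotations[annotationInferencePoolMigrated] = "v1"
	return true
}

func main() {
	route := &childRoute{}
	fmt.Println(markCommittedToV1(route)) // first call changes the object
	fmt.Println(markCommittedToV1(route)) // second call is a no-op
}
```

Because the controller owns the child, writing the marker there leaves the user's spec and metadata untouched.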

if llmSvc.Annotations == nil {
	llmSvc.Annotations = make(map[string]string)
}
llmSvc.Annotations[AnnotationInferencePoolMigrated] = "v1"

See my comment above

	_, err = res.Update(ctx, u, metav1.UpdateOptions{})
	return err
}
_, err = res.Create(ctx, u, metav1.CreateOptions{})

As we discussed, we need to mimic the Reconcile function in lifecycle_crud.go

@pierDipi pierDipi left a comment

Feedback:

  • re-route the PR to target the release branch release-v0.15
  • add tests in controller_int_test.go

  Apply lifecycle_crud.go patterns to dynamic client reconciliation:
  ownership checks, semantic equality, event emission. Break down large
  migration function into helpers. Store state on child objects.

Signed-off-by: Killian Golds <kgolds@redhat.com>
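
The reconcile shape named in the commit message above (ownership check, then semantic equality, then create or update) might be sketched as below. The contents of lifecycle_crud.go are not shown in this PR, so this is an assumed outline with hypothetical types (`object`, `action`, `plan`). It uses `reflect.DeepEqual` as a standard-library stand-in; a real controller would more likely use `equality.Semantic.DeepEqual` from k8s.io/apimachinery and emit events through a recorder.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// action is the step the reconciler should take for one child object.
type action string

const (
	actionCreate action = "Create"
	actionUpdate action = "Update"
	actionNone   action = "None"
)

var errNotOwned = errors.New("existing object is not controlled by this owner")

// object is a hypothetical, simplified view of an unstructured child.
type object struct {
	OwnerUID string
	Spec     map[string]any
}

// plan decides what to do with a child given the live object (nil if it
// does not exist) and the desired state.
func plan(existing *object, desired object) (action, error) {
	if existing == nil {
		return actionCreate, nil
	}
	// Ownership check: never overwrite an object another owner controls.
	if existing.OwnerUID != desired.OwnerUID {
		return actionNone, errNotOwned
	}
	// Equality check: skip the API call when nothing changed.
	if reflect.DeepEqual(existing.Spec, desired.Spec) {
		return actionNone, nil
	}
	return actionUpdate, nil
}

func main() {
	desired := object{OwnerUID: "uid-1", Spec: map[string]any{"targetPort": 8000}}
	live := &object{OwnerUID: "uid-1", Spec: map[string]any{"targetPort": 9000}}

	a, err := plan(live, desired)
	// Stand-in for event emission on the owning resource.
	fmt.Printf("event: action=%s err=%v\n", a, err)
}
```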
@github-project-automation github-project-automation Bot moved this from New/Backlog to Done in ODH Model Serving Planning Oct 17, 2025
@spolti spolti deleted the feature/update-gie-to-v1 branch December 9, 2025 20:16