Skip to content

[WIP] GIE v1 Migration#941

Closed
KillianGolds wants to merge 53 commits into
release-v0.15from
RHOAIENG-34472-dev
Closed

[WIP] GIE v1 Migration#941
KillianGolds wants to merge 53 commits into
release-v0.15from
RHOAIENG-34472-dev

Conversation

@KillianGolds
Copy link
Copy Markdown

@KillianGolds KillianGolds commented Oct 16, 2025

What this PR does / why we need it:
Migrates the Gateway API Inference Extension (GIE) from v1alpha2 to v1.0.0 while maintaining backward compatibility through a dual-pool strategy.
The controller creates both v1 and v1alpha2 InferencePool objects simultaneously, using v1 as the primary interface while allowing v1alpha2 to function as a fallback in environments without GIE v1 controller integration (e.g., current OpenShift).

Key changes:

  • Upgrade Kubernetes dependencies (v0.33.4 → v0.34.1), Gateway API (v1.3.0 → v1.4.0), and KEDA (v2.16.1 → v2.18.0)
  • Implement dual-pool creation strategy: typed client for v1 InferencePool, dynamic client for v1alpha2
  • Add HTTPRoute migration logic with traffic weight management (v1: 100%, v1alpha2: 0% initially)
  • Implement one-way migration: once v1 pool becomes ready, permanently switch to v1 (no fallback)
  • Add ToCorev1PodSpec() method for structured-merge-diff v6 compatibility
  • Restore critical config merge logic to prevent nil container fields
  • Update RBAC permissions for GIE v1 resources (inferencemodels, inferenceobjectives)

Which issue(s) this PR fixes:
Fixes RHOAIENG-34472

Type of changes

  • New feature (non-breaking change which adds functionality)
  • Dependency upgrade (Kubernetes v0.34, Gateway API v1.4, KEDA v2.18)

Upstream Blockers Resolved:
This PR encountered several upstream dependency issues requiring coordination with multiple projects:

  1. GIE v1.0.0 Validation Bug (kubernetes-sigs/gateway-api-inference-extension#1679)

    • Invalid kubebuilder validation markers in GIE v1.0.0 CRDs
    • Used local fork workaround until upstream fix was merged and backported
    • Status: ✅ Resolved upstream
  2. Kubernetes v0.34 Upgrade Requirement

    • GIE v1 + Gateway API v1.4 dependency chain required upgrading from Kubernetes v0.33.4 to v0.34.1
    • Required refactoring code to accommodate API changes
    • Status: ✅ Completed
  3. Gateway API Validation Markers Bug (kubernetes-sigs/gateway-api#4172)

    • Same validation marker issues found in Gateway API v1.4.0
    • Created personal fork workaround for CI unblocking
    • Status: ⏳ Upstream PR in review, using temporary fork

Feature/Issue validation/testing:

  • Unit tests for llmisvc controller components
  • Config merge and template processing tests
  • Router discovery and URL generation tests
  • HTTPRoute migration and traffic weight tests
  • Dual-pool fallback strategy tests
  • Integration tests (38/38 passing)
  • E2E tests (in progress via manual CI triggering to manage quota)

Special notes for your reviewer:

  1. Dual-Pool Strategy: Creates both v1 and v1alpha2 InferencePools to support environments with/without GIE v1 controller
  2. One-Way Migration: HTTPRoute traffic weights implement permanent migration (v1alpha2 → v1) once v1 pool is ready
  3. Backward Compatibility: Existing v1alpha2 deployments continue to work unchanged
  4. Temporary Fork Workaround: Using personal Gateway API fork until upstream validation fix merges (does not affect production, only CI/build)
  5. OpenShift Compatibility: Fallback to v1alpha2 ensures functionality until OpenShift integrates GIE v1 controller
  6. CI Quota Management: Running E2E tests sequentially via draft PR to avoid overwhelming cluster quota

Known Limitations:

  • Awaiting upstream Gateway API validation fix merge before removing temporary fork dependency
  • E2E tests may fail in environments without GIE v1alpha2 controller (CI env has controller)

Please confirm that if this PR changes any image versions, then that’s the sole change this PR makes.

  • ❌ This PR includes dependency upgrades alongside feature implementation

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:

Migrate Gateway API Inference Extension from v1alpha2 to v1.0.0 with backward-compatible
dual-pool fallback strategy. Upgrades Kubernetes to v0.34.1, Gateway API to v1.4.0, and
KEDA to v2.18.0.

Re-running failed tests:

  • /rerun-all - rerun all failed workflows
  • /rerun-workflow - rerun a specific failed workflow (one at a time)
  • /test e2e-llm-inference-service - manually trigger specific E2E test (use when managing CI quota)

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Oct 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Oct 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KillianGolds
Once this PR has been reviewed and has the lgtm label, please assign andresllh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Oct 16, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch RHOAIENG-34472-dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds KillianGolds requested a review from pierDipi October 16, 2025 12:41
@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

7 similar comments
@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

3 similar comments
@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

Comment thread charts/kserve-resources/templates/clusterrole.yaml
Comment thread cmd/manager/main.go
Comment thread config/llmisvc/config-llm-router-route.yaml
Comment thread config/llmisvc/config-llm-router-route.yaml Outdated
Comment thread localmodel-agent.Dockerfile
Comment thread localmodel.Dockerfile
KillianGolds and others added 21 commits October 20, 2025 01:58
…TPRoute creation

Removed the premature InferencePool readiness check that caused E2E test failures.
The Gateway Controller can only populate InferencePool status after an HTTPRoute
references the pool. Checking readiness before the route exists always fails
due to empty status.

Readiness is now evaluated in EvaluateInferencePoolConditions(), which runs
after reconcileHTTPRoutes() creates the route. This allows the Gateway Controller
to populate pool status before evaluation, fixing the issue where neither v1 nor
v1alpha2 pools were becoming ready.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
…pport

OpenShift CI was using the standard Gateway Controller, which cannot manage
InferencePools, causing all scheduler-based E2E tests to fail. Switched to
upstream Istio with GIE v1.0.0 support to enable proper InferencePool reconciliation.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

fix(ci): install GIE controller in OpenShift E2E for InferencePool support

OpenShift CI was using the standard Gateway Controller, which cannot manage
InferencePools, causing scheduler E2E tests to fail with Neither v1 nor
v1alpha2 InferencePool is ready.

Switched to Istio with GIE support to provide a Gateway Controller that can
reconcile InferencePool resources and populate their Status.Parents[].

Changes:
- Install GIE v1.0.0 CRDs (InferencePool, InferenceModel)
- Deploy Istio v1.27-alpha with GIE controller support
- Create GatewayClass pointing to istio.io/gateway-controller
- Skip Gateway API CRDs (OpenShift 4.19.9+ manages them)

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Added ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true to istiod to enable
InferencePool status population by Istio’s Gateway Controller.

This hopefully fixes InferencePool is not ready errors in OpenShift E2E tests.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Implemented retry.RetryOnConflict in the Update function and
reconcileInferencePoolMigration to handle resource version conflicts
during v1 InferencePool migration.

Fixes a race condition where HTTPRoute status and spec updates
conflicted, causing four integration test failures.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
E2E tests expect the gatewayClassName openshift-default, but the Istio
setup script was creating istio, causing all tests to fail with
GatewayClass not found errors.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Replaced personal fork workaround with the upstream gateway-api main branch
now that PR kserve#4178 (validation markers fix) has been merged upstream.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
…test timing

Added retry logic with exponential backoff for finalizer add/remove operations
to handle resource version conflicts during concurrent reconciliation.

Updated the integration test to wait for asynchronous InferencePool migration
completion before asserting backend ref configuration, fixing a race condition
where the test checked HTTPRoute backend refs before the controller finished
updating weights.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Updated go.sum and openapi_generated.go with changes from the precommit hook.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Moved HTTPRoute trimming logic to apply to both scheduler and no-scheduler
cases. The well-known config defines three rules, but GIE v1 migration
requires a single rule with dual backend refs.

Fixes four failing integration tests.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Co-authored-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Signed-off-by: Killian Golds <91667190+KillianGolds@users.noreply.github.com>
…validation

Added missing required fields to the InferencePool status to satisfy
GIE v1 validation requirements.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Fixed a pointer comparison bug where Service backends were never detected.
Changed from pointer comparison to value comparison:

  - if backendRef.Kind == &serviceKind {
  + if backendRef.Kind != nil && *backendRef.Kind == serviceKind {

This bug caused status URLs to incorrectly use InferencePool paths instead
of Service paths, leading to duplicated paths in some test scenarios
(e.g., /v1/completions/v1/completions).

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Added rhcl_available fixture and skipif decorators to auth tests per PR #939.
Tests now skip gracefully when RHCL is not installed in the environment.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
…ions

LLMInferenceServiceConfig now uses plain string types instead of GIE v1 types
directly. This fixes deserialization issues in the ODH Model Controller.

The controller now converts these plain types to GIE v1 types at runtime when
creating actual InferencePool resources.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Fixed linter issues and updated generated code:
- Removed unnecessary string conversion in EPPServiceName
- Pre-allocated targetPorts slice in ConvertKServePoolToV1
- Fixed whitespace alignment in tests
- Regenerated OpenAPI schemas
- Updated go.sum dependencies

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Changed v1alpha2 pool weight from 0 to 1 to enable automatic fallback.

Previous behavior (weight: 0):
- v1 ready → uses v1
- v1 NOT ready → route fails

New behavior (weight: 1):
- v1 ready → uses v1 (priority with 100:1 ratio)
- v1 NOT ready → falls back to v1alpha2

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Changed from @pytest.mark.skipif decorator to runtime pytest.skip() within
the test body. The fixture cannot be referenced in the skipif expression context.

Error:
  NameError: name 'rhcl_available' is not defined

Also includes precommit whitespace fixes in openapi_generated.go.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Removed duplicate test fixture builder functions with hardcoded weights
in favor of flexible, weight-parameterized versions:

- Removed BackendRefInferencePool() → use BackendRefInferencePoolV1Alpha2(name, 1)
- Removed BackendRefService() → use BackendRefServiceWithWeight(name, 1)

Also fixed test expectations for the dual-pool fallback strategy:
- Changed v1alpha2 InferencePool weight from 0 to 1 to enable automatic fallback when v1 is not ready

Updated all test usages to use the parameterized builder functions.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

The controller was using .Owns() for v1 InferencePool, which only watches
metadata and spec changes, not status updates. This caused the migration
annotation test to timeout because the controller did not reconcile when
the pool became ready.

Added a .Watches() handler to trigger reconciliation on v1 InferencePool
status changes, allowing the migration logic to detect when v1 becomes
ready and set the migration annotation to switch traffic from v1alpha2.

Signed-off-by: Killian Golds <kgolds@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@KillianGolds
Copy link
Copy Markdown
Author

/test e2e-llm-inference-service

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Oct 20, 2025

@KillianGolds: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-llm-inference-service b68114a link true /test e2e-llm-inference-service

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@pierDipi
Copy link
Copy Markdown
Member

Superseded by #948

@pierDipi pierDipi closed this Oct 28, 2025
@github-project-automation github-project-automation Bot moved this from New/Backlog to Done in ODH Model Serving Planning Oct 28, 2025
@spolti spolti deleted the RHOAIENG-34472-dev branch December 9, 2025 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants