Skip to content

test: add migration e2e test for OCM environment#2243

Merged
openshift-merge-bot[bot] merged 32 commits into
stolostron:mainfrom
yanmxa:fix-migration-ocm-e2e
Jan 22, 2026
Merged

test: add migration e2e test for OCM environment#2243
openshift-merge-bot[bot] merged 32 commits into
stolostron:mainfrom
yanmxa:fix-migration-ocm-e2e

Conversation

@yanmxa
Copy link
Copy Markdown
Contributor

@yanmxa yanmxa commented Jan 14, 2026

Related Jira Issue

https://issues.redhat.com/browse/ACM-27900

Summary

This PR adds end-to-end tests for cluster migration in pure OCM (Open Cluster Management) environment without ACM/MCE.

Changes

E2E Test (test/e2e/migration_test.go)

  • Added comprehensive migration e2e tests covering all phases:

    • Validating: Verify source/target hubs and cluster existence
    • Initializing: Create bootstrap secret and update klusterlet config
    • Deploying: Apply ManifestWork with migration resources
    • Registering: Monitor cluster registration on target hub
    • Completed: Verify successful migration
  • Added verifyAutoApproveUsersSupport() validation in BeforeSuite to ensure the ClusterManager CRD supports autoApproveUsers field before running tests

  • Added klusterlet restoration logic in AfterAll to maintain test environment

E2E Setup (test/script/e2e_setup.sh, test/script/util.sh)

  • Updated clusteradm to use --bundle-version=v1.1.0 which includes autoApproveUsers support in the ClusterManager CRD
  • Added KlusterletConfig CRD installation for migration tests
  • Added managed-serviceaccount addon installation on global hub

Agent Changes (agent/pkg/spec/migration/migration_to_syncer.go)

  • Added OCM ClusterRole fallback: tries ACM ClusterRole first (open-cluster-management:managedcluster:bootstrap:agent-registration), then falls back to OCM ClusterRole (open-cluster-management:bootstrap) for pure OCM environments
  • Added isOCMEnvironment() detection to handle environment differences
  • Added 1-minute delay in OCM environment during Initializing phase to allow ManifestWork to be applied

OCM autoApproveUsers Support

The migration functionality requires the autoApproveUsers field in the ClusterManager CRD to automatically approve CSRs for migrating clusters.

Key Finding: OCM v1.1.0 bundle correctly includes autoApproveUsers in the CRD schema. Using clusteradm init --bundle-version=v1.1.0 provides full support for this feature.

Testing

Tested with:

  • make e2e-test-migration - All 8 tests passed
  • Images: quay.io/myan/multicluster-global-hub-{operator,manager,agent}:latest

Test Flow

  1. Setup: Verify managed-serviceaccount addon and ClusterManager CRD support
  2. Create Migration CR: Trigger migration from hub1 to hub2
  3. Validating Phase: Manager validates source/target hubs and cluster existence
  4. Initializing Phase: Source agent creates ManifestWork with bootstrap secret and updated klusterlet config
  5. Deploying Phase: Target agent applies resources to managed cluster
  6. Registering Phase: Target agent monitors cluster registration with ReadOnly ManifestWork
  7. Completed Phase: Verify cluster is Available on target hub and removed from source hub
  8. Cleanup: Restore klusterlet and managed cluster acceptance

🤖 Generated with Claude Code

@yanmxa
Copy link
Copy Markdown
Contributor Author

yanmxa commented Jan 17, 2026

/retest

yanmxa added a commit to yanmxa/hub-of-hubs that referenced this pull request Jan 18, 2026
The clusteradm CLI version must match the OCM bundle version being used.
Since we're using --bundle-version=v1.1.0 in clusteradm init, we need
clusteradm CLI v1.1.1 (the CLI version that supports this bundle).

Root cause of e2e setup failure:
- clusteradm v1.0.1 CLI doesn't properly support --bundle-version=v1.1.0
- This causes "hub oriented command should not running against non-hub cluster" error
- The init command fails silently, leaving clusters in non-hub state

Fix:
- Update CLUSTERADM_VERSION from 1.0.1 to 1.1.1
- This matches the bundle version v1.1.0 used in init_hub function

Fixes CI test-e2e failure in PR stolostron#2243

Signed-off-by: Meng Yan <myan@redhat.com>
yanmxa and others added 27 commits January 22, 2026 12:46
The ClusterManager CRD update must be done AFTER clusteradm init creates
the initial CRD, not before. Otherwise, the outdated CRD from clusteradm
v1.1.0 bundle will overwrite our update.

Changes:
- Move CRD update from before OCM installation to after OCM installation
- Add comment explaining this must be done after clusteradm init
- Keep issue reference for tracking OCM bundle update

This ensures the migration e2e tests have access to the autoApproveUsers
field required for cluster migration functionality.

Related: open-cluster-management-io/ocm#1334

Signed-off-by: Meng Yan <myan@redhat.com>
The clusteradm CLI version must match the OCM bundle version being used.
Since we're using --bundle-version=v1.1.0 in clusteradm init, we need
clusteradm CLI v1.1.1 (the CLI version that supports this bundle).

Root cause of e2e setup failure:
- clusteradm v1.0.1 CLI doesn't properly support --bundle-version=v1.1.0
- This causes "hub oriented command should not running against non-hub cluster" error
- The init command fails silently, leaving clusters in non-hub state

Fix:
- Update CLUSTERADM_VERSION from 1.0.1 to 1.1.1
- This matches the bundle version v1.1.0 used in init_hub function

Fixes CI test-e2e failure in PR stolostron#2243

Signed-off-by: Meng Yan <myan@redhat.com>
Migration test changes cluster state (migrates hub1-cluster1 from hub1
to hub2), which breaks the BYO test suite that runs localpolicy tests
expecting clusters to be on their original hubs.

Test execution order:
1. localpolicy, backup, grafana, local-agent tests (on original setup)
2. prune test
3. clean globalhub
4. BYO tests (re-run localpolicy etc with BYO storage)
5. migration test (last, to avoid breaking BYO tests)

This ensures migration test won't affect other tests that depend on
cluster placement.

Fixes CI test-e2e failure where BYO localpolicy test failed because
hub1-cluster1 was no longer on hub1 after migration test.

Signed-off-by: Meng Yan <myan@redhat.com>
After e2e_clean_globalhub.sh undeploys the operator, the
multicluster-global-hub namespace is deleted. When migration tests
run afterward, e2e_run.sh fails trying to apply ConfigMap to a
non-existent namespace.

This fix ensures the namespace exists before applying the ConfigMap,
allowing tests to run successfully after cleanup.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
After BYO tests run in "mgh" namespace, NodePort 30080 is occupied
by the multicluster-global-hub-manager-nonk8s-service. When migration
tests try to create the same service in "multicluster-global-hub"
namespace, it fails because NodePort is cluster-scoped.

Add cleanup step after BYO tests to free up the NodePort before
migration tests run.

Error fixed:
Service "multicluster-global-hub-manager-nonk8s-service" is invalid:
spec.ports[0].nodePort: Invalid value: 30080: provided port is already allocated

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
…rator

The previous fix called e2e_clean_globalhub.sh which runs 'make undeploy',
deleting global operator CRDs and RBAC resources. This broke migration tests
that need the operator running.

Now only delete the MulticlusterGlobalHub instance in 'mgh' namespace,
keeping operator deployment intact for migration tests.

Error fixed:
- Operator failed to start: configmaps "controller-config" is forbidden
- BeforeSuite timeout waiting for operator lease update

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
Deleting MulticlusterGlobalHub CR doesn't immediately delete the Service.
The NodePort 30080 remains allocated, causing migration test to fail when
trying to create the same service.

Explicitly delete the service to free up NodePort 30080 before migration
test starts.

Error fixed:
Service "multicluster-global-hub-manager-nonk8s-service" is invalid:
spec.ports[0].nodePort: Invalid value: 30080: provided port is already allocated

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
This commit consolidates cleanup logic, removes redundant code, and improves
code maintainability based on the PR optimization plan.

Changes:
- Code simplification in test/e2e/migration_test.go:
  * Extracted isManagedClusterAvailable() and isManifestWorkApplied() helpers
  * Added addIfNotEmpty() helper to reduce repetitive if-statements
  * Commented out direct resource application code (lines 420-465) since
    ManifestWork is now working correctly with proper work-agent RBAC setup

- Cleanup logic consolidation:
  * Moved MulticlusterGlobalHub and service cleanup from test/Makefile to
    test/script/e2e_run_byo.sh, keeping all BYO test logic in one place
  * Cleanup now happens at the end of BYO tests, before migration tests

- Documentation cleanup:
  * Deleted archived ai-doc/migration-e2e-manual-test.md (automated test exists)

- Configuration simplification:
  * Removed hardcoded bundle-version=v1.1.0 from util.sh init_hub()
  * Now uses clusteradm's default bundle version matching the binary version
  * Introduced GLOBAL_HUB_KUBECONFIG variable in test/Makefile for reusability

Rationale:
- The ManifestWork implementation is working correctly in the e2e environment,
  making direct resource application redundant. Code is commented (not deleted)
  for easy restoration if needed in the future.
- Consolidating cleanup into the BYO script makes the test flow clearer and
  the Makefile simpler.
- Removing bundle-version parameter eliminates version inconsistency between
  clusteradm binary (v1.1.1) and bundle version (v1.1.0).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
The workaround that applied the latest ClusterManager CRD to get
autoApproveUsers support was unnecessary. Testing confirmed that all
OCM versions since v0.13.0 already include the autoApproveUsers field
in their bundled ClusterManager CRD.

Verified versions (all include autoApproveUsers):
- clusteradm v1.1.1 (bundle v1.1.1, operator v1.1.1)
- clusteradm v1.1.0 (bundle v1.1.0, operator v1.1.0)
- clusteradm v1.0.0 (bundle v1.0.0, operator v1.0.0)
- clusteradm v0.11.2 (bundle v0.16.1, operator v0.16.1)
- clusteradm v0.10.1 (bundle v0.15.2, operator v0.15.2)
- clusteradm v0.9.0 (bundle v0.14.0, operator v0.14.0)
- clusteradm v0.8.0 (bundle v0.13.0, operator v0.13.0)

The original issue report (open-cluster-management-io/ocm#1334) was
based on incorrect assumptions and has been closed.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
Add .claude/rules/ to .gitignore to keep project-specific Claude Code
rules locally without committing them to the repository. This allows
developers to maintain custom rules for their workflow while keeping
the repository clean.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
Remove .claude/rules/e2e-code-update.md from git tracking. This file
was accidentally committed in a previous commit and should remain local
only, as specified in .gitignore.

The file is kept locally for development use but excluded from the PR.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
Address PR review feedback to improve code quality:

- Shell scripts: Standardize variable references to use ${VAR} syntax
  for consistency and robustness across:
  * test/script/e2e_run.sh
  * test/script/e2e_run_byo.sh
  * test/script/e2e_setup.sh
  * test/script/event_exporter_kafka.sh

- Go code: Enhanced comments in migration_to_syncer.go to clarify:
  * OCM environment delay rationale (line 412)
  * ClusterRole priority detection logic (line 777)

These changes address automated review suggestions while maintaining
existing functionality.

Resolves: Review comments in stolostron#2243
Signed-off-by: Meng Yan <myan@redhat.com>
The migration e2e test was failing because the ClusterManager CRD
doesn't support the autoApproveUsers field, which is required for
auto-approval configuration during cluster migration.

This change updates the ClusterManager CRD to the latest version from
OCM main branch before running the tests, ensuring the CRD includes
support for autoApproveUsers field.

Error in CI:
  autoApproveUsers should be saved in ClusterManager. If this fails,
  apply the latest ClusterManager CRD from OCM main branch

Fix:
  Apply latest CRD in e2e_setup.sh after OCM installation completes

Resolves: Migration e2e test failure in PR stolostron#2243
Signed-off-by: Meng Yan <myan@redhat.com>
The previous fix only updated the ClusterManager CRD on the global hub,
but the migration e2e test verifies autoApproveUsers support on the
target hub (hub2), which is a managed hub.

This change ensures the ClusterManager CRD is updated on all hubs:
- Global hub (global-hub)
- All managed hubs (hub1, hub2, etc.)

This fixes the test failure where targetHubClient (hub2) was checking
for autoApproveUsers support but the CRD wasn't updated on that hub.

Error in CI:
  verifyAutoApproveUsersSupport on targetHubClient (hub2) was failing
  because CRD was only updated on global-hub

Fix:
  Loop through all managed hubs and apply CRD update to each

Resolves: Migration e2e test failure in PR stolostron#2243
Signed-off-by: Meng Yan <myan@redhat.com>
This reverts commits b880da5 and 05ece32.

According to open-cluster-management-io/ocm#1334,
the autoApproveUsers field has been present in ClusterManager CRD since
OCM v0.13.0 (March 2024). The manual CRD update was based on a
misunderstanding and is not needed.

The real issue with the e2e test failure needs to be investigated further.

Signed-off-by: Meng Yan <myan@redhat.com>
The autoApproveUsers field in ClusterManager only takes effect when the
ManagedClusterAutoApproval feature gate is enabled. The test was failing
because it was setting autoApproveUsers without enabling the required
feature gate.

According to OCM documentation and the ClusterManager type definition:
  // AutoApproveUser represents a list of users that can auto approve CSR
  // and accept client. This takes effect only when ManagedClusterAutoApproval
  // feature gate is enabled.

Changes:
- Enable ManagedClusterAutoApproval feature gate before testing autoApproveUsers
- Add check to avoid enabling the feature gate if already enabled
- Update comment to clarify the feature gate requirement

This also reverts the unnecessary CRD updates (commits b880da5 and 05ece32)
as the autoApproveUsers field has been present since OCM v0.13.0.
See: open-cluster-management-io/ocm#1334

Resolves: Migration e2e test failure in PR stolostron#2243
Signed-off-by: Meng Yan <myan@redhat.com>
…gate

The ClusterManager controller needs time to process the feature gate
change before we can successfully set autoApproveUsers. Without this
wait, the autoApproveUsers field gets silently dropped.

Changes:
- Add 5 second wait after enabling ManagedClusterAutoApproval feature gate
- Re-fetch ClusterManager object to ensure we have the latest state
- This ensures the feature gate is fully processed before setting autoApproveUsers

Resolves: autoApproveUsers not being saved in e2e test
Signed-off-by: Meng Yan <myan@redhat.com>
Replace hard-coded time.Sleep with Eventually polling pattern for better
reliability and clearer test intent. This approach:

1. Waits up to 2 minutes for ClusterManager controller to process feature gate
2. Polls every 5 seconds instead of blocking
3. Provides better error messages on timeout
4. Retries autoApproveUsers update in case of transient issues

Benefits:
- More robust: handles variable processing times
- Faster: succeeds as soon as ready instead of always waiting 5s
- Clearer: explicit timeout and polling interval
- Better error reporting: Eventually provides helpful timeout messages

Also use slices.Contains for cleaner code per linter suggestion.

Suggested-by: User feedback
Signed-off-by: Meng Yan <myan@redhat.com>
The previous approach of enabling the feature gate first and then setting
autoApproveUsers in a separate update was causing timing issues with the
ClusterManager webhook/controller.

Root cause: The ClusterManager controller may reset or validate the spec
after each update. Setting them separately can cause the second update to
fail because the webhook validation happens before the feature gate is
fully processed.

Solution: Mirror the actual migration code pattern by setting both the
feature gate and autoApproveUsers in a single update operation. This
ensures they are validated together and avoids race conditions.

Changes:
- Combine feature gate enablement and autoApproveUsers setting in one update
- Use Eventually with 2min timeout to handle transient conflicts
- Add better error messages showing actual autoApproveUsers value
- Remove separate wait step between feature gate and autoApproveUsers

This matches the production code pattern in migration_to_syncer.go where
both are set together in RegistrationConfiguration.

Resolves: autoApproveUsers timeout in e2e test
Signed-off-by: Meng Yan <myan@redhat.com>
The test shows Update() succeeds but Get() immediately returns empty
autoApproveUsers list. Adding detailed logging to diagnose:

- Log FeatureGates and AutoApproveUsers before update
- Log AutoApproveUsers value being set
- Log update success
- Log retrieved FeatureGates, AutoApproveUsers, and ResourceVersion after Get
- Log verification success

This will help identify if:
1. A webhook is stripping the autoApproveUsers field
2. A controller is reconciling and removing it
3. The feature gate is not properly enabled
4. There's a timing/caching issue with the Get operation

Diagnostic commit for PR stolostron#2243 e2e failure investigation

Signed-off-by: Meng Yan <myan@redhat.com>
The test was failing with:
  unknown field "spec.registrationConfiguration.autoApproveUsers"

Root cause: clusteradm v1.1.1 deploys OCM with registration-operator v1.1.1,
which uses an older ClusterManager CRD that doesn't include the
autoApproveUsers field.

Solution: Explicitly apply the ClusterManager CRD from OCM API v0.16.0
during e2e setup. This version includes the autoApproveUsers field that
is required for migration testing.

The autoApproveUsers field was added in OCM API v0.16.0 (March 10, 2024)
via PR stolostron#357. It allows specifying users/service accounts that can
auto-approve cluster registrations when the ManagedClusterAutoApproval
feature gate is enabled.

Fixes: stolostron#2243
Signed-off-by: Meng Yan <myan@redhat.com>
The previous fix attempted to update the CRD before clusteradm init,
but clusteradm init deploys its own CRD version which overwrites our
update. This caused the autoApproveUsers field to remain unavailable.

Solution: Move the CRD update into the init_hub() function in util.sh,
right after clusteradm init completes. This ensures the v0.16.0 CRD
with autoApproveUsers support is applied after clusteradm's deployment.

Changes:
- test/script/util.sh: Add CRD update after clusteradm init in init_hub()
- test/script/e2e_setup.sh: Remove the early CRD update that was being overwritten

This ensures hub1 and hub2 have the correct CRD version that supports
the autoApproveUsers field required for migration testing.

Fixes: stolostron#2243
Signed-off-by: Meng Yan <myan@redhat.com>
clusteradm v1.1.1 uses OCM API v1.1.0 which already includes the
autoApproveUsers field in ClusterManager CRD. Manual update should
not be necessary.

Let's test if clusteradm init deploys the correct CRD version.

Signed-off-by: Meng Yan <myan@redhat.com>
Root cause: The ClusterManager CRD in test/manifest/crd/ was outdated and
did not include the autoApproveUsers field. The install_mch() function
applies this CRD, which overwrites the correct version deployed by
clusteradm init.

Solution: Update the CRD to OCM API v1.1.0 which includes autoApproveUsers
field support. This matches the version used by clusteradm v1.1.1.

The autoApproveUsers field allows specifying users/service accounts that
can auto-approve cluster CSRs when the ManagedClusterAutoApproval feature
gate is enabled. This is required for the migration e2e tests.

Fixes: stolostron#2243
Signed-off-by: Meng Yan <myan@redhat.com>
…tants

- Remove getBootstrapClusterRoleName function, use isOCMEnvironment directly
- Update test from TestGetBootstrapClusterRoleName to TestIsOCMEnvironment
- Replace hardcoded phase strings with migrationv1alpha1 constants in e2e tests

Signed-off-by: Meng Yan <yanmxa@gmail.com>
Signed-off-by: Meng Yan <myan@redhat.com>
Modified isOCMEnvironment() to return (bool, error) to detect when
neither ACM nor OCM bootstrap ClusterRole exists. This fixes the
TestInitializingWithNoClusterRole test which expects an error when
no bootstrap ClusterRole is available.

Signed-off-by: myan <myan@redhat.com>
Signed-off-by: Meng Yan <myan@redhat.com>
TestHandleStage and TestInitializing tests were failing because
they lacked the required ClusterRole for isOCMEnvironment() check.
The code checks for either ACM or OCM bootstrap ClusterRole existence,
and returns an error if neither exists.

Added the ClusterRole to:
- TestHandleStage/Handle_initializing_stage
- TestInitializing/Successful_initializing_with_minimal_ClusterManager
- TestInitializing/Initializing_with_existing_ClusterManager_configuration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Meng Yan <myan@redhat.com>
@clyang82
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Jan 22, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clyang82, yanmxa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants