SPLAT-2587: aws/ccm: introduce configuration to CCM managed Security Groups for NLB#7460
SPLAT-2587: aws/ccm: introduce configuration to CCM managed Security Groups for NLB#7460mtulio wants to merge 4 commits intoopenshift:mainfrom
Conversation
|
@mtulio: This pull request references SPLAT-2587 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughThis pull request adds support for NLB (Network Load Balancer) managed security groups in the cloud controller manager (CCM). The changes extend IAM policies to grant ELB and EC2 security group management permissions, introduce a new feature gate ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/test ? |
|
/test e2e-aws |
|
Checked the CI artifact and it isn't reflecting the expected change: # control-plane-operator/controllers/hostedcontrolplane/v2/assets/aws-cloud-controller-manager/config.yaml
[Global]
Zone = us-east-1a
VPC = vpc-07fbab2214e685cbf
KubernetesClusterID = 9af1be1f93a6e8ab6bf7-mgmt
SubnetID = subnet-01ed2d9ae2b3baf6c
ClusterServiceLoadBalancerHealthProbeMode = SharedTrying again forcing directly to the template /test e2e-aws |
|
I see the hosted cluster is created patching the cloud-config as expected (it duplicated the entry due my last commit with debug forcing in two different places): Direct URL: ---
apiVersion: v1
data:
aws.conf: |-
[Global]
Zone = us-east-1a
....
NLBSecurityGroupMode = Managed
NLBSecurityGroupMode = Managed
kind: ConfigMap
metadata:
creationTimestamp: "2026-01-12T19:30:23Z"
name: aws-cloud-config
namespace: e2e-clusters-lnddg-request-serving-isolation-bn5hlThe job failed caused by timeout, which now can be increased a bit due the time to delete the SGs and controller waiting for dependencies, we may need to evaluate the possibility to bump from 15' timeout to higher, here is some analysis/recommendation from claude after parsing logs: Checking if TP will behave similar the regular job. |
|
TP job e2e-aws-techpreview got stuck when creating a mgr cluster. I am investigating the root cause. -- Looks like the RCA is caused by CNO failure, cascading the other operators. CCM looks healthy |
|
I've also confirmed the mgr cluster deployment isn't impacted by ccm nlb configuration as, apparently, the nested cluster isn't enabling this feature: |
|
Action item: restore the FG assessor logic enabling the CCM config to manage SG on NLB only when TPNU feature set |
c5d785a to
15ccbab
Compare
|
feature set assessor restored on cloud-config for HC, looks like presubmit image is now using stable version on CI, re-trying with TPNU feature set: /test e2e-aws-techpreview |
|
2/5 completed. Testing e2e-aws: /test e2e-aws |
|
I see e2e-aws job is also using TPNU hardcoded, my last commit is considering the env var to set FG, re-testing: /test e2e-aws |
|
/testwith openshift/hypershift/main/e2e-conformance openshift/origin#30525 |
|
/test e2e-aws-techpreview |
6ecaf57 to
e457ca6
Compare
|
/test e2e-aws,images,security,unit,verify,verify-deps,e2e-conformance |
|
/test e2e-aws |
| if cpContext.HCP.Spec.Configuration != nil && cpContext.HCP.Spec.Configuration.FeatureGate != nil { | ||
| featureSet = cpContext.HCP.Spec.Configuration.FeatureGate.FeatureSet | ||
| } | ||
| if featureSet == configv1.TechPreviewNoUpgrade { |
There was a problem hiding this comment.
we need to check the feature gate not the set.
|
/test e2e-aws-techpreview |
47c449d to
77595c5
Compare
|
/test e2e-aws |
|
Ensuring required jobs works correctly with HCP cluster state instead of Global (change proposed by Claude exploring more options on this PR): /test e2e-aws |
|
/test unit |
…e evaluation Updates AWS CCM configuration tests to explicitly verify that the config adapter reads feature gates from HCP.Spec.Configuration rather than relying on the global operator feature set. Changes: - Set global feature gate to Default at test start to prove the adapter correctly reads from HCP configuration, not the global gate - Added explanatory comments documenting this proves the fix for the issue where global gate was incorrectly used instead of per-cluster configuration This change is critical for validating the fix to PR openshift#7460 e2e test failures where: - E2E tests create clusters with TechPreviewNoUpgrade feature set - HypershiftOperator runs with Default feature set (HYPERSHIFT_FEATURESET env) - Tests were failing because adapter checked global gate (Default) instead of cluster's configuration (TechPreviewNoUpgrade) - Result: NLBSecurityGroupMode was missing from aws-cloud-config even though cluster's feature set should enable it With this test change, we verify all scenarios work correctly: - Default feature set: NLBSecurityGroupMode NOT added - TechPreviewNoUpgrade: NLBSecurityGroupMode = Managed added - CustomNoUpgrade with explicit enable: NLBSecurityGroupMode = Managed added All tests pass, proving the adapter now correctly evaluates per-cluster feature configuration regardless of the global operator feature set. Signed-off-by: Marco Braga <mrbraga@redhat.com> Assisted-by: Claude Sonnet 4.5 (via Claude Code)
|
Update Go deps for test (broken in /test e2e-aws |
|
/test e2e-aws-techpreview |
|
Returning on the remaining failures:
Proposed a fix for both, since this feature is 4.22 only, I am limiting the e2e tests to run only 4.22 owards. The unit assets was already updated with new CCM configuration used in clusters with feature set TPNU. /test unit |
This commit updates IAM permissions to support AWS CCM managing security groups for Network Load Balancers, following least-privilege security principles. Changes: - cmd/infra/aws/iam.go: * Added 19 specific ELBv2 permissions required for NLB lifecycle management: - CreateLoadBalancer, DeleteLoadBalancer, DescribeLoadBalancers - ModifyLoadBalancerAttributes - CreateTargetGroup, DeleteTargetGroup, DescribeTargetGroups (critical - was missing) - DescribeTargetGroupAttributes, ModifyTargetGroupAttributes - DescribeTargetHealth, RegisterTargets, DeregisterTargets - CreateListener, DeleteListener, DescribeListeners - SetSecurityGroups (required for managed SG feature) - DescribeTags, AddTags, RemoveTags * Added 10 specific EC2 permissions for security group management: - CreateSecurityGroup, DeleteSecurityGroup, DescribeSecurityGroups - AuthorizeSecurityGroupIngress, RevokeSecurityGroupIngress - DescribeSubnets, DescribeVpcs, DescribeInstances - CreateTags * Added comprehensive documentation explaining each permission category and why it's needed for the CCM's NLB management functionality. Root cause: CI job failures showed "AccessDenied" for elasticloadbalancing:CreateTargetGroup. Analysis of CCM controller logs revealed the exact permissions needed for the complete NLB provisioning workflow with managed security groups. This change ensures the CCM has sufficient permissions to manage NLBs with security groups while maintaining security best practices by granting only necessary permissions.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mtulio The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test e2e-aws |
|
I will fix the lint once I get the feedback of presubmit e2e jobs. |
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
|
/test e2e-aws |
Updates AGENTS.md to reference existing comprehensive development guides instead of duplicating documentation. Changes: - AGENTS.md: * Add "Local Development Environment" section with references to: - HACKING.md for general development workflows - .claude/skills/dev/ for step-by-step development tasks * List available skills: build-ho-image, build-cpo-image, install-ho-aws, create-hc-aws, e2e-run-aws, destroy-hc-aws This avoids documentation duplication and points developers to the detailed guides that include environment setup, prerequisites, troubleshooting, and iteration workflows.
Adds comprehensive E2E testing for the AWSServiceLBNetworkSecurityGroup feature gate and fixes critical test issues. Changes: - test/e2e/util/aws_ccm.go: * Add test: verify aws-cloud-config contains NLBSecurityGroupMode=Managed * Add test: create NLB service and verify managed security group is attached * Use AWS SDK v2 to describe load balancer and validate security groups - test/e2e/create_cluster_test.go: * Add nil-safety checks in TestCreateClusterCustomConfig to prevent panics * Add missing azureutil import - test/e2e/util/util.go: * Configure feature gate with featuregates.ConfigureFeatureSet() to sync global state with cluster's feature set for proper test validation - control-plane-operator/.../testdata: * Update test fixtures for GCP and TechPreviewNoUpgrade scenarios * Add CloudController field to GCP ServiceAccountsEmails (fixes "Required value" error)
|
/test e2e-aws |
| } | ||
|
|
||
| // Add NLBSecurityGroupMode when the AWSServiceLBNetworkSecurityGroup feature gate is enabled for this cluster. | ||
| // Check the feature gate based on the cluster's configured feature gate spec, not the global operator feature set. |
There was a problem hiding this comment.
Check the feature gate based on the cluster's configured feature gate spec, not the global operator feature set.
This is important: we want to check the FG of cluster not only the global, otherwise it will not detect changes on cluster creation leading to incorrect configuration (missing cloud-config entry)
...plane-operator/controllers/hostedcontrolplane/v2/cloud_controller_manager/aws/config_test.go
Outdated
Show resolved
Hide resolved
| customApiServerHost = fmt.Sprintf("api-custom-cert-%s.%s", entryHostedCluster.Spec.InfraID, serviceDomain) | ||
| } | ||
|
|
||
| // For AWS, use ExternalDNSDomain if set, otherwise fall back to BaseDomain for local development |
There was a problem hiding this comment.
This is required when we want to make sure e2e tests runs locally, different domains than the hard coded ones - that some developers does not have access to the Account.
|
Hi @muraee , we are not getting success on CI recently, but I have verified this changes a few times here: #7460 (comment) Would you mind reviewing this PR while infra gets stable? This must be the final version addressing your feedback, with a caveat on FG verification. |
|
/retest-required |
|
@mtulio: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@mtulio: This pull request references SPLAT-2587 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Implements AWS CCM NLBSecurityGroupMode configuration with cluster-scoped feature gate evaluation, enabling managed security groups for Network Load Balancers when AWSServiceLBNetworkSecurityGroup is enabled. Changes: - control-plane-operator/featuregates/featuregates.go: * Add IsFeatureEnabledInFeatureGateSpec() helper to evaluate feature gates from FeatureGateSpec (supports Default, TechPreviewNoUpgrade, CustomNoUpgrade) * Evaluates per-cluster config instead of global operator feature set - control-plane-operator/.../aws/config.go: * Use IsFeatureEnabledInFeatureGateSpec() with HCP.Spec.Configuration.FeatureGate * Set NLBSecurityGroupMode=Managed when feature gate is enabled - control-plane-operator/.../aws/config_test.go: * Add unit tests for all feature set scenarios (Default, TechPreview, CustomNoUpgrade) * Set global feature gate to Default at test start to verify adapter reads from HCP config, not global gate * Proves the fix where adapter was incorrectly checking global HYPERSHIFT_FEATURESET instead of cluster's HCP.Spec.Configuration.FeatureGate Why cluster-scoped evaluation: The adapter was incorrectly using the global feature gate (HYPERSHIFT_FEATURESET env var) instead of reading from HCP.Spec.Configuration.FeatureGate. This caused e2e test failures where clusters with TechPreviewNoUpgrade didn't get the config because the operator was running with Default feature set. This matches the pattern used by CVO, KAS, and other components. The signature will credit: - Author: Marco Braga (you) - AI Assistant: Claude Sonnet 4.5 via Claude Code
|
@coderabbitai resume |
|
/test e2e-aws |
✅ Actions performedReviews resumed. |
|
@mtulio: This pull request references SPLAT-2587 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
What this PR does / why we need it:
This change introduces a configuration on cloud-config to instruct controller to manage Security Group on NLBs on feature set TechPreviewNoUpgrade.
This proposal is required to signalize CCM (cloud-controller-manager) to managed security group every time a NLB is requested through a Service resource.
This implement the configuration added to Self-managed (through CCCMO) in 4.21 release - also under the feature set TechPreviewNoUpgrade, and GA in 4.22.
Which issue(s) this PR fixes:
Fixes SPLAT-2587
Special notes for your reviewer:
We need to evaluate the feature gate for the hosted cluster beforehand validating globally.
Checklist:
Summary by CodeRabbit
New Features
Tests
Documentation