e2e/loadbalancer: added hairpin connection cases #1161

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from mtulio:e2e-hairpin
Jul 17, 2025

@mtulio
Contributor

@mtulio mtulio commented Jun 16, 2025

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

This PR implements the hairpin connection test cases and exposes an issue with NLBs using the internal scheme: the connection fails when the client tries to access a Service load balancer that is hosted on the same node.

The hairpin failure occurs because the client IP preservation attribute is set to true (the default), and the Service does not provide an interface to prevent the issue.

The e2e test is expected to pass, to prevent permanent failures in CI; the underlying problem is tracked in issue #1160.
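To make the failing condition concrete, here is a minimal sketch that models the scenario as described above. It is only an illustration of this description, not code from the PR: the `lbConfig` type and the `hairpinAtRisk` function are hypothetical names, and the rule encoded is an assumption drawn from this PR text (internal NLB, client IP preservation enabled, client on the same node as the target).

```go
package main

import "fmt"

// lbConfig is a hypothetical, simplified view of a Service load balancer.
type lbConfig struct {
	Kind             string // "clb" or "nlb"
	Internal         bool   // true for the internal scheme
	PreserveClientIP bool   // target group client IP preservation
}

// hairpinAtRisk reports whether, under the assumptions stated in this PR,
// a client running on the same node as the target is expected to fail when
// it connects through the load balancer.
func hairpinAtRisk(lb lbConfig, clientOnTargetNode bool) bool {
	return lb.Kind == "nlb" && lb.Internal && lb.PreserveClientIP && clientOnTargetNode
}

func main() {
	lbs := []lbConfig{
		{Kind: "clb", Internal: true},
		{Kind: "nlb", Internal: true, PreserveClientIP: true},
		{Kind: "nlb", Internal: false, PreserveClientIP: true},
	}
	for _, lb := range lbs {
		// The client is assumed to run on the same node as the backend.
		fmt.Printf("%s internal=%t: at risk=%t\n", lb.Kind, lb.Internal, hairpinAtRisk(lb, true))
	}
}
```

In this model only the internal NLB with client IP preservation enabled is flagged, which matches the case tracked in #1160.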

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

These tests are important to increase coverage of the scenarios that CCM declares as supported.

I also believe we can remove the hairpin cases for LBs with the internet-facing (public) scheme: their traffic would traverse a VPC gateway (IGW/NGW), which masquerades the real source IP, so they would not reproduce the problem we are trying to expose in #1160. Thoughts?
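For reference, the load balancer type and scheme discussed here are selected through Service annotations. The sketch below assumes the `service.beta.kubernetes.io/aws-load-balancer-type` and `service.beta.kubernetes.io/aws-load-balancer-internal` annotations; the helper name `lbAnnotations` is hypothetical, and whether the tests in this PR build their Services this way is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	// Annotation selecting the load balancer type ("nlb" for NLB).
	annotationLBType = "service.beta.kubernetes.io/aws-load-balancer-type"
	// Annotation requesting the internal scheme instead of internet-facing.
	annotationLBInternal = "service.beta.kubernetes.io/aws-load-balancer-internal"
)

// lbAnnotations returns the Service annotations for the requested
// combination. With no annotations, a CLB with the internet-facing
// scheme is assumed.
func lbAnnotations(nlb, internal bool) map[string]string {
	ann := map[string]string{}
	if nlb {
		ann[annotationLBType] = "nlb"
	}
	if internal {
		ann[annotationLBInternal] = "true"
	}
	return ann
}

func main() {
	ann := lbAnnotations(true, true) // internal NLB, the case from #1160
	keys := make([]string, 0, len(ann))
	for k := range ann {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output order
	for _, k := range keys {
		fmt.Printf("%s: %q\n", k, ann[k])
	}
}
```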

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 16, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 16, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 16, 2025
@k8s-ci-robot
Contributor

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 16, 2025
@mtulio
Contributor Author

mtulio commented Jun 16, 2025

Hi @kmala, would you mind marking this PR as ok-to-test so the new jobs can run, please?

I also have a question in the description about the public cases. I used them at the beginning of the exploration, but it looks like we don't need them. LMK WDYT. Thanks

cc @elmiko

@mtulio mtulio changed the title e2e/loadbalancer: implement hairpin connection cases e2e/loadbalancer: added hairpin connection cases Jun 16, 2025
@kmala
Member

kmala commented Jun 16, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 16, 2025
@mtulio
Contributor Author

mtulio commented Jun 16, 2025

@mtulio: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cloud-provider-aws-e2e | 24a0041 | link | true | /test pull-cloud-provider-aws-e2e |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

CI infra issue.

/test pull-cloud-provider-aws-e2e

@mtulio mtulio force-pushed the e2e-hairpin branch 2 times, most recently from a623f3d to 5099f7f Compare June 16, 2025 16:07
@mtulio
Contributor Author

mtulio commented Jun 16, 2025

I am observing a permanent CI failure when launching the cluster: it tries to use an image that is no longer available:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1161/pull-cloud-provider-aws-e2e/1934643980113285120#1:build-log.txt%3A751

s invalid: could not find Image for "099720109477/ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250502.1"

Hi @kmala @elmiko, do you know whether this comes from the test framework, or whether it is possible to use a valid image in the CCM repo?

@mtulio
Contributor Author

mtulio commented Jun 17, 2025

An issue has been opened to track the CI problem: #1167

@mtulio
Contributor Author

mtulio commented Jun 18, 2025

/assign @elmiko @kmala

@mtulio
Contributor Author

mtulio commented Jun 18, 2025

It looks like pull-cloud-provider-aws-e2e has been running (and stuck) for the last 48h. I just stopped it and am trying to run it again:

/test pull-cloud-provider-aws-e2e

@mtulio
Contributor Author

mtulio commented Jun 19, 2025

/test pull-cloud-provider-aws-e2e-kubetest2

@mtulio
Contributor Author

mtulio commented Jun 20, 2025

Looks like #1167 has been resolved. I manually stopped the running job (42h+) and am triggering it again:

/test pull-cloud-provider-aws-e2e

Contributor

@elmiko elmiko left a comment

looking mostly good, i just have a question about the global variables.

@mtulio mtulio force-pushed the e2e-hairpin branch 2 times, most recently from 2120cf7 to e7d0731 Compare July 7, 2025 20:13
@mtulio
Contributor Author

mtulio commented Jul 7, 2025

looking mostly good, i just have a question about the global variables.

Thanks, @elmiko, good suggestions. Fixed.

Contributor

@elmiko elmiko left a comment

thanks Marco, i think that makes it a little less fragile.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 7, 2025
@mtulio
Contributor Author

mtulio commented Jul 11, 2025

Hey @cartermckinnon, would you mind taking a look at these e2e test improvements, which help us troubleshoot known issues? This won't fix #1160, but it will help us test it. Thanks!

@kmala
Member

kmala commented Jul 14, 2025

@oliviassss can you take a look?

@oliviassss
Contributor

oliviassss commented Jul 15, 2025

@mtulio Hi, thanks for the contribution. Just for my understanding: this PR mostly adds test coverage for the internal NLB hairpin issue, but in the test case itself we expect the test to fail and skip the failed test? What's the main purpose of adding the test cases?

From the AWS ELB doc, I think the internal NLB will have the hairpin issue with the UDP or TCP_UDP protocol, since the preserve client IP attribute cannot be disabled (and I don't think CCM provides an annotation to disable this TG attribute anyway).

By default, client IP preservation is enabled (and can't be disabled) for instance and IP type target groups with UDP and TCP_UDP protocols. However, you can enable or disable client IP preservation for TCP and TLS target groups using the preserve_client_ip.enabled target group attribute.

ref: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/edit-target-group-attributes.html#client-ip-preservation
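The rule quoted above can be written down as a small lookup. This is a sketch of the quoted sentence only, not of CCM or AWS SDK behavior; the function name `canDisableClientIPPreservation` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// canDisableClientIPPreservation encodes the quoted documentation rule:
// TCP and TLS target groups can toggle preserve_client_ip.enabled, while
// UDP and TCP_UDP target groups always preserve the client IP.
func canDisableClientIPPreservation(protocol string) (bool, error) {
	switch strings.ToUpper(protocol) {
	case "TCP", "TLS":
		return true, nil
	case "UDP", "TCP_UDP":
		return false, nil
	default:
		// Any other value is outside the quoted rule.
		return false, fmt.Errorf("unsupported target group protocol %q", protocol)
	}
}

func main() {
	for _, p := range []string{"TCP", "TLS", "UDP", "TCP_UDP"} {
		ok, err := canDisableClientIPPreservation(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: can disable client IP preservation = %t\n", p, ok)
	}
}
```

Under this rule, a TCP target group could in principle avoid the hairpin problem by disabling the attribute, while UDP and TCP_UDP target groups could not.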

@mtulio
Contributor Author

mtulio commented Jul 15, 2025

Hi @oliviassss , answering your questions:

this PR mostly adds the test coverage for internal NLB hairpin issue, but in the test case itself we expect the test to fail and skip the failed test?

Correct, we are skipping only the NLB test case as it works on CLB. https://github.com/kubernetes/cloud-provider-aws/pull/1161/files#diff-05b1c14f2de829d8a0c5f65b1b492a9ed9ab9d100ce6daa89d2d2347c8a14c77R122-R160

What's the main purpose for adding the test cases?

The main purpose is to exercise the hairpin connection scenario for both load balancer types supported by CCM, CLB and NLB, while skipping the NLB case to prevent known failures in CCM CI.

This problem was unknown until now. We are exposing the scenario as a "known issue" for CCM in #1160, and using this e2e test as a helper to reproduce and fix it in follow-up PRs.
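The skip approach described above can be sketched as a table of cases in which the NLB entry carries a reference to the known issue. This is not the code from the PR: the `hairpinCase` type, its fields, the case names, and the `plan` helper are hypothetical and only illustrate the idea.

```go
package main

import "fmt"

// hairpinCase is a hypothetical test-table entry.
type hairpinCase struct {
	Name      string
	LBType    string // "clb" or "nlb"
	SkipIssue string // non-empty when the case is a known failure
}

// hairpinCases lists both load balancer types; only NLB is marked as skipped.
func hairpinCases() []hairpinCase {
	return []hairpinCase{
		{Name: "CLB internal hairpin", LBType: "clb"},
		{Name: "NLB internal hairpin", LBType: "nlb", SkipIssue: "#1160"},
	}
}

// plan splits the cases into those that run and those that are skipped,
// keeping the issue reference so the skip stays visible in the output.
func plan(cases []hairpinCase) (run, skipped []string) {
	for _, c := range cases {
		if c.SkipIssue != "" {
			skipped = append(skipped, fmt.Sprintf("%s (known issue %s)", c.Name, c.SkipIssue))
			continue
		}
		run = append(run, c.Name)
	}
	return run, skipped
}

func main() {
	run, skipped := plan(hairpinCases())
	fmt.Println("run:", run)
	fmt.Println("skipped:", skipped)
}
```

Keeping the skipped case in the table, rather than deleting it, means a follow-up fix only needs to clear the skip marker to turn the case back on.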

@mtulio
Contributor Author

mtulio commented Jul 16, 2025

Hi @oliviassss @kmala, would you mind also triaging the related issue #1160, please?

Implement the hairpin connection test cases and expose an issue on NLB
with the internal scheme, which fails when the client tries to access a
Service load balancer that is hosted on the same node.

The hairpin failure occurs because the client IP preservation attribute
is set to true (the default), and the Service does not provide an
interface to prevent the issue.

The e2e test is expected to pass, to prevent permanent failures in CI;
the underlying problem is tracked in issue kubernetes#1160.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 16, 2025
@kmala
Member

kmala commented Jul 17, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kmala, oliviassss

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 17, 2025
@elmiko
Contributor

elmiko commented Jul 17, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2025
@k8s-ci-robot k8s-ci-robot merged commit b80f44a into kubernetes:master Jul 17, 2025
11 checks passed
k8s-ci-robot added a commit that referenced this pull request Jan 21, 2026
-#1215-#1217-#1214-upstream-release-1.32

Automated cherry pick of #1153: e2e/deps: enhance test scenarios with NLB
#1161: e2e/loadbalancer: implement hairpin connection cases
#1215: refact: e2e tests documenting hooks and enhance logging/steps
#1217: e2e/debug: increase data collection on e2e failures
#1214: doc/service: describe supported target group attributes
k8s-ci-robot added a commit that referenced this pull request Jan 21, 2026
-#1215-#1217-#1214-upstream-release-1.33

Automated cherry pick of #1153: e2e/deps: enhance test scenarios with NLB
#1161: e2e/loadbalancer: implement hairpin connection cases
#1215: refact: e2e tests documenting hooks and enhance logging/steps
#1217: e2e/debug: increase data collection on e2e failures
#1214: doc/service: describe supported target group attributes
k8s-ci-robot added a commit that referenced this pull request Jan 21, 2026
-#1215-#1217-#1214-upstream-release-1.31

Automated cherry pick of #1153: e2e/deps: enhance test scenarios with NLB
#1161: e2e/loadbalancer: implement hairpin connection cases
#1215: refact: e2e tests documenting hooks and enhance logging/steps
#1217: e2e/debug: increase data collection on e2e failures
#1214: doc/service: describe supported target group attributes