
Potential significant max network throughput performance regression #4051

Open · njgibbon opened this issue Jun 11, 2024 · 3 comments
Labels
status/needs-triage Pending triage or re-evaluation type/bug Something isn't working

Comments

@njgibbon

Hello,

Context
I work in the K8s platform team at Nationwide Building Society in the UK. We have been using EKS for ~4 years and currently have ~200 clusters in the estate across production and non-production. We use both self-built AL2 EKS worker nodes and the Bottlerocket public AMIs. Before every release of our internal platform we run a variety of NFR benchmarking and sanity-checking tools, including the k8s project netperf test suite: https://github.com/kubernetes/perf-tests/tree/master/network/benchmarks/netperf

Problem
Recently we found the tests suddenly reporting a significant drop in metrics for Bottlerocket. Take the results below from a single run, which are representative of everything we've seen:

Let's focus on two tests: pod-to-pod traffic on the same node and across nodes (Mbps).

| Test | v1.19.3 (Mbps, avg) | v1.19.4 (Mbps, avg) | Delta |
|------|---------------------|---------------------|-------|
| 1 iperf TCP. Same VM using Pod IP | 22,455 | 14,847 | ~33% |
| 3 iperf TCP. Remote VM using Pod IP | 5,394 | 3,687 | ~31% |

I've managed to narrow the problem down to a version boundary: all versions of Bottlerocket at v1.19.4 and above exhibit the issue, and all versions at v1.19.3 and below do not.

I've done ~20 test runs on the later versions of Bottlerocket, but we also have hundreds of test reports stored from various Bottlerocket versions over the past ~2 years, and they show complete consistency without meaningful deviation up until v1.19.4. We also have our self-built AL2 nodes as a control, and they have not seen a drop.

Our testing is always done on a cluster of 5 ON_DEMAND m5a.xlarge nodes.

Checking
Looking into the changes, unless something is hidden or an accidental bug was introduced at some layer of the stack, the only likely offender I can see is this: All kernels: remove network scheduling "Class Based Queuing" and "Differentiated Services Marker," formerly loadable kmods. - #3865. A quick way to check for this on a node is sketched below.
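A minimal sketch for checking a node, assuming shell access to the host (e.g. via the admin container) and that the modules in question are named `sch_cbq` and `sch_dsmark` (my reading of the linked change, not confirmed), with `tc` and `modprobe` available on the host:

```python
#!/usr/bin/env python3
"""Check whether the CBQ/DSMARK qdisc modules are available or loaded on this
node, and show which qdiscs are actually attached to interfaces. This is a
rough diagnostic sketch, not part of the netperf suite."""

import subprocess
from pathlib import Path

# Assumption: these are the kernel module names behind the removed kmods.
SUSPECT_MODULES = ["sch_cbq", "sch_dsmark"]


def module_loaded(name: str) -> bool:
    """True if the module is currently loaded (listed in /proc/modules)."""
    lines = Path("/proc/modules").read_text().splitlines()
    return any(line.split()[0] == name for line in lines)


def module_resolvable(name: str) -> bool:
    """True if modprobe can resolve the module (built-in or loadable) via a dry run."""
    return subprocess.run(["modprobe", "-n", name], capture_output=True).returncode == 0


def main() -> None:
    for mod in SUSPECT_MODULES:
        print(f"{mod}: loaded={module_loaded(mod)}, resolvable={module_resolvable(mod)}")
    # Show which qdiscs are actually in use on this node.
    print(subprocess.run(["tc", "qdisc", "show"], capture_output=True, text=True).stdout)


if __name__ == "__main__":
    main()
```

On v1.19.3 and below I would expect the modules to be resolvable; on v1.19.4+ the dry-run modprobe should fail if the kmods were indeed removed.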

Impact
I don't have any data to suggest this will have a negative impact on our workloads at the moment. We are not usually anywhere near network throughput capacity on any given node, and I have no data to suggest this slows down traffic more generally. But the results are significant and I think it would be important to look into this for users overall.

Possibilities

  • This is a genuine max network throughput performance regression which might have other network performance implications.
  • Some change in the Bottlerocket system means it interacts differently with the iperf3 tool, giving different results even though performance has not actually regressed.
  • Something else.

Ask
Please can the Wizards 🧙 on the Amazon / Bottlerocket project side have a look into this. I think you will have better tools, efficiency, and domain knowledge to confirm, falsify, or explain what is going on and progress this.

Other
From what I've looked at so far, this should be replicable with any version of EKS. To replicate the higher-performance results you will need an EKS version for which v1.19.3 and earlier AMIs are available, but you may also have other ways of replicating things more efficiently. I think raw iperf3 tests between pods should be sufficient to show results of the same order, or otherwise help falsify the finding (a rough sketch of such a test follows).
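A minimal pod-to-pod iperf3 sketch driven through kubectl, assuming kubectl is configured against the test cluster and that the `networkstatic/iperf3` image (or any image whose entrypoint is iperf3) is pullable; pin the client to the same node or a different node (e.g. with a nodeName override) to reproduce the two cases in the table above:

```python
#!/usr/bin/env python3
"""Rough pod-to-pod iperf3 throughput check. Not the netperf suite, just a
quick way to get numbers of the same order."""

import subprocess

IMAGE = "networkstatic/iperf3"  # assumption: any image with iperf3 as entrypoint works


def sh(*args: str) -> str:
    """Run a command and return its stdout, raising on failure."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout.strip()


def main() -> None:
    # Start the iperf3 server pod and wait for it to become Ready.
    sh("kubectl", "run", "iperf3-server", f"--image={IMAGE}", "--restart=Never", "--", "-s")
    sh("kubectl", "wait", "--for=condition=Ready", "pod/iperf3-server", "--timeout=120s")
    server_ip = sh("kubectl", "get", "pod", "iperf3-server", "-o", "jsonpath={.status.podIP}")

    # Run the client against the server's pod IP and print the iperf3 report.
    report = sh("kubectl", "run", "iperf3-client", f"--image={IMAGE}",
                "--restart=Never", "-i", "--rm", "--",
                "-c", server_ip, "-t", "30")
    print(report)

    # Clean up the server pod.
    sh("kubectl", "delete", "pod", "iperf3-server", "--wait=false")


if __name__ == "__main__":
    main()
```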

If you try to run the same netperf tooling, be aware that you will need to adjust the config to remove some incorrect arguments passed to the container; they don't align with the version of the program and prevent it from running by default. https://github.com/kubernetes/perf-tests/blob/master/network/benchmarks/netperf/launch.go#L268

@njgibbon njgibbon added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Jun 11, 2024
@larvacea
Member

Thanks for reporting this, and in particular, thanks for laying out your testing and results so clearly.

@larvacea
Member

Just to keep you up to date: I have one higher-priority issue consuming my attention, but this is in my queue and I haven't forgotten about it.

@njgibbon
Author

@larvacea thank you, no worries. At this point I'm just keen to see whether someone else can see the same sort of thing that I've seen.
