Context
I work in the K8s platform team at Nationwide Building Society in the UK. We have been using EKS for ~4 years and currently have ~200 clusters in the estate across production and non-production. We use both self-built AL2 EKS worker nodes and the public Bottlerocket AMIs. Before every release of our internal platform we run a variety of NFR benchmarking and sanity-checking tools, including the k8s project netperf test suite: https://github.com/kubernetes/perf-tests/tree/master/network/benchmarks/netperf
Problem
Recently we found the tests suddenly reporting a significant drop in metrics for Bottlerocket. Take the results below from a single run, which are representative of everything we've seen.
Focusing on just two tests: pod-to-pod traffic on the same node and across nodes (Mbps).
| Test | v1.19.3 avg (Mbps) | v1.19.4 avg (Mbps) | Delta |
| --- | --- | --- | --- |
| 1 iperf TCP. Same VM using Pod IP | 22,455 | 14,847 | ~33% |
| 3 iperf TCP. Remote VM using Pod IP | 5,394 | 3,687 | ~31% |
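For clarity, the Delta column is the relative drop from the v1.19.3 average to the v1.19.4 average. A minimal sketch of that arithmetic, using the raw figures from the table above:

```python
# Relative throughput drop between v1.19.3 and v1.19.4, from the table above.
def drop(before: float, after: float) -> float:
    return (before - after) / before

print(f"same node:  {drop(22_455, 14_847):.1%}")   # ~33.9%
print(f"cross node: {drop(5_394, 3_687):.1%}")     # ~31.6%
```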
I've managed to narrow the problem down by version: all versions of Bottlerocket at v1.19.4 and above exhibit the issue, while all versions at v1.19.3 and below do not.
I've done ~20 test runs on the later Bottlerocket versions, and we also have hundreds of test reports stored from various Bottlerocket versions over the past ~2 years, which are completely consistent with no meaningful deviation up until v1.19.4. We also have our self-built AL2 nodes as a control, and they have not seen a drop.
Our testing is always done on a cluster of 5 ON_DEMAND m5a.xlarge nodes.
Checking
Looking into the changes, unless something is hidden or an accidental bug was introduced at some layer of the stack, the only likely offender I can see is this: All kernels: remove network scheduling "Class Based Queuing" and "Differentiated Services Marker," formerly loadable kmods - #3865
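One way to confirm whether that change is even in play on an affected node would be to check whether the CBQ / DSMARK schedulers can still be loaded at all. A rough sketch of that check is below; the sch_cbq / sch_dsmark module names and running modprobe from the admin container are my assumptions, not something taken from the changelog:

```python
# Rough availability check for the schedulers removed in #3865, intended to be
# run from a shell with access to the node's kernel modules (e.g. the
# Bottlerocket admin container). Module names are assumptions.
import subprocess

for module in ("sch_cbq", "sch_dsmark"):
    result = subprocess.run(
        ["modprobe", "--dry-run", module],
        capture_output=True,
        text=True,
    )
    state = "loadable" if result.returncode == 0 else "not available"
    print(f"{module}: {state} {result.stderr.strip()}".rstrip())
```

On v1.19.3 and below I would expect both to be loadable; on v1.19.4 and above both should be gone, which would at least confirm the suspect change is present on the affected AMIs.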
Impact
I don't have any data to suggest this will have a negative impact on our workloads at the moment. We are not usually anywhere near network throughput capacity on any given node, and I have no data to suggest this slows down traffic more generally. But the results are significant, and I think it would be important to look into this for users overall.
Possibilities
1. This is a genuine max network throughput performance regression, which might have other network performance implications.
2. Some change in the Bottlerocket system means it interacts differently with the iperf3 tool, producing different results, even though performance has not actually regressed.
3. Something else.
Ask
Please can the Wizards 🧙 on the Amazon / Bottlerocket project side have a look into this? I think you will have better tools, efficiency, and domain knowledge to confirm, falsify, or explain what is going on and progress this.
Other
From what I've looked at so far, this should be replicable with any version of EKS. To replicate the higher-performance results you will need a version where the v1.19.3 and earlier AMIs are still available, though you may have more efficient ways of replicating things. I think raw iperf3 tests between pods should be sufficient to show results of the same order, or otherwise help falsify this; a sketch of what I have in mind follows below.
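For example, something along these lines, driven via kubectl, should give numbers of the same order. The networkstatic/iperf3 image, pod names and timings are illustrative assumptions on my part, not what the netperf suite itself uses:

```python
# Minimal pod-to-pod iperf3 sketch driven via kubectl. To compare same-node vs
# cross-node throughput, pin the two pods to specific nodes (e.g. with
# --overrides setting spec.nodeName) and run the test twice.
import json
import subprocess
import time

def kubectl(*args: str) -> str:
    return subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    ).stdout

# Server pod running iperf3 in listen mode.
kubectl("run", "iperf3-server", "--image=networkstatic/iperf3",
        "--restart=Never", "--", "-s")
kubectl("wait", "--for=condition=Ready", "pod/iperf3-server", "--timeout=120s")
server_ip = json.loads(
    kubectl("get", "pod", "iperf3-server", "-o", "json")
)["status"]["podIP"]

# Client pod targeting the server's pod IP for 30 seconds.
kubectl("run", "iperf3-client", "--image=networkstatic/iperf3",
        "--restart=Never", "--", "-c", server_ip, "-t", "30")
time.sleep(60)  # crude wait for the 30-second test to finish
print(kubectl("logs", "iperf3-client"))
```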
and below AMIs are available but you may also have other ways of replicating things more efficiently. I think raw iperf3 tests between pods should be sufficient to show results of the same order or otherwise help falsify.If you try to run the same netperf tooling then be aware that you will need to mess with the config to remove some incorrect arguments provided to the container which don't align with the version of the program and prevent it from running by default. https://github.com/kubernetes/perf-tests/blob/master/network/benchmarks/netperf/launch.go#L268