cilium_host: hw csum failure #9482

Reported via slack: https://cilium.slack.com/archives/C1MATJ5U5/p1571823875167100
Could not reproduce on Azure so far. Tested with non-SR-IOV and SR-IOV networking. Need more info.
Since this is on Azure and related to hardware checksums, this may be of interest: coreos/tectonic-installer#1171 (comment). Basically, it was discovered that some subset of VMs provisioned in Azure (across multiple datacenters) have defects in the hardware checksum implementation. CoreOS at the time developed a test procedure to recreate VMs enough times to finally hit a faulty instance and then iterate on a solution. They had a repository or gist for this, but I didn't find it just from the issue or the linked MR. The final solution was to disable the hardware checksum offload for all VMs in Azure, and I run with that setting to this day. There may be a useful insight about VMs in Azure and checksum offloading here too: flannel-io/flannel#790 (comment).
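For reference, the workaround described in those threads boils down to turning off checksum offload with ethtool. A minimal sketch, assuming the VM's primary interface is eth0 (the interface name is a placeholder):

```sh
# Show the current checksum-offload settings (eth0 is an assumed name)
ethtool -k eth0 | grep checksum

# Disable TX checksum offload so checksums are computed in software
ethtool -K eth0 tx off
```

This trades a little CPU for correctness and does not persist across reboots, so it typically needs to be reapplied via a boot-time unit or network configuration.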
@cehoffman interesting, thanks a lot for the pointers! Dissection of the linear data above is:

So this must have come from a simple cilium health check.
Closing for now due to inactivity and lack of reproduction / more debugging data. If this is still an issue, we'll reopen and reinvestigate.
Hi,

Disabling TX checksum on the cilium_host device, as suggested in the flannel issue, seems to fix the problem (see the sketch below).

If you need me to provide more information, let me know.
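A minimal sketch of that workaround, assuming ethtool is available on each node; the exact command was not given in the comment, so this is an assumption based on the linked flannel issue:

```sh
# Disable TX checksum offload on Cilium's host-side interface (assumed command)
ethtool -K cilium_host tx off
```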
@knfoo Do you have a sysdump (or alternatively could you post the full config you are using + kernel + cilium version), and describe the issue in more detail and when it is occurring, e.g. what path is the packet traversing, etc.? The more info we have on this, the better we can track down the issue. Thanks for your help!
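For context, a sysdump can be collected with the cilium CLI. A sketch, noting that available flags vary between cilium-cli versions:

```sh
# Collect a cluster-wide debug archive (requires the cilium CLI and kubeconfig access)
cilium sysdump

# Limit collection to specific nodes on large clusters (assumed flag name)
cilium sysdump --node-list node-1,node-2
```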
@borkmann Sure, I will provide as much information as I can. First, I am new to Cilium - I have been a Calico user for a long time.

I am using metal3.io + clusterctl to deploy my baremetal k8s clusters. The integration uses kubeadm in the backend. I am currently running k8s 1.20.2, Ubuntu 20.04, linux-image-5.4.0-77-generic 5.4.0-77.78. I was following the https://docs.cilium.io/en/v1.9/gettingstarted/k8s-install-default/ guide to install Cilium, as I wanted to get something running fairly fast that I could play around with.

I did not do anything in particular to trigger the error - I am assuming that the health check triggers it. The k8s cluster was idle and not running any kind of workload when it happened.
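The quickest way to confirm the same failure on a node is to check the kernel log, since the "hw csum failure" message in the title is printed by the kernel's netdev_rx_csum_fault path. A sketch:

```sh
# Search for checksum-offload failures already in the kernel ring buffer
dmesg -T | grep "hw csum failure"

# Or follow the log live while the cilium health checks run
dmesg -wT | grep "hw csum failure"
```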
My team began seeing this issue across our Azure fleet 30 days ago as well. Is there an update on this ticket?
@kkourt are you still planning to work on this?
I'll note that this can also be seen on Equinix Metal c3.small.x86 machines based on the Supermicro X11SCM-F. It is NOT present on the c3.small.x86 machines based on the ASRockRack E3C246D4I-NL.

Supermicro:

ASRockRack:
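For anyone wanting to reproduce that comparison, the NIC and driver details can be gathered per machine. A sketch, with eth0 as a placeholder interface name:

```sh
# List the physical NICs on the machine
lspci | grep -i ethernet

# Identify the driver and firmware behind a given interface (eth0 is assumed)
ethtool -i eth0
```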
@errordeveloper it's on my list, but I have not managed to reproduce it yet.
Issue should be fixed by: #16604