Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inadvertent txqueuelen being set to zero #1100

Merged
merged 1 commit into from
Oct 2, 2024

Conversation

gudmundur
Copy link
Contributor

When diagnosing networking issues, I noticed that txqueuelen on veth pairs and tap devices was set to zero. In investigating the issue, I found that most usages of netlink.LinkAttrs in this repository uses the struct literal, which in the netlink docs is advised against unless TxQLen is explicitly set.

Note NewLinkAttrs constructor, it sets default values in structure. For now it sets only TxQLen to -1, so kernel will set default by itself. If you're using simple initialization(LinkAttrs{Name: "foo"}) TxQLen will be set to 0 unless you specify it like LinkAttrs{Name: "foo", TxQLen: 1000}.
(link)

When using the ip command directly without specifying txqueuelen, Linux will automatically to 1000:

$ ip tuntap add tap0 mode tap
$ ip link show tap0
4: tap0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 8a:06:b2:30:4d:1e brd ff:ff:ff:ff:ff:ff

This change goes through every single usage of netlink.LinkAttrs{ and replaces it with the netlink.NewLinkAttrs constructor, and then overrides the individual fields in the struct.

I even noticed that in the bridge plugin, this is somewhat called out when using the literal. I've replaced this explicitly in d376c8d though.

@gudmundur
Copy link
Contributor Author

Been trying to sort out what's up with the failing sbr test and have been unable to reproduce. I see that it's also failing on another recent pull request (#1099, failing test). Happy to provide a fix if I had a way to reproduce the failure. 😭

@squeed
Copy link
Member

squeed commented Sep 26, 2024

Thanks for the PR. Have you seen #1097? Are these two PRs duplicate or complementary?

@gudmundur
Copy link
Contributor Author

@squeed The second issue #1097 describes is somewhat related, but it misses the mark on setting it to zero meaning that the kernel will pick a default.

This can be illustrated as such:

$ ip link add dev vm1 type veth
$ ip link show vm1
... snip snip snip ...
20: vm1@vm2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 76:8b:12:5d:e9:c0 brd ff:ff:ff:ff:ff:ff

Note the qlen 1000 on the interface. Now if we run the same command, but set txqueuelen 0 the following happens:

$ ip link add dev vm1 txqueuelen 0 type veth
$ ip link show vm1
24: vm1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default 
    link/ether 6a:19:4e:0b:3a:5e brd ff:ff:ff:ff:ff:ff

Note the missing qlen 1000. If you look at any regular device that is created by ip they will always have a qlen 1000. The netlink library propagates the 0 to the syscall when creating the link. I remember before fixing this, that we would see a handful of these log lines in the kernel logs. In digging a little through the Linux codebase it seems that a value of zero is caught in some places. In Googling this a little more, I also found that the Cilium folks had this issue in the past.

What I did notice on our end when we worked around this issue, was that we saw less tx packet drops. Now I can't remember the specifics on whether that was on veth or tap.

@squeed
Copy link
Member

squeed commented Sep 30, 2024

(@LionelJouin found the cause of the flake, he should have a fix soon)

@squeed squeed self-requested a review September 30, 2024 15:12
Copy link
Member

@squeed squeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Mind if I squash this in to a single commit?

@squeed
Copy link
Member

squeed commented Sep 30, 2024

Rebased to pick up SBR fixes.

TxQLen was unintentionally set to 0 due to a struct literal.

Signed-off-by: Gudmundur Bjarni Olafsson <[email protected]>
@gudmundur
Copy link
Contributor Author

@squeed Took care of the squashing for you. Reworded the commit message for a single commit as well.

@squeed squeed merged commit 3a49cff into containernetworking:main Oct 2, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants