[CORE-12378] fix(QoS): Use QdiscReplace() instead of QdiscAdd()#11899
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug where Calico's QoS bandwidth controls would fail when a non-default qdisc (with handle != 0) already exists on a workload interface. The fix changes from using QdiscAdd() to QdiscReplace() in the netlink API calls, making the operation idempotent and able to handle pre-existing qdiscs. The PR also improves test coverage by adding a specific test case for this scenario and fixing an overzealous Skip() that prevented tests from running in iptables/nftables modes.
Changes:
- Switched from
QdiscAdd()toQdiscReplace()in two locations to handle pre-existing qdiscs - Added comprehensive test case for the non-default qdisc scenario
- Enabled Felix debug logging and removed overzealous Skip() in QoS FV tests
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| felix/dataplane/linux/qos/qos.go | Changed QdiscAdd() to QdiscReplace() in CreateEgressQdisc() and createTBF() functions to handle existing qdiscs |
| felix/fv/qos_controls_test.go | Added test case for non-default qdisc scenario, enabled debug logging, removed Skip() that prevented iptables/nftables mode tests, added BeforeEach cleanup |
0622308 to
ed062ca
Compare
|
|
||
| topt.DelayFelixStart = true | ||
| topt.TriggerDelayedFelixStart = true | ||
| topt.FelixLogSeverity = "Debug" |
There was a problem hiding this comment.
Do you mean to check this change in?
There was a problem hiding this comment.
yes, absolutely! the issue we're fixing only shows up in the debug logs, so I think it makes sense to have them in this test
There was a problem hiding this comment.
I guess it's OK to leave this in. But you only mean for debugging the problem, right? I.e. it is not the case that there was only an issue when log level was debug.
| for i := range len(w) { | ||
| w[i].WorkloadEndpoint.Spec.QoSControls = nil | ||
| w[i].UpdateInInfra(infra) | ||
| Eventually(tc.Felixes[i].ExecOutputFn("ip", "r", "get", fmt.Sprintf("10.65.%d.2", i)), "10s").Should(ContainSubstring(w[i].InterfaceName)) |
There was a problem hiding this comment.
Is this check effective? I presume the route was already there even with the QoS in place?
There was a problem hiding this comment.
I believe I arrived at this because the route churns when the workload endpoint is updated (if my memory doesn't fail me), so I added a wait until it shows up again...
There was a problem hiding this comment.
Hmm, that sounds surprising. I would expect TC updates to be independent of route programming, and that a TC update would not require a route change. Might be a real problem hiding here if you really saw route churn.
There was a problem hiding this comment.
I don't think it's related to tc, but the FV test infra, or maybe I was too zealous when writing those into every workload update, I'll merge this PR and investigate separately
There was a problem hiding this comment.
createTBF and updateTBF are now identical. Can we deduplicate them?
There was a problem hiding this comment.
thanks for catching it! done
ed062ca to
bca0441
Compare
Use QdiscReplace() instead of QdiscAdd() so that adding the TBF qdiscs needed for QoS controls with tc does not error out when there is an existing non-default (handle != 0) qdisc on the interface for any reason. Add a test case to the felix FVs to cover this. Also, enable felix debug logging on the QoS felix FVs, and remove overzealous Skip() that was resulting in no test cases running on iptables/nftables modes.
bca0441 to
c4053db
Compare
nelljerram
left a comment
There was a problem hiding this comment.
Looks good but I would appreciate digging a bit more on the possible route churn, in case there is a real problem there.
Use QdiscReplace() instead of QdiscAdd() so that adding the TBF qdiscs needed for QoS controls with tc does not error out when there is an existing non-default (handle != 0) qdisc on the interface for any reason.
Add a test case to the felix FVs to cover this.
Also, enable felix debug logging on the QoS felix FVs, and remove overzealous Skip() that was resulting in no test cases running on iptables/nftables modes.
Description
Related issues/PRs
Todos
Release Note
Reminder for the reviewer
Make sure that this PR has the correct labels and milestone set.
Every PR needs one
docs-*label.docs-pr-required: This change requires a change to the documentation that has not been completed yet.docs-completed: This change has all necessary documentation completed.docs-not-required: This change has no user-facing impact and requires no docs.Every PR needs one
release-note-*label.release-note-required: This PR has user-facing changes. Most PRs should have this label.release-note-not-required: This PR has no user-facing changes.Other optional labels:
cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.