[CORE-12378] fix(QoS): Use QdiscReplace() instead of QdiscAdd()#11899

Merged
coutinhop merged 1 commit into master from pedro-CORE-12378
Mar 2, 2026


@coutinhop
Member

Use QdiscReplace() instead of QdiscAdd() so that adding the TBF qdiscs needed for QoS controls with tc does not error out when there is an existing non-default (handle != 0) qdisc on the interface for any reason.

Add a test case to the felix FVs to cover this.

Also, enable felix debug logging on the QoS felix FVs, and remove an overzealous Skip() that resulted in no test cases running in iptables/nftables modes.

Description

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Fix failure to enable ingress bandwidth QoS controls when a non-default qdisc previously existed on the workload interface (handle != 0).

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@coutinhop coutinhop requested a review from nelljerram February 20, 2026 23:49
@coutinhop coutinhop self-assigned this Feb 20, 2026
Copilot AI review requested due to automatic review settings February 20, 2026 23:49
@coutinhop coutinhop requested a review from a team as a code owner February 20, 2026 23:49
@coutinhop coutinhop added the release-note-required and docs-not-required labels Feb 20, 2026
@marvin-tigera marvin-tigera added this to the Calico v3.32.0 milestone Feb 20, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a bug where Calico's QoS bandwidth controls would fail when a non-default qdisc (with handle != 0) already exists on a workload interface. The fix changes from using QdiscAdd() to QdiscReplace() in the netlink API calls, making the operation idempotent and able to handle pre-existing qdiscs. The PR also improves test coverage by adding a specific test case for this scenario and fixing an overzealous Skip() that prevented tests from running in iptables/nftables modes.

Changes:

  • Switched from QdiscAdd() to QdiscReplace() in two locations to handle pre-existing qdiscs
  • Added comprehensive test case for the non-default qdisc scenario
  • Enabled Felix debug logging and removed overzealous Skip() in QoS FV tests

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| felix/dataplane/linux/qos/qos.go | Changed QdiscAdd() to QdiscReplace() in CreateEgressQdisc() and createTBF() functions to handle existing qdiscs |
| felix/fv/qos_controls_test.go | Added test case for non-default qdisc scenario, enabled debug logging, removed Skip() that prevented iptables/nftables mode tests, added BeforeEach cleanup |

Comment thread felix/fv/qos_controls_test.go Outdated
Comment thread felix/fv/qos_controls_test.go Outdated
@coutinhop coutinhop force-pushed the pedro-CORE-12378 branch 2 times, most recently from 0622308 to ed062ca Compare February 23, 2026 17:49
Member

@nelljerram nelljerram left a comment

A few queries/nits.


topt.DelayFelixStart = true
topt.TriggerDelayedFelixStart = true
topt.FelixLogSeverity = "Debug"
Member

Do you mean to check this change in?

Member Author

yes, absolutely! the issue we're fixing only shows up in the debug logs, so I think it makes sense to have them in this test

Member

I guess it's OK to leave this in. But you only mean for debugging the problem, right? I.e. it is not the case that there was only an issue when log level was debug.

for i := range len(w) {
w[i].WorkloadEndpoint.Spec.QoSControls = nil
w[i].UpdateInInfra(infra)
Eventually(tc.Felixes[i].ExecOutputFn("ip", "r", "get", fmt.Sprintf("10.65.%d.2", i)), "10s").Should(ContainSubstring(w[i].InterfaceName))
Member

Is this check effective? I presume the route was already there even with the QoS in place?

Member Author

I believe I arrived at this because the route churns when the workload endpoint is updated (if my memory doesn't fail me), so I added a wait until it shows up again...

Member

Hmm, that sounds surprising. I would expect TC updates to be independent of route programming, and that a TC update would not require a route change. Might be a real problem hiding here if you really saw route churn.

Member Author

I don't think it's related to tc; more likely it's the FV test infra, or maybe I was overzealous in adding those waits to every workload update. I'll merge this PR and investigate separately.

Comment thread felix/dataplane/linux/qos/qos.go Outdated
Member

createTBF and updateTBF are now identical. Can we deduplicate them?

Member Author

thanks for catching it! done
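As a rough illustration of the deduplication discussed in this thread: once creation and update both go through a replace-style call, a single helper can serve both paths. The sketch below is an assumption about the general shape only. `tbfParams`, `qdiscReplacer`, `applyTBF`, and `recorder` are hypothetical names, and the real signatures in felix/dataplane/linux/qos/qos.go are not shown in this conversation.

```go
package main

import "fmt"

// tbfParams holds hypothetical token-bucket settings.
type tbfParams struct {
	RateBps    uint64
	BurstBytes uint32
}

// qdiscReplacer is a hypothetical seam standing in for the underlying
// replace-style netlink call, so the helper can be exercised with a fake.
type qdiscReplacer interface {
	ReplaceTBF(iface string, p tbfParams) error
}

// applyTBF is the single code path for both first-time creation and
// later updates: with replace semantics the two cases need no branching.
func applyTBF(r qdiscReplacer, iface string, p tbfParams) error {
	if p.RateBps == 0 {
		return fmt.Errorf("tbf on %s: rate must be non-zero", iface)
	}
	if err := r.ReplaceTBF(iface, p); err != nil {
		return fmt.Errorf("replacing tbf qdisc on %s: %w", iface, err)
	}
	return nil
}

// recorder is a fake replacer that just records what it was asked to do.
type recorder struct{ calls []string }

func (r *recorder) ReplaceTBF(iface string, p tbfParams) error {
	r.calls = append(r.calls, fmt.Sprintf("%s rate=%d burst=%d", iface, p.RateBps, p.BurstBytes))
	return nil
}

func main() {
	r := &recorder{}
	// First call plays the role of "create", the second of "update".
	_ = applyTBF(r, "cali1234", tbfParams{RateBps: 1_000_000, BurstBytes: 32_000})
	_ = applyTBF(r, "cali1234", tbfParams{RateBps: 2_000_000, BurstBytes: 64_000})
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```

Passing the replacer as an interface is a design choice of this sketch, made so the helper runs without touching a real network interface.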

Member

@nelljerram nelljerram left a comment

Looks good but I would appreciate digging a bit more on the possible route churn, in case there is a real problem there.

@coutinhop coutinhop merged commit 9afd2b5 into master Mar 2, 2026
3 checks passed
@coutinhop coutinhop deleted the pedro-CORE-12378 branch March 2, 2026 18:08

Labels

  • docs-not-required: Docs not required for this change
  • release-note-required: Change has user-facing impact (no matter how small)


4 participants