Skip to content

[client] Fix stale entries in nftables with no handle#5272

Merged
lixmal merged 1 commit intomainfrom
fix-nftables-stale-removal
Feb 12, 2026
Merged

[client] Fix stale entries in nftables with no handle#5272
lixmal merged 1 commit intomainfrom
fix-nftables-stale-removal

Conversation

@lixmal
Copy link
Copy Markdown
Collaborator

@lixmal lixmal commented Feb 6, 2026

Describe your changes

Summary

  • Fix refreshRulesMap() to rebuild the rules map from scratch instead of merging, so stale entries (Handle == 0) from failed flushes are naturally dropped
  • Add Handle == 0 guards to all delete paths (DeleteRouteRule, removeNatRule, removeLegacyRouteRule, DeleteDNATRule, RemoveInboundDNAT) that clean up stale entries instead of returning errors
  • Implement rollbackRules() for AddNatRule flush failures to clean up unflushed entries and decrement set counters
  • Change RemoveNatRule from fail-fast to error accumulation for cleanup resilience

Problem

When conn.Flush() fails in AddNatRule, rules are stored in the map with Handle == 0 (the kernel never assigned handles). The old refreshRulesMap() merged kernel rules into the existing map, so these stale entries persisted
indefinitely. Any subsequent delete operation would fail on Handle == 0, which blocked further operations like re-adding the same NAT rule.

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes

    • Improved firewall rule resilience: stale kernel entries are now detected and cleaned without causing errors, reducing failed operations.
    • Better cleanup and rollback for partially-applied NAT/route changes, preventing orphaned state and unnecessary flushes.
    • Enhanced logging for stale/missing rules and aggregated error reporting to reduce noisy failures.
  • Tests

    • Added tests covering stale-rule detection, cleanup, and NAT/route resilience.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

This change broadens nftables router resilience: it treats stale kernel rules (Handle == 0) as non-fatal, prunes them from in-memory state, introduces rollbackRules for cleanup on failed flushes, aggregates deletion errors with multierror, and updates refreshRulesMap to rebuild the rule map from kernel state. (≤50 words)

Changes

Cohort / File(s) Summary
Router logic & resilience
client/firewall/nftables/router_linux.go
Handle stale nftables entries (Handle==0) by logging warnings, pruning in-memory maps, and returning nil instead of errors; add rollbackRules helper to rollback unflushed NAT/forwarding additions; use multierror to aggregate deletion errors; conditionalize flushes (only when needed); improve logging and comments.
Tests for stale-entry behavior
client/firewall/nftables/router_linux_test.go
Add tests covering stale-handle scenarios: refresh map removes stale entries, DeleteRouteRule succeeds on stale entries, and AddNatRule succeeds when internal entries are stale; import ACL id helper used in tests.

Sequence Diagram(s)

(omitted — changes remain primarily internal to router logic and kernel state handling; conditions for sequence diagrams not met)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • pappz

Poem

"🐰 I hopped through nft rules, sniffed a stale,

Pruned the map tidy, left no trail,
Rolled back the bits that failed to flush,
Warnings not panic — no need to rush,
A happy router, neat and hale."

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers the problem, solution, and includes the completed bug-fix checkbox, but missing required issue ticket number/link and docs PR URL fields. Add the issue ticket number and link, and complete the docs PR URL field (or remove placeholder if not applicable).
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main fix: addressing stale nftables entries that lack handles.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-nftables-stale-removal

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/firewall/nftables/router_linux.go (1)

993-1012: ⚠️ Potential issue | 🟡 Minor

RemoveAllLegacyRouteRules does not handle Handle==0 entries.

All other delete paths (DeleteRouteRule, removeNatRule, removeLegacyRouteRule, DeleteDNATRule, RemoveInboundDNAT) now guard against stale entries with Handle==0. However, RemoveAllLegacyRouteRules at line 1004 calls r.conn.DelRule(rule) directly without checking rule.Handle == 0, which would fail for stale entries just like the original bug this PR addresses.

Proposed fix
 	for k, rule := range r.rules {
 		if !strings.HasPrefix(k, firewall.ForwardingFormatPrefix) {
 			continue
 		}
+		if rule.Handle == 0 {
+			log.Warnf("legacy forwarding rule %s has no handle, removing stale entry", k)
+			delete(r.rules, k)
+			continue
+		}
 		if err := r.conn.DelRule(rule); err != nil {
 			merr = multierror.Append(merr, fmt.Errorf("remove legacy forwarding rule: %v", err))
 		} else {
 			delete(r.rules, k)
 		}

@lixmal lixmal force-pushed the fix-nftables-stale-removal branch from 6a53884 to ca101f8 Compare February 6, 2026 19:11
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Feb 6, 2026

Quality Gate Failed Quality Gate failed

Failed conditions
23.9% Duplication on New Code (required ≤ 10%)

See analysis details on SonarQube Cloud

@lixmal lixmal merged commit 3dfa97d into main Feb 12, 2026
37 of 39 checks passed
@lixmal lixmal deleted the fix-nftables-stale-removal branch February 12, 2026 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants