Skip to content

[client] Try next DNS upstream on SERVFAIL/REFUSED responses#5163

Merged
lixmal merged 1 commit intomainfrom
dns-upstream-failover
Jan 23, 2026
Merged

[client] Try next DNS upstream on SERVFAIL/REFUSED responses#5163
lixmal merged 1 commit intomainfrom
dns-upstream-failover

Conversation

@lixmal
Copy link
Copy Markdown
Collaborator

@lixmal lixmal commented Jan 23, 2026

Describe your changes

Previously, we would only try the next upstream server if there was no response/timeout or any other query error.
Instead, we should also try the next upstream if the queried server reported issues.
This PR also improves logging for upstream failures.

Example when one fails (end result = success):


2026-01-23T10:25:34Z WARN [request_id: 06c63d15, dns_id: 8af7] client/internal/dns/upstream.go:239: 1/2 upstreams failed for domain=test.servfail.: 127.0.0.1:5353=SERVFAIL
2026-01-23T10:25:34Z TRAC [request_id: 06c63d15, dns_id: 8af7] client/internal/dns/handler_chain.go:264: response: domain=test.servfail. rcode=NOERROR answers=[192.0.2.200] upstream=127.0.0.1:5354 took=1.393757ms

Example when all fail:


2026-01-23T10:28:29Z ERRO [request_id: 3d5d0b4f, dns_id: 6468] client/internal/dns/upstream.go:241: 2/2 upstreams failed for domain=test.servfail.: 127.0.0.1:5353=SERVFAIL, 127.0.0.1:5354=SERVFAIL
2026-01-23T10:28:29Z TRAC [request_id: 3d5d0b4f, dns_id: 6468] client/internal/dns/handler_chain.go:264: response: domain=test.servfail. rcode=SERVFAIL answers=[] took=1.267542ms
█

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced DNS request logging with additional contextual information for improved debugging.
    • Improved upstream server failover logic with per-server failure tracking and detailed failure summaries.
  • Bug Fixes

    • DNS error responses now only returned when all upstream servers fail, preventing premature error responses.
  • Tests

    • Added comprehensive test coverage for DNS failover scenarios and failure handling.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 23, 2026

📝 Walkthrough

Walkthrough

The changes add enhanced DNS logging with dns_id context, introduce per-upstream failure tracking via a new upstreamFailure type, modify upstream request handling to aggregate failures across servers, and expand test coverage for failover scenarios without altering overall control flow.

Changes

Cohort / File(s) Summary
DNS Logging Enhancement
client/internal/dns/local/local.go
Extended logger initialization to include both request_id and dns_id fields, with dns_id derived from the DNS message ID formatted as 4-digit hexadecimal.
Upstream Failure Tracking & Error Handling
client/internal/dns/upstream.go
Introduced upstreamFailure type to track per-upstream errors; refactored tryUpstreamServers to aggregate failures and return both success flag and failure slice; modified queryUpstream and handleUpstreamError to return *upstreamFailure; added logUpstreamFailures to emit aggregated failure summaries; updated writeErrorResponse with domain-focused logging.
Upstream Failover Test Coverage
client/internal/dns/upstream_test.go
Added per-server mock resolver (mockUpstreamResolverPerServer) and tracking wrapper to test failover scenarios; introduced TestUpstreamResolver_Failover covering success on first upstream, server failures with subsequent success, timeouts, NXDOMAIN behavior, and combined failure modes; added TestUpstreamResolver_SingleUpstreamFailure and helper functions buildMockResponse and TestFormatFailures.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Local as Local DNS<br/>(upstream.go)
    participant U1 as Upstream 1
    participant U2 as Upstream 2
    participant Logger

    Client->>Local: DNS Query (request_id, dns_id)
    
    rect rgba(100, 150, 200, 0.5)
    Note over Local: tryUpstreamServers:<br/>Aggregate failures
    Local->>U1: Query
    alt U1 Success
        U1-->>Local: Response
        Local-->>Client: Return Response
    else U1 Failure
        U1-->>Local: Error/Timeout
        Local->>Local: Record upstreamFailure
        Local->>U2: Query
        alt U2 Success
            U2-->>Local: Response
            Local-->>Client: Return Response
        else U2 Failure
            U2-->>Local: Error/Timeout
            Local->>Local: Record upstreamFailure
            Local->>Logger: logUpstreamFailures<br/>(summarize all failures)
            Logger-->>Local: Ack
            Local->>Local: writeErrorResponse
            Local-->>Client: Error Response
        end
    end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • pappz

Poem

🐰 A rabbit hops through DNS logs,
With dns_id and request_id tags,
Failovers dance from server to server,
No more silent drops—each failure logged,
Smart upstream tracking makes DNS clever! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title '[client] Try next DNS upstream on SERVFAIL/REFUSED responses' directly describes the main change: modifying DNS upstream failover behavior to retry on SERVFAIL/REFUSED responses.
Description check ✅ Passed The description covers the main changes, includes examples of improved logging output, marks relevant checklist items (bug fix, tests created), and addresses documentation selection. However, the 'Issue ticket number and link' and 'Docs PR URL' sections lack actual content.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link
Copy Markdown

@lixmal lixmal merged commit 269d5d1 into main Jan 23, 2026
37 of 40 checks passed
@lixmal lixmal deleted the dns-upstream-failover branch January 23, 2026 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants