feat(cloudflare): support batch API for DNS record changes #6208
Conversation
Welcome @mooseracerPT!

Hi @mooseracerPT. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I'm not sure about the fix yet. What we should be doing: for all changes, we expect confirmation that they work in practice — see #5085 (comment).
And you need to sign the EasyCLA.
How do you handle errors from the API? Last time I tried, the API only returned the first error and did not apply the batch.
That's still the case: the entire batch transaction is rolled back and the response only reports the first error. We could try falling back to individual one-by-one API calls when a batch fails, but it won't be pretty. I also need to do more work on the chunking logic to handle edge cases where one chunk succeeds but another fails, and multiple operations on the same record(s) were split between them.
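One way to sidestep that chunk-splitting edge case (a sketch, not necessarily what this PR does) is to group operations by record before chunking, so all changes to one record land in the same batch:

```go
package main

import "fmt"

// changeOp is a hypothetical stand-in for a single DNS change;
// the real types in ExternalDNS differ.
type changeOp struct {
	RecordName string
	Action     string // "create", "update", or "delete"
}

// chunkByRecord splits ops into chunks of at most size entries, but keeps
// all operations targeting the same record name in one chunk, so a failed
// chunk can never leave a record with only half of its changes applied.
// (A single record with more than size operations would still produce an
// oversized chunk; real code would need to decide how to handle that.)
func chunkByRecord(ops []changeOp, size int) [][]changeOp {
	order := []string{}
	groups := map[string][]changeOp{}
	for _, op := range ops {
		if _, seen := groups[op.RecordName]; !seen {
			order = append(order, op.RecordName)
		}
		groups[op.RecordName] = append(groups[op.RecordName], op)
	}
	var chunks [][]changeOp
	var cur []changeOp
	for _, name := range order {
		g := groups[name]
		if len(cur) > 0 && len(cur)+len(g) > size {
			chunks = append(chunks, cur)
			cur = nil
		}
		cur = append(cur, g...)
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	ops := []changeOp{
		{"a.example.com", "delete"},
		{"a.example.com", "create"},
		{"b.example.com", "create"},
		{"c.example.com", "update"},
	}
	// With size 2, both a.example.com ops stay together in chunk 0,
	// and b/c share chunk 1.
	for i, c := range chunkByRecord(ops, 2) {
		fmt.Printf("chunk %d: %d ops\n", i, len(c))
	}
}
```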
Force-pushed from aeab3a7 to 167796d
@ivankatliarchuk / @vflaux I've made some significant changes to prevent a bad record config from blocking updates to the whole zone, and I added the new flag as well. The previous build has been running in my own busy cluster without issue, and I'll promote this one tomorrow.
ivankatliarchuk
left a comment
Docs need a review, and --batch-change-interval is required.
/ok-to-test

Still a draft PR
/retitle feat(cloudflare): support batch API for DNS record changes
… flags

Adds two new global CLI flags for controlling batch DNS change behaviour:
- --batch-change-size (default 200): maximum number of DNS operations per batch
- --batch-change-interval (default 1s): pause between consecutive batch chunks

Wires the flags through Config into the Cloudflare provider's DNSRecordsConfig.
…back

Uses Cloudflare's Batch DNS Records API to submit all creates, updates, and deletes for a zone in a single transactional API call per chunk, significantly reducing the total number of requests made against the Cloudflare API.
- Batch size and interval are controlled via --batch-change-size / --batch-change-interval
- Record types unsupported by the batch PUT endpoint (e.g. SRV, CAA) are submitted individually via the standard API
- If a batch chunk is rejected by Cloudflare, ExternalDNS automatically retries each record change in that chunk individually so no changes are silently lost
- Adds cloudflare_batch.go with the core batching logic and full test coverage
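The type-eligibility check described above can be sketched as follows; the unsupported set (SRV, CAA) is taken from the commit message, not from an exhaustive Cloudflare list:

```go
package main

import "fmt"

// supportsBatch reports whether a record type can go through the batch
// PUT endpoint. SRV and CAA fall back to individual API calls, per the
// commit message; other unsupported types may exist.
func supportsBatch(recordType string) bool {
	switch recordType {
	case "SRV", "CAA":
		return false
	default:
		return true
	}
}

func main() {
	for _, t := range []string{"A", "TXT", "SRV", "CAA"} {
		fmt.Printf("%s batchable: %v\n", t, supportsBatch(t))
	}
}
```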
Force-pushed from 03e4e71 to 568133f
@ivankatliarchuk I've removed the quarantine feature and updated the PR comment with the new testing plan + results. I also snuck in a soft error capture, too.
ivankatliarchuk
left a comment
Nice one. Seems fine. What's missing: for all changed lines we expect code coverage. Whether AI generates it or it's written manually doesn't matter. This will help us in case of bugs or follow-up refactorings.
Please execute `make cover-html` and cover all the lines refactored/modified/added.
I did some analysis as well
Most important gaps: the entire fallbackUpdates path (SRV/CAA records) and all buildBatchPutParam type branches except the A-record case have no tests. These are the most logic-heavy uncovered paths.
---
buildBatchPutParam — only the A-record case is tested:
- Lines 193–218: AAAA and CNAME cases
- Lines 219–258: TXT, MX, NS cases
- Lines 259–262: default branch → return nil, false
buildBatchCollections:
- Lines 301–303: recordID == "" guard for updates (record not found)
- Lines 308–310: bc.fallbackUpdates = append(...) — the SRV/CAA fallback path
submitDNSRecordChanges:
- Lines 348–350: time.Sleep(BatchChangeInterval) — needs multiple chunks to trigger
- Lines 353–366: entire fallbackUpdates loop (SRV/CAA individual updates + error path)
fallbackIndividualChanges:
- Lines 414–417: delete "record already gone" guard
- Lines 425–428: update "record unexpectedly not found" guard
tagsFromResponse:
- Line 156: return nil path
---
cloudflare.go — 7 uncovered blocks (all pre-existing code paths newly exercised by refactor)
- Lines 539–541: error return from getDNSRecordsMap
- Lines 543–545: error return from listCustomHostnamesWithPagination
- Lines 549–551: processCustomHostnameChanges failure → failedChange = true
- Lines 516–532: dry-run regional hostname error paths (3 blocks)
---
cloudflare_custom_hostnames.go
- Lines 297–299: processCustomHostnameChanges failure path (failed = true)
I've added more tests to improve code coverage. Addressing your points:

kk. we need a second review /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ivankatliarchuk. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
If you wanna do more improvements, I've opened a PR with some guidelines: #6249. Follow-ups, once this gets merged, can go in separate PRs.

Cool, thanks for the vote of confidence, and thanks for all your time getting this PR up to snuff.
/lgtm |

What does it do?
All Cloudflare record changes are currently propagated via individual API calls, plus a redundant query of the entire zone, per record. With large and/or frequent change sets this will lead to being rate limited. This change is a refactor to instead use Cloudflare's batch API.
I've reduced the number of Cloudflare API calls per sync interval from 3 + 2N to 4 + X, where N is the number of records being changed and X is the number of batches needed. A batch can contain up to 200 records on the free plan and 3500 on paid.
Motivation
external-dns pods getting rate limited and crashing. We run a cluster that has grown to 3000-5000+ records in the zone, with frequent changes of 30-100 records at a time. external-dns is no longer reliably keeping up despite attempts to tweak polling intervals and the max records retrieved.
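To make the 3 + 2N versus 4 + X arithmetic concrete, here is a small sketch using the free-plan batch limit of 200 from the description above:

```go
package main

import "fmt"

// apiCalls compares the old per-record call pattern (3 + 2N) with the
// batched one (4 + X, where X = ceil(N / batchSize)). The formulas come
// from the PR description.
func apiCalls(n, batchSize int) (old, batched int) {
	x := (n + batchSize - 1) / batchSize // ceiling division
	return 3 + 2*n, 4 + x
}

func main() {
	// 30-100 records is the change-set size mentioned in Motivation;
	// 5000 approximates a very large sync.
	for _, n := range []int{30, 100, 5000} {
		old, batched := apiCalls(n, 200)
		fmt.Printf("N=%d: %d calls before, %d after\n", n, old, batched)
	}
}
```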
More
Batch API Error Handling
Failed batch submissions will result in retrying each change as individual API calls. They'll keep getting retried every sync interval, though batch operations are effectively blocked until the broken config is fixed. This is also how batch failures behave in the Google and AWS providers.
Initial service manifests
1. We expect malformed.example.com to cause a failure. The entire batch set is not applied. fallbackIndividualChanges is invoked and the changes are sent through one by one, letting us get a useful error message from the API about what exactly is wrong with which record(s).
2. We make a valid record change but leave the malformed record alone. As long as the malformed record configuration is present, batch attempts will continue to fail, but the fallback allows other changes to continue working.
3. We make another valid record change and also delete the malformed record manifest. The batch succeeds.