xds: change cdsbalancer to use update from dependency manager #8907

Merged: eshitachandwani merged 18 commits into grpc:master from eshitachandwani:cdsbalancer_use_depmgr on Mar 2, 2026

@eshitachandwani (Member) commented Feb 16, 2026

This PR is part of the A74 changes. It makes the following changes:

  1. Changes the CDS balancer to use the CDS and EDS/DNS updates from the XDSConfig received by xds_resolver instead of starting its own CDS watchers.
  2. Removes the cluster resolver completely.
  3. The CDS balancer now creates the priority config and creates the priority balancer as its child.
  4. Removes the serializer from the CDS balancer. It existed to serialize watcher updates and ClientConn updates; now that there are no watchers and ClientConn updates are already serialized, it is no longer needed.
  5. Updates tests in the cdsbalancer package to check the priority config instead of the cluster_resolver config.
  6. xds_resolver tracks reference counts for weighted clusters and cluster specifier plugins separately. It subscribes to a weighted cluster in the dependency manager when the cluster is referenced for the first time and unsubscribes when its references in xds_resolver drop to zero.
  7. In case of an LDS/RDS resource error, we now directly send an empty service config instead of a complete service config containing the clusters that RPCs still reference.
  8. Minor test fixes.
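
Illustrative sketch for item 6 (not the PR's code): the "subscribe on first reference, unsubscribe at zero" behaviour could look roughly like the following. The type `clusterRefs` and the `subscribe`/`unsubscribe` callbacks are hypothetical names chosen for this example; the actual xds_resolver and dependency manager APIs are not shown in this conversation.

```go
package main

import "fmt"

// clusterRefs is a hypothetical reference counter. It is not taken from the
// PR; it only illustrates "subscribe on first reference, unsubscribe when the
// count drops to zero".
type clusterRefs struct {
	counts      map[string]int
	subscribe   func(name string) // assumed hook into the dependency manager
	unsubscribe func(name string) // assumed hook into the dependency manager
}

func newClusterRefs(subscribe, unsubscribe func(string)) *clusterRefs {
	return &clusterRefs{
		counts:      make(map[string]int),
		subscribe:   subscribe,
		unsubscribe: unsubscribe,
	}
}

// ref records one more reference and subscribes on the first one.
func (r *clusterRefs) ref(name string) {
	if r.counts[name] == 0 {
		r.subscribe(name)
	}
	r.counts[name]++
}

// unref drops one reference and unsubscribes when none remain.
func (r *clusterRefs) unref(name string) {
	n, ok := r.counts[name]
	if !ok {
		return // never referenced; nothing to do
	}
	if n > 1 {
		r.counts[name] = n - 1
		return
	}
	delete(r.counts, name)
	r.unsubscribe(name)
}

func main() {
	refs := newClusterRefs(
		func(name string) { fmt.Println("subscribe", name) },
		func(name string) { fmt.Println("unsubscribe", name) },
	)
	refs.ref("cluster-a")   // prints "subscribe cluster-a"
	refs.ref("cluster-a")   // no output: already subscribed
	refs.unref("cluster-a") // no output: one reference remains
	refs.unref("cluster-a") // prints "unsubscribe cluster-a"
}
```

Keeping one such counter per kind of reference (weighted clusters, cluster specifier plugins) would match the "separately" wording in item 6, though how the PR structures this is an assumption here.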

RELEASE NOTES:

  • xds:
    • Ambient errors for cluster resources are now logged exclusively in the dependency manager and are no longer propagated to Load Balancing (LB) policies.
    • When re-resolution is requested, all clusters of type LOGICAL_DNS will be re-resolved simultaneously, rather than only a single cluster.
    • Upon receipt of a listener or route resource error, all in-flight RPCs will now fail immediately.
    • Any error encountered during the creation or update of a priority configuration will now transition the channel to a TRANSIENT_FAILURE state.

@eshitachandwani added this to the 1.80 Release milestone Feb 16, 2026
@eshitachandwani added the Type: Feature and Area: xDS labels Feb 16, 2026
@codecov (Bot) commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 80.40000% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.38%. Comparing base (944f058) to head (700015c).
⚠️ Report is 16 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/xds/balancer/cdsbalancer/cdsbalancer.go | 69.28% | 30 Missing and 13 partials ⚠️ |
| internal/xds/balancer/cdsbalancer/configbuilder.go | 95.12% | 2 Missing ⚠️ |
| internal/xds/resolver/serviceconfig.go | 90.00% | 1 Missing and 1 partial ⚠️ |
| internal/xds/xdsdepmgr/xds_dependency_manager.go | 91.66% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8907      +/-   ##
==========================================
+ Coverage   80.74%   83.38%   +2.63%     
==========================================
  Files         416      410       -6     
  Lines       33434    32570     -864     
==========================================
+ Hits        26996    27157     +161     
+ Misses       4641     4038     -603     
+ Partials     1797     1375     -422     

| Files with missing lines | Coverage Δ |
|---|---|
| ...ds/balancer/cdsbalancer/configbuilder_childname.go | 100.00% <ø> (ø) |
| ...rnal/xds/balancer/clustermanager/clustermanager.go | 80.95% <100.00%> (+9.52%) ⬆️ |
| internal/xds/resolver/xds_resolver.go | 91.87% <100.00%> (+3.16%) ⬆️ |
| internal/xds/balancer/cdsbalancer/configbuilder.go | 91.94% <95.12%> (ø) |
| internal/xds/resolver/serviceconfig.go | 86.39% <90.00%> (-1.76%) ⬇️ |
| internal/xds/xdsdepmgr/xds_dependency_manager.go | 88.27% <91.66%> (+18.96%) ⬆️ |
| internal/xds/balancer/cdsbalancer/cdsbalancer.go | 69.78% <69.28%> (-16.48%) ⬇️ |

... and 52 files with indirect coverage changes

@easwars easwars assigned eshitachandwani and unassigned easwars Feb 18, 2026
@eshitachandwani (Member, Author) commented:

Regarding all the comments about eating the error: we were doing that because it matched the existing behaviour of the cluster resolver, which ate errors instead of returning them or putting the channel in TRANSIENT_FAILURE (TF). After an offline discussion, we have decided to change this behaviour: any error in the CDS LB policy, i.e. any error caused by the CDS balancer rejecting the update, is now returned from UpdateClientConnState, effectively moving that channel into TF.

@easwars (Contributor) left a comment:

I see that you have replied saying "done" on a bunch of comments, but I don't see it being done. Are you missing some commits here?

@easwars easwars assigned eshitachandwani and unassigned easwars Feb 24, 2026
@eshitachandwani (Member, Author) commented:

> I see that you have replied saying "done" on a bunch of comments, but I don't see it being done. Are you missing some commits here?

I had pushed all the changes and can see them reflected in the PR. Can you please try again and let me know if the older and current changes are still not visible?

@easwars (Contributor) left a comment:

Mostly LGTM at this point. Will make another pass shortly.

@easwars (Contributor) left a comment:

LGTM. Just minor nits.

Comment on lines 111 to 112
// Start an xDS management server that pushes the EDS resource names onto a
// channel when requested.
Contributor:

This comment needs to be updated.

Member Author:

Done.

Comment on lines +130 to +131
// Check if we have a request for both EDS resources. If so, fire
// the event to unblock the test.
Contributor:

Nit: What this event signifies should ideally be documented at the place where the event is defined.

Member Author:

Added.
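
A rough sketch of the pattern discussed in this thread, not the test's actual code: a request callback fires a one-shot event once both EDS resources have been requested, which unblocks the test. The `event` type is a small stand-in written for this example (grpcsync is internal to grpc-go), and `onEDSRequest` and the resource names are hypothetical.

```go
package main

import (
	"fmt"
	"slices"
	"sync"
)

// event is a hypothetical one-shot signal used only in this sketch.
type event struct {
	once sync.Once
	c    chan struct{}
}

func newEvent() *event { return &event{c: make(chan struct{})} }

// Fire closes the channel the first time it is called; later calls are no-ops.
func (e *event) Fire() { e.once.Do(func() { close(e.c) }) }

// HasFired reports whether Fire has been called.
func (e *event) HasFired() bool {
	select {
	case <-e.c:
		return true
	default:
		return false
	}
}

// onEDSRequest fires allRequested once a single request names every wanted
// resource. wanted must already be sorted.
func onEDSRequest(requested, wanted []string, allRequested *event) {
	got := slices.Clone(requested)
	slices.Sort(got)
	if slices.Equal(got, wanted) {
		allRequested.Fire()
	}
}

func main() {
	wanted := []string{"eds-1", "eds-2"}
	// allRequested fires when a request for both EDS resources is seen.
	allRequested := newEvent()

	onEDSRequest([]string{"eds-1"}, wanted, allRequested)
	fmt.Println(allRequested.HasFired()) // false

	onEDSRequest([]string{"eds-2", "eds-1"}, wanted, allRequested)
	fmt.Println(allRequested.HasFired()) // true
}
```

Documenting the event at its declaration, as the comment on `allRequested` does here, is what the nit above asks for.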

watchers map[string]*watcherState // Set of watchers and associated state, keyed by cluster name.
lbCfg *lbConfig // Current load balancing configuration.
childLB balancer.Balancer // Child policy, built upon resolution of the cluster graph.
xdsClient xdsclient.XDSClient // xDS client to watch Cluster resources.
Contributor:

Nit: Clarify in this comment that this is only for dynamic subscriptions.

Member Author:

This is not needed for dynamic subscriptions, but it is passed so that cluster_impl can use it for load reporting. Here it is mainly used to get the nodeID, so I removed the comment.

lbCfg *lbConfig // Current load balancing configuration.
childLB balancer.Balancer // Child policy, built upon resolution of the cluster graph.
xdsClient xdsclient.XDSClient // xDS client to watch Cluster resources.
clusterConfigs map[string]*xdsresource.ClusterResult // Map of cluster name to the last received result for that cluster.
Contributor:

Nit: Keep these trailing comments short and sweet.

Something like:

  • Cluster name to the last received result for that cluster
  • Hostname to priority config for that leaf cluster
    ...
  • For dynamic cluster unsubscription
  • True if a dynamic cluster has been subscribed to

etc .. etc

Member Author:

Done.

Contributor:

Nit: Please remove mentions of the serializer as it does not exist anymore.

Member Author:

Done.

@@ -168,16 +162,22 @@ type cdsBalancer struct {

xdsHIPtr *unsafe.Pointer // Accessed atomically.
Contributor:

It would be nice to state why this needs to be accessed atomically.

Member Author:

Since this will move to clusterimpl, will change the comment there.

xdsClient xdsclient.XDSClient // xDS client to watch Cluster resources.
watchers map[string]*watcherState // Set of watchers and associated state, keyed by cluster name.
lbCfg *lbConfig // Current load balancing configuration.
childLB balancer.Balancer // Child policy, built upon resolution of the cluster graph.
Contributor:

Also, it would be nice to state that all fields below are accessed only from methods on the balancer.Balancer interface. And because grpc guarantees that those methods will never be called concurrently, we don't need any extra serialization for accessing these fields.

Member Author:

Done.

Comment on lines +133 to +134
sortedNames := make([]string, len(names))
copy(sortedNames, names)
Contributor:

I think this can be replaced with a single line if you use slices.Clone.

Member Author:

Done.

names := req.GetResourceNames()
sortedNames := make([]string, len(names))
copy(sortedNames, names)
sort.Strings(sortedNames)
Contributor:

Nit: Use slices.Sort here as well? Gemini says it's better than sort.Strings(). I know it doesn't matter here. But whatever.

Member Author:

done.
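
For reference, a minimal sketch of what the two suggestions (slices.Clone in the previous thread, slices.Sort here) amount to when combined. The helper name `sortedCopy` and the sample names are made up for this example; the PR's final code is not shown in this conversation.

```go
package main

import (
	"fmt"
	"slices"
)

// sortedCopy returns a sorted copy of names and leaves the input untouched.
// It replaces the make + copy + sort.Strings sequence with two calls from the
// standard slices package (Go 1.21+).
func sortedCopy(names []string) []string {
	sorted := slices.Clone(names)
	slices.Sort(sorted)
	return sorted
}

func main() {
	names := []string{"cluster-2", "cluster-1"}
	fmt.Println(sortedCopy(names)) // [cluster-1 cluster-2]
	fmt.Println(names)             // [cluster-2 cluster-1]
}
```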

const clusterName1 = clusterName + "-cluster-1"
const clusterName2 = clusterName + "-cluster-2"

gotEDSRequests := grpcsync.NewEvent()
Contributor:

This probably needs a better name, and a comment.

Member Author:

Done.

@easwars easwars assigned eshitachandwani and unassigned easwars Feb 27, 2026
@easwars (Contributor) commented Feb 27, 2026

/gemini review

@gemini-code-assist (Bot, Contributor) left a comment:

Code Review

This pull request introduces a significant and well-executed refactoring of the xDS implementation, primarily focusing on the CDS balancer. The core change involves centralizing resource watching logic within the dependency manager, which simplifies the CDS balancer by removing its responsibility for managing its own resource watchers. As part of this, the cluster_resolver balancer is completely removed, and the CDS balancer now uses the priority balancer as its child. The error handling for xDS resources has also been improved, making the system more robust. The tests have been thoroughly updated to reflect these architectural changes. My review includes one suggestion to improve the robustness of a test case.

@eshitachandwani eshitachandwani merged commit 1d4fa8a into grpc:master Mar 2, 2026
14 checks passed
This was referenced Apr 23, 2026
2 participants