grpc: ensure grpc resolver correctly uses lan/wan addresses on servers #17270
Merged
rboyer (Member, Author) commented May 9, 2023
agent/peering_endpoint_test.go (outdated)

    })

    // Check them all for the bad error
    const grpcError = `failed to find Consul server for global address`

This error would periodically show up when executing various API calls that ultimately used gRPC to fulfill them under the covers.
rboyer (Member, Author) commented May 9, 2023
agent/router/router.go (outdated)

    // addServer does the work of AddServer once the write lock is held.
    func (r *Router) addServer(areaID types.AreaID, area *areaInfo, s *metadata.Server) error {
        if areaID == types.AreaLAN {

As much as I want to add this prevention, there's nothing stopping someone from having a node named blah.dc1 in dc1, which would have a WAN full name of blah.dc1.dc1. At least 2 tests broke on this safety check.
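To illustrate the ambiguity, here is a minimal sketch. It assumes the WAN full name is formed as `<node name>.<datacenter>`, as the example in the comment implies. `wanName` and `looksLikeWANName` are hypothetical helpers, not functions from the Consul codebase, and the safety check that was actually tried may have looked different.

```go
package main

import (
	"fmt"
	"strings"
)

// wanName builds a WAN full name, assuming the "<node>.<datacenter>"
// convention implied by the comment above.
func wanName(node, dc string) string {
	return node + "." + dc
}

// looksLikeWANName is a hypothetical safety check: it treats any name that
// ends in ".<datacenter>" as a WAN-style name.
func looksLikeWANName(name, dc string) bool {
	return strings.HasSuffix(name, "."+dc)
}

func main() {
	// A legitimate LAN node name that happens to end in ".dc1".
	lan := "blah.dc1"

	fmt.Println(looksLikeWANName(lan, "dc1")) // true: a valid LAN name is flagged
	fmt.Println(wanName(lan, "dc1"))          // blah.dc1.dc1
}
```

A suffix test alone cannot tell a LAN name containing a dot from a WAN full name, which is consistent with the tests breaking once the check was added.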
kisunji reviewed May 10, 2023
This reverts commit 0dc5bbd.
kisunji (Contributor) reviewed May 10, 2023

    testutil.RunStep(t, "no server experienced the server resolution error", func(t *testing.T) {
        // Check them all for the bad error
        const grpcError = `failed to find Consul server for global address`

Should we const-ify this in the code? The message might get reworded by some UX initiative, which would make this check never fail as a side effect.
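A minimal sketch of what that suggestion could look like: the message is declared once and referenced by both the code that produces the error and the check that scans for it. The names `grpcServerNotFoundMsg`, `lookupServer`, and `containsResolutionError` are hypothetical and do not come from the Consul codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// grpcServerNotFoundMsg is a hypothetical shared constant holding the message
// quoted in the test. Production code and tests would both reference it, so a
// rewording cannot silently turn the test into a no-op.
const grpcServerNotFoundMsg = "failed to find Consul server for global address"

// lookupServer stands in for the code path that produces the error.
func lookupServer(addr string) error {
	return fmt.Errorf("%s %q", grpcServerNotFoundMsg, addr)
}

// containsResolutionError is the kind of check a log-scanning test could use.
func containsResolutionError(logLine string) bool {
	return strings.Contains(logLine, grpcServerNotFoundMsg)
}

func main() {
	err := lookupServer("server1.dc1")
	fmt.Println(containsResolutionError(err.Error())) // true
}
```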
kisunji approved these changes May 10, 2023
Description
The grpc resolver implementation is fed from changes to the router.Router. Within the router there is a map of various areas storing the addressing information for servers in those areas. All map entries are of the WAN variety except a single special entry for the LAN.
Addressing information in the LAN "area" consists of local addresses intended for use when making a client-to-server or server-to-server request.
The client agent correctly updates this LAN area when receiving LAN serf events, so by extension the grpc resolver works fine in that scenario.
The server agent only initially populates a single entry in the LAN area (for itself) on startup, and then never mutates that area map again. For normal RPCs a different structure is used for LAN routing.
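The state described above can be pictured with a toy model. This is only a sketch: the `router` type, the `"wan"` area key, and the addresses are illustrative stand-ins rather than the real `router.Router` internals, and it assumes WAN serf events still populate the WAN entries for servers in the local datacenter.

```go
package main

import "fmt"

// AreaID identifies a network area; areaLAN is the single special LAN entry.
type AreaID string

const areaLAN AreaID = "lan"

// router is a toy stand-in for router.Router: area -> server name -> address.
type router struct {
	areas map[AreaID]map[string]string
}

func newRouter() *router {
	return &router{areas: map[AreaID]map[string]string{areaLAN: {}}}
}

// addServer records a server address in the given area.
func (r *router) addServer(area AreaID, name, addr string) {
	if r.areas[area] == nil {
		r.areas[area] = map[string]string{}
	}
	r.areas[area][name] = addr
}

func main() {
	// A server agent before the fix: it registers only itself in the LAN
	// area at startup and never touches that area again.
	srv := newRouter()
	srv.addServer(areaLAN, "server1", "10.0.0.1:8300")

	// Assumed for illustration: WAN entries exist for every server.
	srv.addServer("wan", "server1.dc1", "203.0.113.1:8300")
	srv.addServer("wan", "server2.dc1", "203.0.113.2:8300")

	fmt.Println(len(srv.areas[areaLAN]), len(srv.areas["wan"])) // 1 2
}
```

In this model the LAN area on a server knows about one peer while the WAN area knows about all of them, which is the imbalance the description refers to.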
Additionally, when selecting a server to contact in the local datacenter, the resolver randomly selects addresses from either the LAN-addressed or WAN-addressed entries in the map.
Unfortunately this means that the grpc resolver stack as it exists on server agents is either broken or only accidentally functions by having servers dial each other over the WAN-accessible address. If the operator disables the serf WAN port completely, this incidental functioning would likely break.
This PR enforces that local requests for servers (both stale reads and leader-forwarded requests) exclusively use the LAN "area" information, and also fixes servers so that they keep that area up to date in the router.
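The difference between the old and new selection behavior can be sketched as follows. This is an illustration under assumptions, not the code from this PR: `server`, `pickAny`, and `pickLAN` are hypothetical names, and the real resolver works on the router's area map rather than plain slices.

```go
package main

import (
	"fmt"
	"math/rand"
)

// server is a simplified record of one routable server entry.
type server struct {
	Name string
	Addr string
	DC   string
}

// pickAny mimics the old behavior: LAN and WAN entries for the requested
// datacenter are pooled, so a WAN address may be chosen for a local request.
func pickAny(lan, wan []server, dc string) (server, bool) {
	var candidates []server
	for _, s := range append(append([]server{}, lan...), wan...) {
		if s.DC == dc {
			candidates = append(candidates, s)
		}
	}
	if len(candidates) == 0 {
		return server{}, false
	}
	return candidates[rand.Intn(len(candidates))], true
}

// pickLAN mimics the intended behavior: requests for the local datacenter
// only ever consider entries from the LAN area.
func pickLAN(lan []server, dc string) (server, bool) {
	return pickAny(lan, nil, dc)
}

func main() {
	lan := []server{
		{Name: "server1", Addr: "10.0.0.1:8300", DC: "dc1"},
		{Name: "server2", Addr: "10.0.0.2:8300", DC: "dc1"},
	}
	wan := []server{
		{Name: "server1.dc1", Addr: "203.0.113.1:8300", DC: "dc1"},
		{Name: "server2.dc1", Addr: "203.0.113.2:8300", DC: "dc1"},
	}

	if s, ok := pickAny(lan, wan, "dc1"); ok {
		fmt.Println("mixed pool picked:", s.Addr) // may be a LAN or a WAN address
	}
	if s, ok := pickLAN(lan, "dc1"); ok {
		fmt.Println("LAN-only picked:", s.Addr) // always a 10.0.0.x address here
	}
}
```

Restricting the candidate pool only helps if the LAN area is complete, which is why the PR pairs it with keeping that area up to date on servers.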
A test for the grpc resolver logic was added, as well as a higher level full-stack test to ensure the externally perceived bug does not return.