Skip to content

Fix issue with peer stream node cleanup.#17235

Merged
hashi-derek merged 3 commits intomainfrom
derekm/NET-3007/fix-peer-stream-cleanup
May 8, 2023
Merged

Fix issue with peer stream node cleanup.#17235
hashi-derek merged 3 commits intomainfrom
derekm/NET-3007/fix-peer-stream-cleanup

Conversation

@hashi-derek
Copy link
Contributor

@hashi-derek hashi-derek commented May 8, 2023

This commit encompasses a few problems that are closely related due to their proximity in the code.

  1. The peerstream utilizes node IDs in several locations to determine which nodes / services / checks should be cleaned up or created. While VM deployments with agents will likely always have a node ID, agentless uses synthetic nodes and does not populate the field. This means that for consul-k8s deployments, all services were likely bundled together into the same synthetic node in some code paths (but not all), resulting in strange behavior. The Node.Node field should be used instead as a unique identifier, as it should always be populated.

  2. The peerstream cleanup process for unused nodes uses an incorrect query for node deregistration. This query is NOT namespace aware and results in the node (and corresponding services) being deregistered prematurely whenever it has zero default-namespace services and 1+ non-default-namespace services registered on it. This issue is tricky to find due to the incorrect logic mentioned in 1, combined with the fact that the affected services must be co-located on the same node as the currently deregistering service for this to be encountered.

  3. The stream tracker did not understand differences between services in different namespaces and could therefore report incorrect numbers. It was updated to utilize the full service name to avoid conflicts and return proper results.

This commit encompasses a few problems that are closely related due to their
proximity in the code.

1. The peerstream utilizes node IDs in several locations to determine which
nodes / services / checks should be cleaned up or created. While VM deployments
with agents will likely always have a node ID, agentless uses synthetic nodes
and does not populate the field. This means that for consul-k8s deployments, all
services were likely bundled together into the same synthetic node in some code
paths (but not all), resulting in strange behavior. The Node.Node field should
be used instead as a unique identifier, as it should always be populated.

2. The peerstream cleanup process for unused nodes uses an incorrect query for
node deregistration. This query is NOT namespace aware and results in the node
(and corresponding services) being deregistered prematurely whenever it has zero
default-namespace services and 1+ non-default-namespace services registered on
it. This issue is tricky to find due to the incorrect logic mentioned in #1,
combined with the fact that the affected services must be co-located on the same
node as the currently deregistering service for this to be encountered.

3. The stream tracker did not understand differences between services in
different namespaces and could therefore report incorrect numbers. It was
updated to utilize the full service name to avoid conflicts and return proper
results.
@hashi-derek hashi-derek added backport/1.14 backport/1.15 This release series is no longer active on CE. Use backport/ent/1.15. labels May 8, 2023
@hashi-derek hashi-derek requested a review from DanStough May 8, 2023 13:26
@hashi-derek hashi-derek marked this pull request as ready for review May 8, 2023 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/1.15 This release series is no longer active on CE. Use backport/ent/1.15.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants