fix: race condition problem while updating upstream.nodes #11916
Description
Background
For a route configured with an upstream that uses service discovery, when a request hits the route, it will fetch the latest nodes of the current upstream through `discovery.nodes` and compare them using the `compare_upstream_node` function. Only if the node list has changed is `new_nodes` assigned to `upstream.nodes`. Then we copy `upstream` using `table.clone`, creating a new table that replaces the previous upstream.

Another function worth mentioning is `fill_node_info`, which fills some necessary fields in `upstream.nodes`.
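To make the flow concrete, here is a minimal, self-contained sketch of the update path described above. It is not the actual APISIX source: `compare_upstream_node` and `fill_node_info` are simplified placeholders for the real helpers, `update_upstream` is an illustrative wrapper, and `require("table.clone")` is the OpenResty LuaJIT extension that `core.table.clone` wraps.

```lua
-- Simplified sketch of the flow described above, NOT the actual APISIX code.
local table_clone = require("table.clone")  -- OpenResty LuaJIT extension

-- placeholder: the real compare_upstream_node does a field-by-field comparison
local function compare_upstream_node(old_nodes, new_nodes)
    if old_nodes == nil or #old_nodes ~= #new_nodes then
        return false
    end
    for i, new_node in ipairs(new_nodes) do
        local old_node = old_nodes[i]
        if old_node.host ~= new_node.host
           or old_node.port ~= new_node.port
           or old_node.weight ~= new_node.weight then
            return false
        end
    end
    return true
end

-- placeholder: the real fill_node_info fills required fields such as `priority`
local function fill_node_info(nodes)
    for _, node in ipairs(nodes) do
        node.priority = node.priority or 0
    end
end

-- `upstream` is shared by every in-flight request that hits the same route
local function update_upstream(upstream, discovered_nodes)
    if compare_upstream_node(upstream.nodes, discovered_nodes) then
        return upstream                    -- node list unchanged, nothing to do
    end
    upstream.nodes = discovered_nodes      -- mutates the SHARED table in place
    fill_node_info(upstream.nodes)         -- adds `priority` etc.
    return table_clone(upstream)           -- private copy made only after the mutation
end
```

The detail that matters for what follows is that the shared `upstream` table is modified before `table.clone` runs, so a coroutine yield between those steps lets other requests observe the half-updated state.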
Race Condition

Now let me describe a race condition scenario:
Request A gets `[{"port":80,"weight":50,"host":"10.244.1.33"}]` from `discovery.nodes`. After executing `fill_node_info`, it becomes `{"port":80,"weight":50,"host":"10.244.1.33","priority":0}`. Then some function call triggers a coroutine yield (as tested, `pcall` triggers a yield).

At this point, Request B hits the same route and gets the same `upstream` table as Request A, but fetches new nodes from `discovery.nodes`: `[{"port":80,"weight":50,"host":"10.244.1.34"}]`.

Since the current code updates `upstream.nodes` before calling `table.clone`, when the coroutine switches back to Request A, A's `upstream.nodes` has been modified to `[{"port":80,"weight":50,"host":"10.244.1.34"}]`. These nodes, which lack the `priority` field, cause the code to panic.
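For contrast, here is an illustrative sketch (reusing the placeholder helpers from the sketch above) of an ordering that avoids this interleaving by never mutating the shared table. It is not the actual patch in this PR, only a hedged illustration of the clone-before-update idea.

```lua
-- Illustrative only, NOT the actual patch: clone the shared table first,
-- then do all writes on the request-local copy.
local function update_upstream_safely(upstream, discovered_nodes)
    if compare_upstream_node(upstream.nodes, discovered_nodes) then
        return upstream
    end
    local new_upstream = table_clone(upstream)  -- private copy of the shared table
    new_upstream.nodes = discovered_nodes       -- attach the new nodes to the copy only
    fill_node_info(new_upstream.nodes)          -- `priority` etc. set before anyone else can see them
    return new_upstream
end
```

With this ordering, a yield between any two of these steps can no longer expose half-updated nodes to another request, because concurrent requests keep seeing the old, fully filled `upstream.nodes`.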
This issue can only occur when the nodes obtained through service discovery are continuously changing (e.g., during a rolling update of a Kubernetes deployment) and the gateway is handling a high volume of concurrent requests, so I am unable to provide a valid test case.
Fixes # (issue)
Checklist