
Conversation


@nic-6443 nic-6443 commented Jan 15, 2025

Description

Background

For a route whose upstream is populated by service discovery, each request that hits the route fetches the latest nodes of that upstream via discovery.nodes and compares them with the current ones using the compare_upstream_node function. Only if the node list has changed is new_nodes assigned to upstream.nodes. The upstream is then copied with table.clone, and the new table replaces the previous upstream.
Another function worth mentioning is fill_node_info, which fills in some required fields on upstream.nodes.
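The update path described above can be sketched roughly as follows. This is a paraphrase, not the exact APISIX source: up_conf, dis, and ctx are placeholder names for the upstream configuration, the discovery client, and the request context.

```lua
-- Simplified sketch of the node-update path, showing the buggy ordering.
local new_nodes = dis.nodes(up_conf.service_name)      -- discovery.nodes
if not compare_upstream_node(up_conf, new_nodes) then
    -- The shared table is mutated first ...
    up_conf.nodes = new_nodes
    -- ... and only then cloned. Any request that yielded while holding a
    -- reference to up_conf now sees the raw node list, before
    -- fill_node_info has added fields such as priority.
    local new_up_conf = table.clone(up_conf)
    ctx.upstream_conf = new_up_conf
end
```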

Race Condition

Now let me describe a race condition scenario:
Request A gets [{"port":80,"weight":50,"host":"10.244.1.33"}] from discovery.nodes. After fill_node_info executes, the node becomes {"port":80,"weight":50,"host":"10.244.1.33","priority":0}.
Then some function call triggers a coroutine yield (as tested, pcall triggers a yield).
At this point, Request B hits the same route and sees the same upstream table as Request A, but gets new nodes [{"port":80,"weight":50,"host":"10.244.1.34"}] from discovery.nodes.
Since the current code updates upstream.nodes before calling table.clone, by the time the coroutine switches back to Request A, A's upstream.nodes has already been replaced with [{"port":80,"weight":50,"host":"10.244.1.34"}].
These nodes have not been through fill_node_info yet, so the missing priority field makes the code panic:

2024/09/04 14:39:59 [error] 47#47: *739315077 lua entry thread aborted: runtime error: /usr/local/apisix/apisix/balancer.lua:55: table index is nil
stack traceback:
coroutine 0:
        /usr/local/apisix/apisix/balancer.lua: in function 'transform_node'
        /usr/local/apisix/apisix/balancer.lua:80: in function 'fetch_health_nodes'
        /usr/local/apisix/apisix/balancer.lua:115: in function 'create_obj_fun'
        /usr/local/apisix/apisix/core/lrucache.lua:95: in function 'lrucache_server_picker'
        /usr/local/apisix/apisix/balancer.lua:247: in function 'pick_server'
        /usr/local/apisix/apisix/init.lua:498: in function 'handle_upstream'
        /usr/local/apisix/apisix/init.lua:680: in function 'http_access_phase'
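The fix implied by the description is to reverse the ordering: clone the upstream table first, then mutate only the private clone, so a request that yielded while holding the old table keeps seeing its fully filled-in nodes. A sketch with the same paraphrased placeholder names, not the literal diff:

```lua
-- Safe ordering: copy first, then mutate only the private copy.
local new_up_conf = table.clone(up_conf)   -- clone the shared table
new_up_conf.nodes = new_nodes              -- assign nodes on the clone only
fill_node_info(new_up_conf)                -- add priority etc. on the clone
ctx.upstream_conf = new_up_conf            -- publish the finished table
```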

This issue can only occur when the nodes obtained through service discovery are continuously changing (e.g., during a rolling update of a Kubernetes deployment) while the gateway is handling highly concurrent traffic, so I am unable to provide a valid test case.
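For readers who want to see the interleaving without spinning up APISIX, here is a hypothetical pure-Lua simulation of the scenario (no ngx involved): request A fills its nodes and yields, request B swaps the shared node list, and A wakes up holding nodes that have no priority field.

```lua
-- Pure-Lua simulation of the race: A yields after fill_node_info,
-- B replaces the shared node list, A resumes and reads unfilled nodes.
local upstream = {}
local observed = {}

local function fill_node_info(nodes)
    for _, n in ipairs(nodes) do n.priority = 0 end
end

local A = coroutine.create(function()
    upstream.nodes = { { host = "10.244.1.33", port = 80, weight = 50 } }
    fill_node_info(upstream.nodes)   -- A's nodes now carry priority = 0
    coroutine.yield()                -- stands in for a pcall that yields
    -- After resuming, upstream.nodes has been replaced by request B:
    observed.host = upstream.nodes[1].host
    observed.priority = upstream.nodes[1].priority
end)

coroutine.resume(A)  -- A runs up to the yield
-- Request B: new discovery result assigned before any clone (the bug)
upstream.nodes = { { host = "10.244.1.34", port = 80, weight = 50 } }
coroutine.resume(A)  -- A now sees B's raw nodes

print(observed.host, observed.priority)  -- 10.244.1.34    nil
```

The nil priority observed here is exactly the value that ends up as a table index in balancer.lua, producing the "table index is nil" error above.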

Fixes # (issue)

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels Jan 15, 2025
@nic-6443 nic-6443 marked this pull request as draft January 15, 2025 09:46
Signed-off-by: Nic <[email protected]>
@nic-6443 nic-6443 marked this pull request as ready for review January 22, 2025 03:42

@membphis membphis left a comment


LGTM

@nic-6443 nic-6443 merged commit 2ddfc92 into apache:master Feb 7, 2025
30 checks passed
@nic-6443 nic-6443 deleted the nic/fix-race-condition branch February 7, 2025 06:41