
Conversation


@nic-6443 nic-6443 commented Jan 15, 2025

Description

Background

For a route whose upstream is populated by service discovery, each request that hits the route fetches the latest nodes of that upstream via discovery.nodes and compares them with the current ones using the compare_upstream_node function. Only if the node list has changed is new_nodes assigned to upstream.nodes. The upstream is then copied with table.clone, and the new table replaces the previous upstream.
Another function worth mentioning is fill_node_info, which fills in some required fields on upstream.nodes.
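The update path described above can be sketched roughly as follows. This is a paraphrase, not the exact APISIX source: up_conf, dis, and ctx are placeholder names for the upstream configuration, the discovery client, and the request context.

```lua
-- Simplified sketch of the node-update path, showing the buggy ordering.
local new_nodes = dis.nodes(up_conf.service_name)      -- discovery.nodes
if not compare_upstream_node(up_conf, new_nodes) then
    -- The shared table is mutated first ...
    up_conf.nodes = new_nodes
    -- ... and only then cloned. Any request that yielded while holding a
    -- reference to up_conf now sees the raw node list, before
    -- fill_node_info has added fields such as priority.
    local new_up_conf = table.clone(up_conf)
    ctx.upstream_conf = new_up_conf
end
```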

Race Condition

Now let me describe a race condition scenario:
Request A gets [{"port":80,"weight":50,"host":"10.244.1.33"}] from discovery.nodes. After fill_node_info executes, the node becomes {"port":80,"weight":50,"host":"10.244.1.33","priority":0}.
Then some function call triggers a coroutine yield (as tested, pcall triggers a yield).
At this point, Request B hits the same route and sees the same upstream table as Request A, but gets new nodes [{"port":80,"weight":50,"host":"10.244.1.34"}] from discovery.nodes.
Since the current code updates upstream.nodes before calling table.clone, by the time the coroutine switches back to Request A, A's upstream.nodes has already been replaced with [{"port":80,"weight":50,"host":"10.244.1.34"}].
These nodes have not been through fill_node_info yet, so the missing priority field makes the code panic:

2024/09/04 14:39:59 [error] 47#47: *739315077 lua entry thread aborted: runtime error: /usr/local/apisix/apisix/balancer.lua:55: table index is nil
stack traceback:
coroutine 0:
        /usr/local/apisix/apisix/balancer.lua: in function 'transform_node'
        /usr/local/apisix/apisix/balancer.lua:80: in function 'fetch_health_nodes'
        /usr/local/apisix/apisix/balancer.lua:115: in function 'create_obj_fun'
        /usr/local/apisix/apisix/core/lrucache.lua:95: in function 'lrucache_server_picker'
        /usr/local/apisix/apisix/balancer.lua:247: in function 'pick_server'
        /usr/local/apisix/apisix/init.lua:498: in function 'handle_upstream'
        /usr/local/apisix/apisix/init.lua:680: in function 'http_access_phase'
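The fix implied by the description is to reverse the ordering: clone the upstream table first, then mutate only the private clone, so a request that yielded while holding the old table keeps seeing its fully filled-in nodes. A sketch with the same paraphrased placeholder names, not the literal diff:

```lua
-- Safe ordering: copy first, then mutate only the private copy.
local new_up_conf = table.clone(up_conf)   -- clone the shared table
new_up_conf.nodes = new_nodes              -- assign nodes on the clone only
fill_node_info(new_up_conf)                -- add priority etc. on the clone
ctx.upstream_conf = new_up_conf            -- publish the finished table
```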

This issue can only occur when the nodes obtained through service discovery are continuously changing (e.g., during a rolling update of a Kubernetes deployment) while the gateway is handling highly concurrent traffic, so I am unable to provide a valid test case.
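For readers who want to see the interleaving without spinning up APISIX, here is a hypothetical pure-Lua simulation of the scenario (no ngx involved): request A fills its nodes and yields, request B swaps the shared node list, and A wakes up holding nodes that have no priority field.

```lua
-- Pure-Lua simulation of the race: A yields after fill_node_info,
-- B replaces the shared node list, A resumes and reads unfilled nodes.
local upstream = {}
local observed = {}

local function fill_node_info(nodes)
    for _, n in ipairs(nodes) do n.priority = 0 end
end

local A = coroutine.create(function()
    upstream.nodes = { { host = "10.244.1.33", port = 80, weight = 50 } }
    fill_node_info(upstream.nodes)   -- A's nodes now carry priority = 0
    coroutine.yield()                -- stands in for a pcall that yields
    -- After resuming, upstream.nodes has been replaced by request B:
    observed.host = upstream.nodes[1].host
    observed.priority = upstream.nodes[1].priority
end)

coroutine.resume(A)  -- A runs up to the yield
-- Request B: new discovery result assigned before any clone (the bug)
upstream.nodes = { { host = "10.244.1.34", port = 80, weight = 50 } }
coroutine.resume(A)  -- A now sees B's raw nodes

print(observed.host, observed.priority)  -- 10.244.1.34    nil
```

The nil priority observed here is exactly the value that ends up as a table index in balancer.lua, producing the "table index is nil" error above.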

Fixes # (issue)

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels Jan 15, 2025
@nic-6443 nic-6443 marked this pull request as draft January 15, 2025 09:46
Signed-off-by: Nic <[email protected]>
@nic-6443 nic-6443 marked this pull request as ready for review January 22, 2025 03:42

@membphis membphis left a comment


LGTM

@nic-6443 nic-6443 merged commit 2ddfc92 into apache:master Feb 7, 2025
30 checks passed
@nic-6443 nic-6443 deleted the nic/fix-race-condition branch February 7, 2025 06:41