Allocator startup can perform many raft writes #1286

Closed
aaronlehmann opened this issue Aug 1, 2016 · 0 comments
@aaronlehmann (Collaborator)

In doNetworkInit, allocations of networks, nodes, and services aren't batched. Loading a swarm state with thousands of nodes appears to result in many raft writes from doNetworkInit's calls to allocateNode. This could block for a long time in a multi-manager setup where writes need to be acknowledged by a quorum of managers.

cc @mrjana
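A natural fix for this is to wrap the startup allocations in a store batch, so that many object updates are grouped into a few transactions (and therefore a few raft proposals) rather than one write per object. Below is a minimal sketch of that shape using swarmkit's store.Batch / batch.Update API; the function name allocateNodesBatched and the mutation inside the closure are illustrative, not the allocator's actual code:

```go
package allocator

import (
	"github.com/docker/swarmkit/api"
	"github.com/docker/swarmkit/manager/state/store"
)

// allocateNodesBatched (hypothetical name) coalesces per-node allocation
// writes into a single store.Batch, so thousands of node updates are
// committed through a handful of raft proposals instead of one per node.
func allocateNodesBatched(s *store.MemoryStore, nodes []*api.Node) error {
	return s.Batch(func(batch *store.Batch) error {
		for _, node := range nodes {
			node := node // capture the loop variable for the closure
			err := batch.Update(func(tx store.Tx) error {
				// ...perform the per-node allocation mutation here
				// (illustrative; the real allocator does more work)...
				return store.UpdateNode(tx, node)
			})
			if err != nil {
				return err
			}
		}
		return nil
	})
}
```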

aaronlehmann added a commit to aaronlehmann/swarmkit that referenced this issue Aug 1, 2016
When loading a state that contained large numbers of nodes and tasks,
but no ready nodes that could accept the tasks, swarmd used large
amounts of CPU repeatedly trying to schedule the full set of tasks. The
allocator caused many commits on startup (see moby#1286), and this produced
a large backlog of commit events, each one of which caused a full
scheduling pass.

To avoid this pathological behavior, debounce the commit events,
similarly to how the dispatcher's Tasks loop debounces events. When a
commit event is received, it starts a 50 ms countdown; the scheduling
pass runs only if no further commit event arrives before the countdown
expires. If commit events keep arriving and resetting this timer, the
scheduler runs the scheduling pass anyway after a second.

Signed-off-by: Aaron Lehmann <[email protected]>
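For illustration, here is a rough Go sketch of the debouncing described in this commit message: each commit event re-arms a 50 ms quiet-period timer, and a one-second deadline caps how long a burst of events can defer the pass. The channel and function names are hypothetical; this is not swarmkit's actual scheduler code:

```go
package scheduler

import "time"

const (
	commitDebounceGap = 50 * time.Millisecond // quiet period after the most recent commit
	maxLatency        = time.Second           // hard cap on how long a pass can be deferred
)

// debounceCommits runs runPass once commits have been quiet for 50 ms, or
// once a full second has elapsed since the first deferred commit event,
// whichever comes first.
func debounceCommits(commits <-chan struct{}, runPass func(), stop <-chan struct{}) {
	var (
		quiet    <-chan time.Time // re-armed on every commit event
		deadline <-chan time.Time // armed on the first commit of a burst
	)
	for {
		select {
		case <-commits:
			quiet = time.After(commitDebounceGap)
			if deadline == nil {
				deadline = time.After(maxLatency)
			}
		case <-quiet: // a nil channel never fires, so this is inert between bursts
			runPass()
			quiet, deadline = nil, nil
		case <-deadline:
			runPass()
			quiet, deadline = nil, nil
		case <-stop:
			return
		}
	}
}
```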
aaronlehmann added a commit that referenced this issue Aug 1, 2016
When loading a state that contained large numbers of nodes and tasks,
but no ready nodes that could accept the tasks, swarmd used large
amounts of CPU repeatedly trying to schedule the full set of tasks. The
allocator caused many commits on startup (see #1286), and this produced
a large backlog of commit events, each one of which caused a full
scheduling pass.

To avoid this pathological behavior, debounce the commit events,
similarly to how the dispatcher's Tasks loop debounces events. When a
commit event is received, it starts a 50 ms countdown; the scheduling
pass runs only if no further commit event arrives before the countdown
expires. If commit events keep arriving and resetting this timer, the
scheduler runs the scheduling pass anyway after a second.

Signed-off-by: Aaron Lehmann <[email protected]>
(cherry picked from commit 77c62db)