Use an artificial timebase for raft tests #255

Merged
aaronlehmann merged 2 commits into moby:master on Apr 7, 2016

Conversation

aaronlehmann
Collaborator

Instead of running the raft tick on a real time clock and using sleeps
to wait for time to elapse, use github.com/pivotal-golang/clock to fake
a timebase that we can advance at will.

When we want to advance the raft state machine's view of time, call the
Increment function to fake it.

When we want to wait for something async to happen, use polling so we
can catch it quickly without a fixed sleep.
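
To illustrate the pattern, here is a minimal sketch (hypothetical test code, not part of this PR) using the fakeclock package; WaitForWatcherAndIncrement is used instead of a bare Increment so the test does not race the ticker's registration:

package raft_test

import (
	"testing"
	"time"

	"github.com/pivotal-golang/clock/fakeclock"
)

func TestFakeTimebaseSketch(t *testing.T) {
	// Start a fake clock at an arbitrary instant.
	clk := fakeclock.NewFakeClock(time.Now())

	// The code under test receives clk and calls NewTicker on it,
	// exactly as it would on a real clock.
	ticker := clk.NewTicker(time.Second)
	defer ticker.Stop()

	// Advance the timebase at will; no real second ever elapses.
	for i := 0; i < 3; i++ {
		clk.WaitForWatcherAndIncrement(time.Second)
		select {
		case <-ticker.C():
		case <-time.After(time.Second):
			t.Fatal("tick did not fire")
		}
	}
}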

Before: ok github.com/docker/swarm-v2/manager/state 20.064s
After: ok github.com/docker/swarm-v2/manager/state 1.257s

This should also hopefully make the tests more deterministic.

Fixes #182

cc @abronan @stevvooe

@aaronlehmann aaronlehmann changed the title Use an artificial timebase for raft tests [WIP] Use an artificial timebase for raft tests Apr 6, 2016
@aaronlehmann
Collaborator Author

Changed to WIP. Just noticed that the checks on listMembers can race with replaying the raft logs and cause tests to fail. This is not new, but it's much easier to trigger with the accelerated clock. Will have to add polling for those checks.

This may explain why the tests seem to be hanging in CI. I haven't been able to reproduce this behavior locally.

@aaronlehmann
Collaborator Author

Pushed some fixes for problems exposed by this change. Some of them are actual issues with the raft code.

  • Add locking around the shutdown process to prevent asynchronous RPCs from crashing. Fixes #148 (Race and panic in ProcessRaftMessage); a sketch of the pattern follows below.
  • Close open GRPC connections when shutting down a raft member. Avoids leaking those connections in the tests.
  • Make waitForCluster wait until all members have applied the same portion of the log to their state machine.
  • TestRaftFollowerLeave needs to call Leave on a member other than the one that is leaving the cluster. Otherwise, this call can hang forever, because the machine can be removed from the cluster before it gets confirmation that it has left.
  • Fix deregisterNode not to check the error from Close. This can return an error when the connection was not yet open to begin with.

I left these changes in a separate commit so they can be reviewed more easily.
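
A hypothetical sketch of the shutdown-locking pattern from the first bullet above; the names (Node, ProcessRaftMessage, Message) are illustrative rather than the exact swarm-v2 signatures:

package raft

import (
	"errors"
	"sync"
)

type Message struct{} // placeholder for the real RPC message type

type Node struct {
	stopMu  sync.RWMutex // guards stopped
	stopped bool
}

// ProcessRaftMessage holds a read lock for the duration of the call,
// so it can never observe a half-torn-down node.
func (n *Node) ProcessRaftMessage(msg *Message) error {
	n.stopMu.RLock()
	defer n.stopMu.RUnlock()
	if n.stopped {
		return errors.New("raft: node is stopped")
	}
	// ... feed msg into the raft state machine ...
	return nil
}

// Shutdown takes the write lock, so it waits for in-flight RPCs to
// finish and blocks new ones before tearing state down.
func (n *Node) Shutdown() {
	n.stopMu.Lock()
	defer n.stopMu.Unlock()
	n.stopped = true
	// ... close connections, stop the ticker, etc. ...
}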

This seems pretty solid now. It survived about 250 runs without failures (though somehow this triggered a kernel panic on my laptop!).

Removing WIP label. PTAL

@aaronlehmann aaronlehmann changed the title [WIP] Use an artificial timebase for raft tests Use an artificial timebase for raft tests Apr 6, 2016
@docker-codecov-bot

Current coverage is 57.84%

Merging #255 into master will increase coverage by +0.23% as of 547188b

@@            master    #255   diff @@
======================================
  Files           45      45       
  Stmts         6011    6030    +19
  Branches       896     894     -2
  Methods          0       0       
======================================
+ Hit           3463    3488    +25
  Partial        487     487       
+ Missed        2061    2055     -6

Review entire Coverage Diff as of 547188b


Uncovered Suggestions

  1. +0.70% via ...ager/state/memory.go#293...334
  2. +0.38% via agent/session.go#187...209
  3. +0.30% via agent/agent.go#542...559
  4. See 7 more...

Powered by Codecov. Updated on successful CI builds.

@aaronlehmann
Collaborator Author

Rebased.

@stevvooe
Contributor

stevvooe commented Apr 6, 2016

When we want to wait for something async to happen, use polling so we
can catch it quickly without a fixed sleep.

I'm inexperienced with swapping out a clock to create determinism. Typically, I've found it more robust to make the conditions under test actually detectable (@aaronlehmann, you've seen firsthand my failed attempts at this at times). The other issue that comes to mind is that not all components will let us inject this timebase. One example would be context timeouts, but there may be others.

I may be wildly wrong here, and I don't have a good suggestion for how to improve this particular code base to make it more amenable to an "advancement" style of testing. I've run into similar problems with testing the agent, so I'll see if I can better understand the code structure in this area to make this less problematic.

@aaronlehmann
Collaborator Author

I'm inexperienced with swapping out a clock to create determinism. Typically, I've found it more robust to make the conditions under test actually detectable (@aaronlehmann, you've seen firsthand my failed attempts at this at times). The other issue that comes to mind is that not all components will let us inject this timebase. One example would be context timeouts, but there may be others.

Let's break this down into two things:

(1) We want tests to behave deterministically. For example, we don't want code like:

triggerAsyncEvent()
time.Sleep(time.Second)
assert.True(t, eventHasHappened())

(2) It's desirable to run these tests faster than real-time. A good example is that a raft election takes a certain number of ticks. If each tick represents a second of real time, that makes the tests relatively slow.

The primary point of this PR is to address (2), but in doing so, there's extra motivation to deal with (1). In a lot of cases, the code on master just sleeps for 500 ms to wait for something to happen. Now that the tests don't inherently need several seconds each to run, it's especially irritating to have those 500 ms sleeps that may be flaky. So as part of this PR, I've removed those sleeps and replaced them with polling. But it's secondary to the main improvement of using a fake timebase, and could have been done as a separate change.
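
The polling in question is the usual condition-poll loop; a hypothetical helper (not the exact one used in the tests) would look like:

package raft_test

import (
	"testing"
	"time"
)

// pollUntil retries cond every 10 ms until it returns true or the
// timeout expires, failing the test on timeout.
func pollUntil(t *testing.T, timeout time.Duration, cond func() bool) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatal("condition not met before timeout")
}

// So instead of time.Sleep(500 * time.Millisecond) followed by an
// assertion, a test would call:
//   pollUntil(t, 5*time.Second, eventHasHappened)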

As for making the tests more event-driven, this is certainly possible, but would require exposing a lot more raft internals in an event-driven way. For example, right now we poll the cluster members to see that they have agreed on a leader, a term value, and an applied index. We could potentially expose channels for all of these things so we do this in an event-triggered way, but it would come at the cost of extra complexity in the raft code (though now that I think about it, maybe using the watch package would make it not so bad). If there is a need for the actual program to know about these things, providing events is almost certainly the right way to do it. But I'm not sure I want to create this event mechanism if polling works well enough for unit tests.

Hope this helps provide some context. These are some pretty interesting topics, and I'm definitely interested in continuing to discuss them and iterating on the approach. My main concern is that the tests accurately cover what we need them to. One could argue that hypothetical races could trigger when the state machine runs in real time but not when we advance time artificially, which would be an argument against using this fake timebase at all. But the flip side is that slow tests that take minutes to run instead of seconds discourage comprehensive test coverage. Perhaps as a compromise, we could use the fake timebase for go test -short, and otherwise run the tests both ways.
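
That compromise could be wired up roughly like this (a sketch; newTestClock is a hypothetical helper, and a test that needs Increment would keep the concrete *FakeClock rather than the interface):

package raft_test

import (
	"testing"
	"time"

	"github.com/pivotal-golang/clock"
	"github.com/pivotal-golang/clock/fakeclock"
)

// newTestClock picks the timebase per run: fake under `go test -short`,
// real otherwise, so full runs still exercise real-time behavior.
func newTestClock() clock.Clock {
	if testing.Short() {
		return fakeclock.NewFakeClock(time.Now())
	}
	return clock.NewClock()
}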

@aaronlehmann
Collaborator Author

Also, on the point of not all components allowing us to inject this timebase: I don't think that's really a concern. The reason for the fake timebase is to let raft ticks not take a significant amount of real time (think of it as changing the tick time to 1 ms, except that if we actually did that, we'd make the tick time unreasonably fast for the state machine to keep up, and would probably cause instability). But for things like connection timeouts, we still want real time to apply, at least for all the existing tests.

@stevvooe
Contributor

stevvooe commented Apr 6, 2016

Well put.

Ideally, we could switch out the time.Sleep to something like this:

triggerAsyncEvent()
waitForEventToHappen()
assert.True(t, eventHasHappened())

To deal with the general length of the tests, the actual parameters for the cluster could be reduced accordingly (i.e. use time.Millisecond instead of time.Second).
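
For instance, a sketch with illustrative names (not the actual swarm-v2 configuration):

package raft_test

import "time"

// Run on the real clock, but compress the parameters so an election
// takes milliseconds instead of seconds.
type testClusterConfig struct {
	TickInterval  time.Duration
	ElectionTicks int
}

var fastCluster = testClusterConfig{
	TickInterval:  time.Millisecond, // instead of time.Second
	ElectionTicks: 3,                // an election still needs 3 ticks
}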

You can see my general struggle here. I'll continue to experiment with this, as the agent code base is much smaller.

LGTM

@abronan
Contributor

abronan commented Apr 7, 2016

As for making the tests more event-driven, this is certainly possible, but would require exposing a lot more raft internals in an event-driven way

Can't agree more; we can definitely do that, but I'm not sure it's worth the extra cost in complexity, especially if it only benefits tests.

@aaronlehmann My only question would be: is there any benefit to using pivotal-golang/clock over jonboulle/clockwork, which is used by etcd? Cockroach also has its own fake-clock implementation for tests. They might all be very similar, but I'm just wondering.

LGTM

(we went from 26s in the CI for store tests to 3.5s 🎉 🎈)

@aaronlehmann
Collaborator Author

@abronan: I looked at clockwork first, but it doesn't support tickers, so I looked for alternatives and found pivotal-golang/clock, which gave a drop-in replacement for time.NewTicker.
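
For reference, the subset of the library's interface that mattered here, abridged from github.com/pivotal-golang/clock (the real interface has a few more methods, such as Since and After):

package clock // abridged sketch, not the full package

import "time"

// Clock stands in for package time, so tests can substitute a fake.
type Clock interface {
	Now() time.Time
	Sleep(d time.Duration)
	NewTimer(d time.Duration) Timer
	NewTicker(d time.Duration) Ticker
}

// Ticker mirrors *time.Ticker behind an interface; this is the piece
// clockwork lacked at the time.
type Ticker interface {
	C() <-chan time.Time
	Stop()
}

// Timer likewise mirrors *time.Timer.
type Timer interface {
	C() <-chan time.Time
	Reset(d time.Duration) bool
	Stop() bool
}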

@aaronlehmann aaronlehmann merged commit 87bac4f into moby:master Apr 7, 2016
@aaronlehmann aaronlehmann deleted the clockwork branch April 7, 2016 01:44
@stevvooe
Contributor

stevvooe commented Apr 8, 2016

Can't agree more; we can definitely do that, but I'm not sure it's worth the extra cost in complexity, especially if it only benefits tests.

I'm not sure if I agree with this statement. If we can't detect the desired states, we probably aren't getting it right.

@abronan
Contributor

abronan commented Apr 8, 2016

@stevvooe Polling is a means like any other of detecting that values have converged. In that regard, it should be enough for unit tests.

Detecting the desired state with an event model in unit tests (in contrast to just polling the values) does not make our distributed-system model more sound, because in a real environment it will be subject to network partitions, message replays, etc. The event model would just make our tests more convenient to read and write, because detecting convergence would be easier through watches/channels. But as Aaron rightly mentioned, it comes at the cost of extra complexity in our raft internals.

The only purpose of raft is to make sure that the state is consistent across a set of machines by making them agree on values and respect a total order. Neither the leader nor any other machine ever has to detect whether the state applied locally on raft members has converged. Whether a machine has applied the logs, or has written a log entry to the persistent db, is none of our business, because we assume it will (at our level at least).

Also, that does not mean we should ignore network partitions, etc. in tests. We could cover those in automated scale tests with tools like Jepsen (simulating failures) and analyze the db traces to detect inconsistencies in the desired state.
