Use an artificial timebase for raft tests #255

Merged
aaronlehmann merged 2 commits into moby:master on Apr 7, 2016

Conversation

aaronlehmann
Collaborator

Instead of running the raft tick on a real time clock and using sleeps
to wait for time to elapse, use github.com/pivotal-golang/clock to fake
a timebase that we can advance at will.

When we want to advance the raft state machine's view of time, call the
Increment function to fake it.

When we want to wait for something async to happen, use polling so we
can catch it quickly without a fixed sleep.
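
To illustrate the pattern, here is a minimal sketch (hypothetical test code, not part of this PR) using the fakeclock package; WaitForWatcherAndIncrement is used instead of a bare Increment so the test does not race the ticker's registration:

package raft_test

import (
	"testing"
	"time"

	"github.com/pivotal-golang/clock/fakeclock"
)

func TestFakeTimebaseSketch(t *testing.T) {
	// Start a fake clock at an arbitrary instant.
	clk := fakeclock.NewFakeClock(time.Now())

	// The code under test receives clk and calls NewTicker on it,
	// exactly as it would on a real clock.
	ticker := clk.NewTicker(time.Second)
	defer ticker.Stop()

	// Advance the timebase at will; no real second ever elapses.
	for i := 0; i < 3; i++ {
		clk.WaitForWatcherAndIncrement(time.Second)
		select {
		case <-ticker.C():
		case <-time.After(time.Second):
			t.Fatal("tick did not fire")
		}
	}
}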

Before: ok github.com/docker/swarm-v2/manager/state 20.064s
After: ok github.com/docker/swarm-v2/manager/state 1.257s

This should also hopefully make the tests more deterministic.

Fixes #182

cc @abronan @stevvooe

@aaronlehmann aaronlehmann changed the title Use an artificial timebase for raft tests [WIP] Use an artificial timebase for raft tests Apr 6, 2016
@aaronlehmann
Collaborator Author

Changed to WIP. Just noticed that the checks on listMembers can race with replaying the raft logs and cause tests to fail. This is not new, but it's much easier to trigger with the accelerated clock. Will have to add polling for those checks.

This may explain why the tests seem to be hanging in CI. I haven't been able to reproduce this behavior locally.

@aaronlehmann
Collaborator Author

Pushed some fixes for problems exposed by this change. Some of them are actual issues with the raft code.

  • Add locking around the shutdown process to prevent asynchronous RPCs from crashing. Fixes #148 (Race and panic in ProcessRaftMessage); a sketch of the pattern follows below.
  • Close open GRPC connections when shutting down a raft member. Avoids leaking those connections in the tests.
  • Make waitForCluster wait until all members have applied the same portion of the log to their state machine.
  • TestRaftFollowerLeave needs to call Leave on a member other than the one that is leaving the cluster. Otherwise, this call can hang forever, because the machine can be removed from the cluster before it gets confirmation that it has left.
  • Fix deregisterNode not to check the error from Close. This can return an error when the connection was not yet open to begin with.

I left these changes in a separate commit so they can be reviewed more easily.
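
A hypothetical sketch of the shutdown-locking pattern from the first bullet above; the names (Node, ProcessRaftMessage, Message) are illustrative rather than the exact swarm-v2 signatures:

package raft

import (
	"errors"
	"sync"
)

type Message struct{} // placeholder for the real RPC message type

type Node struct {
	stopMu  sync.RWMutex // guards stopped
	stopped bool
}

// ProcessRaftMessage holds a read lock for the duration of the call,
// so it can never observe a half-torn-down node.
func (n *Node) ProcessRaftMessage(msg *Message) error {
	n.stopMu.RLock()
	defer n.stopMu.RUnlock()
	if n.stopped {
		return errors.New("raft: node is stopped")
	}
	// ... feed msg into the raft state machine ...
	return nil
}

// Shutdown takes the write lock, so it waits for in-flight RPCs to
// finish and blocks new ones before tearing state down.
func (n *Node) Shutdown() {
	n.stopMu.Lock()
	defer n.stopMu.Unlock()
	n.stopped = true
	// ... close connections, stop the ticker, etc. ...
}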

This seems pretty solid now. It survived about 250 runs without failures (though somehow this triggered a kernel panic on my laptop!).

Removing WIP label. PTAL

@aaronlehmann aaronlehmann changed the title [WIP] Use an artificial timebase for raft tests Use an artificial timebase for raft tests Apr 6, 2016
@docker-codecov-bot

Current coverage is 57.84%

Merging #255 into master will increase coverage by +0.23% as of 547188b

@@            master    #255   diff @@
======================================
  Files           45      45       
  Stmts         6011    6030    +19
  Branches       896     894     -2
  Methods          0       0       
======================================
+ Hit           3463    3488    +25
  Partial        487     487       
+ Missed        2061    2055     -6

Review entire Coverage Diff as of 547188b


Uncovered Suggestions

  1. +0.70% via ...ager/state/memory.go#293...334
  2. +0.38% via agent/session.go#187...209
  3. +0.30% via agent/agent.go#542...559
  4. See 7 more...

Powered by Codecov. Updated on successful CI builds.

@aaronlehmann
Collaborator Author

Rebased.

@stevvooe
Contributor

stevvooe commented Apr 6, 2016

When we want to wait for something async to happen, use polling so we
can catch it quickly without a fixed sleep.

I'm inexperienced with swapping out a clock to create determinism. Typically, I've found it more robust to make the conditions under test actually detectable (@aaronlehmann, you've seen firsthand my failed attempts at this at times). The other issue that comes to mind is that not all components will let us inject this timebase. One example would be context timeouts, but there may be others.

I may be wildly wrong here, and I don't have a good suggestion for how to improve this particular code base to make it more amenable to an "advancement" style of testing. I've run into similar problems with testing the agent, so I'll see if I can better understand the code structure in this area to make this less problematic.

@aaronlehmann
Collaborator Author

I'm inexperienced with swapping out a clock to create determinism. Typically, I've found it more robust to make the conditions under test actually detectable (@aaronlehmann, you've seen firsthand my failed attempts at this at times). The other issue that comes to mind is that not all components will let us inject this timebase. One example would be context timeouts, but there may be others.

Let's break this down into two things:

(1) We want tests to behave deterministically. For example, we don't want code like:

triggerAsyncEvent()
time.Sleep(time.Second)
assert.True(t, eventHasHappened())

(2) It's desirable to run these tests faster than real-time. A good example is that a raft election takes a certain number of ticks. If each tick represents a second of real time, that makes the tests relatively slow.

The primary point of this PR is to address (2), but in doing so, there's extra motivation to deal with (1). In a lot of cases, the code on master just sleeps for 500 ms to wait for something to happen. Now that the tests don't inherently need several seconds each to run, it's especially irritating to have those 500 ms sleeps that may be flaky. So as part of this PR, I've removed those sleeps and replaced them with polling. But it's secondary to the main improvement of using a fake timebase, and could have been done as a separate change.
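
The polling in question is the usual condition-poll loop; a hypothetical helper (not the exact one used in the tests) would look like:

package raft_test

import (
	"testing"
	"time"
)

// pollUntil retries cond every 10 ms until it returns true or the
// timeout expires, failing the test on timeout.
func pollUntil(t *testing.T, timeout time.Duration, cond func() bool) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatal("condition not met before timeout")
}

// So instead of time.Sleep(500 * time.Millisecond) followed by an
// assertion, a test would call:
//   pollUntil(t, 5*time.Second, eventHasHappened)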

As for making the tests more event-driven, this is certainly possible, but would require exposing a lot more raft internals in an event-driven way. For example, right now we poll the cluster members to see that they have agreed on a leader, a term value, and an applied index. We could potentially expose channels for all of these things so we do this in an event-triggered way, but it would come at the cost of extra complexity in the raft code (though now that I think about it, maybe using the watch package would make it not so bad). If there is a need for the actual program to know about these things, providing events is almost certainly the right way to do it. But I'm not sure I want to create this event mechanism if polling works well enough for unit tests.

Hope this helps provide some context. These are some pretty interesting topics, and I'm definitely interested in continuing to discuss them and iterating on the approach. My main concern is that the tests accurately cover what we need them to. One could argue that hypothetical races could trigger when the state machine runs in real time but not when we advance time artificially, which would be an argument against using this fake timebase at all. But the flip side is that slow tests that take minutes to run instead of seconds discourage comprehensive test coverage. Perhaps as a compromise, we could use the fake timebase for go test -short, and otherwise run the tests both ways.
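
That compromise could be wired up roughly like this (a sketch; newTestClock is a hypothetical helper, and a test that needs Increment would keep the concrete *FakeClock rather than the interface):

package raft_test

import (
	"testing"
	"time"

	"github.com/pivotal-golang/clock"
	"github.com/pivotal-golang/clock/fakeclock"
)

// newTestClock picks the timebase per run: fake under `go test -short`,
// real otherwise, so full runs still exercise real-time behavior.
func newTestClock() clock.Clock {
	if testing.Short() {
		return fakeclock.NewFakeClock(time.Now())
	}
	return clock.NewClock()
}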

@aaronlehmann
Collaborator Author

Also, on the point of not all components allowing us to inject this timebase: I don't think that's really a concern. The reason for the fake timebase is to let raft ticks not take a significant amount of real time (think of it as changing the tick time to 1 ms, except that if we actually did that, we'd make the tick time unreasonably fast for the state machine to keep up, and would probably cause instability). But for things like connection timeouts, we still want real time to apply, at least for all the existing tests.

@stevvooe
Contributor

stevvooe commented Apr 6, 2016

Well put.

Ideally, we could switch out the time.Sleep to something like this:

triggerAsyncEvent()
waitForEventToHappen()
assert.True(t, eventHasHappened())

To deal with the general length of the tests, the actual parameters for the cluster could be reduced accordingly (i.e. use time.Millisecond instead of time.Second).
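
For instance, a sketch with illustrative names (not the actual swarm-v2 configuration):

package raft_test

import "time"

// Run on the real clock, but compress the parameters so an election
// takes milliseconds instead of seconds.
type testClusterConfig struct {
	TickInterval  time.Duration
	ElectionTicks int
}

var fastCluster = testClusterConfig{
	TickInterval:  time.Millisecond, // instead of time.Second
	ElectionTicks: 3,                // an election still needs 3 ticks
}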

You can see my general struggle here. I'll continue to experiment with this, as the agent code base is much smaller.

LGTM

@abronan
Contributor

abronan commented Apr 7, 2016

As for making the tests more event-driven, this is certainly possible, but would require exposing a lot more raft internals in an event-driven way

Can't agree more; we can definitely do that, but I'm not sure it's worth the extra cost in complexity, especially if it only benefits tests.

@aaronlehmann My only question would be: is there any benefit to using pivotal-golang/clock over jonboulle/clockwork, which is used by etcd? Cockroach also has its own fake-clock implementation for tests. They might all be very similar, but I'm just wondering.

LGTM

(we went from 26s in the CI for store tests to 3.5s 🎉 🎈)

@aaronlehmann
Collaborator Author

@abronan: I looked at clockwork first, but it doesn't support tickers, so I looked for alternatives and found pivotal-golang/clock, which gave a drop-in replacement for time.NewTicker.
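
For reference, the subset of the library's interface that mattered here, abridged from github.com/pivotal-golang/clock (the real interface has a few more methods, such as Since and After):

package clock // abridged sketch, not the full package

import "time"

// Clock stands in for package time, so tests can substitute a fake.
type Clock interface {
	Now() time.Time
	Sleep(d time.Duration)
	NewTimer(d time.Duration) Timer
	NewTicker(d time.Duration) Ticker
}

// Ticker mirrors *time.Ticker behind an interface; this is the piece
// clockwork lacked at the time.
type Ticker interface {
	C() <-chan time.Time
	Stop()
}

// Timer likewise mirrors *time.Timer.
type Timer interface {
	C() <-chan time.Time
	Reset(d time.Duration) bool
	Stop() bool
}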

@aaronlehmann aaronlehmann merged commit 87bac4f into moby:master Apr 7, 2016
@aaronlehmann aaronlehmann deleted the clockwork branch April 7, 2016 01:44
@stevvooe
Contributor

stevvooe commented Apr 8, 2016

Can't agree more; we can definitely do that, but I'm not sure it's worth the extra cost in complexity, especially if it only benefits tests.

I'm not sure if I agree with this statement. If we can't detect the desired states, we probably aren't getting it right.

@abronan
Contributor

abronan commented Apr 8, 2016

@stevvooe Polling is a means like any other of detecting that values have converged. In that regard, it should be enough for unit tests.

Detecting the desired state with an event model in unit tests (in contrast to just polling the values) does not make our distributed-system model more sound, because in a real environment it will be subject to network partitions, message replays, etc. The event model would just make our tests more convenient to read and write, because detecting convergence would be easier through watches/channels. But as Aaron rightly mentioned, it comes at the cost of extra complexity in our raft internals.

The only purpose of raft is to make sure that the state is consistent across a set of machines by making them agree on values and respect a total order. Neither the leader nor any other machine ever has to detect whether the state applied locally on raft members has converged. Whether a machine has applied the logs, or has written a log entry to the persistent db, is none of our business, because we assume it will (at our level at least).

Also, that does not mean we should ignore network partitions, etc. in tests. We could cover those in automated scale tests with tools like Jepsen (simulating failures) and analyze the db traces to detect inconsistencies in the desired state.
