-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-10372] [CORE] basic test framework for entire spark scheduler #8559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #41877 has finished for PR 8559 at commit
|
|
Jenkins, retest this please |
| @@ -1123,8 +1133,15 @@ class DAGScheduler( | |||
| // TODO: Cancel running tasks in the stage | |||
| logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " + | |||
| s"$failedStage (${failedStage.name}) due to fetch failure") | |||
| // We might get lots of fetch failed for this stage, from lots of executors. | |||
| // Its better if we can resubmit for all the failed executors at one time, so lets | |||
| // just wait a *bit* before we resubmit. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea if this comment is accurate or not -- I just found this pretty confusing and felt it deserved some comment, so I took my best guess. The tests in DAGSchedulerSuite pass if you just post this event directly (with other corresponding changes to go along with it ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #8560 to show that you can remove the messageScheduler completely (the tests pass, anyway ...)
|
Test build #41878 has finished for PR 8559 at commit
|
844bd33 to
093a643
Compare
|
Test build #43989 has finished for PR 8559 at commit
|
|
Test build #45102 has finished for PR 8559 at commit
|
|
Test build #45448 has finished for PR 8559 at commit
|
|
Test build #47164 has finished for PR 8559 at commit
|
d61b13b to
50d9162
Compare
|
Test build #58266 has finished for PR 8559 at commit
|
|
Test build #58275 has finished for PR 8559 at commit
|
46ebdb6 to
6e8562f
Compare
|
Test build #58590 has finished for PR 8559 at commit
|
6e8562f to
0b6da39
Compare
|
Test build #58593 has finished for PR 8559 at commit
|
0b6da39 to
c091187
Compare
|
Test build #58704 has finished for PR 8559 at commit
|
|
@mateiz @markhamstra @kayousterhout I've updated this to avoid changing the scheduler at all, and instead just run the mock backend in another thread. It slightly complicates writing the tests, but avoids a (small) race in the tests and was probably necessary when we got to more complicated tests in any case. In particular I want this for more tests on blacklisting (SPARK-8426), but I think there are a handful of open scheduler issues which will benefit from it. |
|
Test build #58712 has finished for PR 8559 at commit
|
|
sorry one more update -- I realized that some of the abstractions were too complicated for the simple tests I had, they were a step towards some more complex tests involving multiple executors. I decided to just go ahead and add an example of those tests as well (which actually required some other minor changes in any case). I'm sure there will be more to add to the framework at some point, but I think this is a good start for letting us add more tests. |
|
Jenkins, retest this please |
|
Test build #59038 has finished for PR 8559 at commit
|
| val numUpdates = in.readInt | ||
| if (numUpdates == 0) { | ||
| accumUpdates = null | ||
| accumUpdates = Seq() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little surprised that we were getting away with that, given the number of places that are using DirectTaskResult#accumUpdates without checking for null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, likewise. I guess in all real spark jobs, the number of accum updates is > 0, which is why this didn't particularly matter except for these mocks. OTOH if this has any real danger we should just fix this separately (and be sure to include it in 2.0)
|
@markhamstra thanks for the review. I went ahead and added one more commit and removed |
|
@squito I don't feel really strongly about it, either. My only concern was for others adding tests needing a mock ExternalClusterManager in the future and not knowing which one to use, why they are different, whether any new mocking needs to be added to both, etc. If someone does feel strongly about maintaining the separation, then we can put things back the way you had them. |
|
Test build #59130 has finished for PR 8559 at commit
|
|
@markhamstra alright, in that case do you have any objections if I merge this, one commit back? I'll throw in a comment on DummyExternalClusterManager pointing to MockExternalClusterManager, as I think most folks writing tests would rather start from there. |
|
@squito Yeah, that's fine. I haven't gone through the new tests closely to make sure that they are doing what they say they are doing, but the changes to both non-test code and previous tests look safe. |
1fdf2aa to
278122e
Compare
|
Test build #59226 has finished for PR 8559 at commit
|
|
This is pretty cool! |
|
on a related note, @squito can you in the future leave a msg indicating the branch a pr was merged once you merge it? There have been cases that lead to race conditions in merging and also mistakes in the branches that we needed to go back and audit. |
|
sure thing, sorry about that @rxin ! fwiw, this was just merged to master. |
|
np - a lot of people forget to do this. I am just bugging everybody :) |
|
btw i've seen a few flaky tests introduced by this one. If we see more of them we might need to disable and increase timeouts. |
|
Oh thanks for pointing out the flaky test, I think I figured out the problem, opened https://issues.apache.org/jira/browse/SPARK-15714. Weird I never this running locally -- I'd expect this race to be pretty common. Just tried and an on my laptop it only shows up once in 100 runs. |
|
About the flaky test (checked locally on master through my IDE its easily reproducible): `Map(8 -> 42, 7 -> 42) was not empty .... |
|
Locally i get now: Futures timed out after [1 second] |
|
@skonto since you seem to be able to reproduce this regularly though I can't -- can you try just increasing the timeouts? honestly I'm surprised you're hitting them, but if we just need to bump up the timeouts a little more, we can do that. If there is something else wrong, eg. another race, than lets turn these tests off for now till I can nail it down. |
|
@squito not really check here: |
|
@skonto agreed, I have opened https://issues.apache.org/jira/browse/SPARK-15783 + PR #13528 to turn them off for now. I was at least able to reproduce the failures with 5k runs on my laptop, (also seems worse when I run all the different tests together vs just running one test repeatedly) so I will try to nail down the cause more before adding back. |
This is a basic framework for testing the entire scheduler. The tests this adds aren't very interesting -- the point of this PR is just to setup the framework, to keep the initial change small, but it can be built upon to test more features (eg., speculation, killing tasks, blacklisting, etc.).