[SPARK-5374][CORE] abstract RDD's DAG graph iteration in DAGScheduler#4134
cloud-fan wants to merge 3 commits into apache:master from
The rdd visited in getMissingParentStages is not always the stage's rdd. A stage contains a sequence of RDDs, and the stage's rdd is the last (child) one in that sequence. If an RDD that comes before the child rdd within the stage is cached, this line cannot filter out that stage, but getMissingParentStages handles that case.

ping @JoshRosen

I'm pretty busy with other work at the moment, so it'll be a little while before I can actually review this, but I'd be glad to let Jenkins test it to see whether it uncovers any problems (like I hit in my original patch). Jenkins, this is ok to test.

Thanks for doing it. I took a quick look at this. While it does reduce the LOC, I feel the change is not necessary and actually makes the code harder to understand with the closures. Do we really want something like this?

I'll take a deeper look over the weekend, but on a first pass I had a similar reaction to @rxin -- I'm not seeing a lot of benefit in terms of code clarity or maintainability, and we tend to avoid making changes to the DAGScheduler that don't offer significant benefits.

Can one of the admins verify this patch?

Closing this one, will do a more meaningful |
There are many methods in DAGScheduler that traverse an RDD's DAG, such as getParentStages, getMissingParentStages, and so on. We should abstract this traversal to reduce code size.
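
To make the idea concrete, here is a rough sketch of what such an abstraction might look like. It is not the code from this patch and does not use Spark's classes: Node, Dep, Narrow, Shuffle, walk, shuffleParents and missingShuffleParents are all hypothetical toy names, and the cached flag stands in for the scheduler's cache-location check. One stack-based traversal takes a closure that decides, per dependency, whether to keep walking; the two finder methods are then thin wrappers around it.

```scala
import scala.collection.mutable

object DagSketch {
  // Toy stand-ins for RDDs and their dependencies (hypothetical, not Spark's API).
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep
  final case class Node(id: Int, deps: Seq[Dep] = Nil, cached: Boolean = false)

  // Shared traversal: walk the graph from `root` with an explicit stack
  // (no recursion), visiting each node once. `follow` is called for every
  // dependency of a visited node and returns true to continue into its parent.
  def walk(root: Node)(follow: (Node, Dep) => Boolean): Unit = {
    val visited = mutable.HashSet[Int]()
    var stack: List[Node] = List(root)
    while (stack.nonEmpty) {
      val node = stack.head
      stack = stack.tail
      if (visited.add(node.id)) {
        node.deps.foreach { dep =>
          if (follow(node, dep)) stack = dep.parent :: stack
        }
      }
    }
  }

  // Rough analogue of getParentStages: collect the parents that sit behind
  // a shuffle boundary, continuing only through narrow dependencies.
  def shuffleParents(root: Node): Set[Int] = {
    val found = mutable.Set[Int]()
    walk(root) { (_, dep) =>
      dep match {
        case Shuffle(p) =>
          found += p.id
          false
        case Narrow(_) => true
      }
    }
    found.toSet
  }

  // Rough analogue of getMissingParentStages: same walk, but a cached node
  // ends the search along that path, and shuffle parents listed in
  // `available` are not reported as missing.
  def missingShuffleParents(root: Node, available: Set[Int]): Set[Int] = {
    val missing = mutable.Set[Int]()
    walk(root) { (node, dep) =>
      if (node.cached) false
      else dep match {
        case Shuffle(p) =>
          if (!available(p.id)) missing += p.id
          false
        case Narrow(_) => true
      }
    }
    missing.toSet
  }

  // Example graph: node 4 depends narrowly on cached node 3 (which shuffles
  // from node 1) and shuffles from node 2.
  val src1 = Node(1)
  val src2 = Node(2)
  val mid = Node(3, Seq(Shuffle(src1)), cached = true)
  val joined = Node(4, Seq(Narrow(mid), Shuffle(src2)))
}
```

In the example graph, shuffleParents(joined) reports both shuffle parents, while missingShuffleParents(joined, Set.empty) reports only node 2, because the cached node 3 cuts off the path to node 1. The trade-off raised in review is visible here too: the wrappers are shorter, but the per-method behaviour now lives inside closures.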