[SPARK-18471][CORE] New treeAggregate overload for big large aggregators #10
```diff
@@ -1111,21 +1111,24 @@ abstract class RDD[T: ClassTag](
   /**
    * Aggregates the elements of this RDD in a multi-level tree pattern.
    *
+   * This variant takes a function that generates the zero value, which allows it
+   * to run efficiently on big aggregation structures such as large dense vectors.
+   *
    * @param depth suggested depth of the tree (default: 2)
    * @see [[org.apache.spark.rdd.RDD#aggregate]]
    */
-  def treeAggregate[U: ClassTag](zeroValue: U)(
+  def treeAggregateWithZeroGenerator[U: ClassTag](zeroValueGenerator: () => U)(
       seqOp: (U, T) => U,
       combOp: (U, U) => U,
       depth: Int = 2): U = withScope {
     require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
     if (partitions.length == 0) {
-      Utils.clone(zeroValue, context.env.closureSerializer.newInstance())
+      Utils.clone(zeroValueGenerator(), context.env.closureSerializer.newInstance())
     } else {
       val cleanSeqOp = context.clean(seqOp)
       val cleanCombOp = context.clean(combOp)
       val aggregatePartition =
-        (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
+        (it: Iterator[T]) => it.aggregate(zeroValueGenerator())(cleanSeqOp, cleanCombOp)
       var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))
       var numPartitions = partiallyAggregated.partitions.length
       val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
```

Review thread on the `zeroValueGenerator` parameter:

**Reviewer:** You could use a lazy parameter here instead of an explicit closure, see e.g. this blog post.

**Author:** Great suggestion! I'm not sure however how …

**Reviewer:** According to the blog post, it should be exactly what we want. But yeah, we need to test.

**Author:** Let's discuss/test this live, because I reckon that the value will be instantiated when the serializer is called, which is NOT what we want.
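The by-name alternative raised in the thread can be illustrated without Spark. Below is a minimal sketch (hypothetical names, not from the PR) contrasting an explicit `() => U` generator with a by-name `=> U` parameter; the key behavioral difference is that a by-name parameter is re-evaluated on every reference in the method body, while the explicit closure is evaluated only where it is called:

```scala
object ByNameDemo {
  var evaluations = 0

  // Explicit closure: evaluation happens exactly where we invoke the generator.
  def zeroViaClosure[U](zeroValueGenerator: () => U): U = zeroValueGenerator()

  // By-name parameter: the compiler wraps the argument in a thunk, and the
  // thunk is re-evaluated each time the parameter is referenced in the body.
  def zeroViaByName[U](zeroValue: => U): (U, U) = (zeroValue, zeroValue)

  def main(args: Array[String]): Unit = {
    def freshZero: Array[Double] = { evaluations += 1; new Array[Double](4) }

    zeroViaClosure(() => freshZero)
    assert(evaluations == 1)          // generator invoked exactly once

    val (b, c) = zeroViaByName(freshZero)
    assert(evaluations == 3)          // two more evaluations, one per reference
    assert(!(b eq c))                 // two distinct arrays, not one shared zero
    println("by-name vs closure demo passed")
  }
}
```

This also hints at the author's serialization worry: whichever form is used, the zero must still be produced lazily on the executor rather than captured eagerly when the task closure is serialized.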
```diff
@@ -1144,6 +1147,18 @@ abstract class RDD[T: ClassTag](
     }
   }

+  /**
+   * Aggregates the elements of this RDD in a multi-level tree pattern.
+   *
+   * @param depth suggested depth of the tree (default: 2)
+   * @see [[org.apache.spark.rdd.RDD#aggregate]]
+   */
+  def treeAggregate[U: ClassTag](zeroValue: U)(
+      seqOp: (U, T) => U,
+      combOp: (U, U) => U,
+      depth: Int = 2): U =
+    treeAggregateWithZeroGenerator(() => zeroValue)(seqOp, combOp, depth)
+
   /**
    * Return the number of elements in the RDD.
    */
```

Review thread on the re-added `treeAggregate` declaration:

**Reviewer:** OPT: Please keep BFS order of method declarations.

**Author:** BFS? What do you mean?

**Reviewer:** Breadth-first (search) order.

**Author:** done
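The motivation for the new overload can be sketched without a Spark cluster. The stand-in below (hypothetical names, local `Seq`s playing the role of partitions) mimics the shape of `treeAggregateWithZeroGenerator`: each partition builds its own zero from the generator, so a large zero value never has to be serialized into the task closure and shipped to executors:

```scala
object ZeroGeneratorSketch {
  // Local stand-in for the new overload. Each "partition" gets a fresh zero
  // from the generator. A real tree would combine partials level by level;
  // a single reduce is enough to show the API shape. Assumes >= 1 partition.
  def treeAggregateWithZeroGenerator[T, U](partitions: Seq[Seq[T]])
      (zeroValueGenerator: () => U)
      (seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    val partials = partitions.map(p => p.foldLeft(zeroValueGenerator())(seqOp))
    partials.reduceLeft(combOp)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Seq(1, 2), Seq(3, 4), Seq(5))
    // The zero is a (potentially large) dense vector, built on demand
    // per partition instead of being cloned from one captured value.
    val sum = treeAggregateWithZeroGenerator(data)(() => new Array[Long](1))(
      (acc, x) => { acc(0) += x; acc },
      (a, b) => { a(0) += b(0); a })
    assert(sum(0) == 15L)
    println(sum(0))
  }
}
```

With the original `treeAggregate`, the caller would pass the fully materialized zero vector itself, and it would travel with the serialized closure; with the generator form only the (tiny) function is shipped.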
A final review thread on the Scaladoc wording:

**Reviewer:** I think there should be no mentions of vector manipulations in core APIs.

**Author:** done