
Commit abf1b78

update with apache master

2 parents 3ca68eb + f1069b8

57 files changed: +1517 additions, -990 deletions


docs/README.md

Lines changed: 6 additions & 1 deletion

@@ -43,7 +43,7 @@ You can modify the default Jekyll build as follows:
 ## Pygments
 
 We also use pygments (http://pygments.org) for syntax highlighting in documentation markdown pages,
-so you will also need to install that (it requires Python) by running `sudo easy_install Pygments`.
+so you will also need to install that (it requires Python) by running `sudo pip install Pygments`.
 
 To mark a block of code in your markdown to be syntax highlighted by jekyll during the compile
 phase, use the following sytax:

@@ -53,6 +53,11 @@ phase, use the following sytax:
 // supported languages too.
 {% endhighlight %}
 
+## Sphinx
+
+We use Sphinx to generate Python API docs, so you will need to install it by running
+`sudo pip install sphinx`.
+
 ## API Docs (Scaladoc and Sphinx)
 
 You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PROJECT_ROOT directory.

docs/graphx-programming-guide.md

Lines changed: 7 additions & 70 deletions

@@ -57,77 +57,15 @@ title: GraphX Programming Guide
 
 # Overview
 
-GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high level,
-GraphX extends the Spark [RDD](api/scala/index.html#org.apache.spark.rdd.RDD) by introducing the
-[Resilient Distributed Property Graph](#property_graph): a directed multigraph with properties
+GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level,
+GraphX extends the Spark [RDD](api/scala/index.html#org.apache.spark.rdd.RDD) by introducing a
+new [Graph](#property_graph) abstraction: a directed multigraph with properties
 attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental
 operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
-[aggregateMessages](#aggregateMessages)) as well as an optimized variant of the [Pregel](#pregel) API. In
-addition, GraphX includes a growing collection of graph [algorithms](#graph_algorithms) and
+[aggregateMessages](#aggregateMessages)) as well as an optimized variant of the [Pregel](#pregel) API. In addition, GraphX includes a growing collection of graph [algorithms](#graph_algorithms) and
 [builders](#graph_builders) to simplify graph analytics tasks.
 
 
-## Motivation
-
-From social networks to language modeling, the growing scale and importance of
-graph data has driven the development of numerous new *graph-parallel* systems
-(e.g., [Giraph](http://giraph.apache.org) and
-[GraphLab](http://graphlab.org)). By restricting the types of computation that can be
-expressed and introducing new techniques to partition and distribute graphs,
-these systems can efficiently execute sophisticated graph algorithms orders of
-magnitude faster than more general *data-parallel* systems.
-
-<p style="text-align: center;">
-  <img src="img/data_parallel_vs_graph_parallel.png"
-       title="Data-Parallel vs. Graph-Parallel"
-       alt="Data-Parallel vs. Graph-Parallel"
-       width="50%" />
-  <!-- Images are downsized intentionally to improve quality on retina displays -->
-</p>
-
-However, the same restrictions that enable these substantial performance gains also make it
-difficult to express many of the important stages in a typical graph-analytics pipeline:
-constructing the graph, modifying its structure, or expressing computation that spans multiple
-graphs. Furthermore, how we look at data depends on our objectives and the same raw data may have
-many different table and graph views.
-
-<p style="text-align: center;">
-  <img src="img/tables_and_graphs.png"
-       title="Tables and Graphs"
-       alt="Tables and Graphs"
-       width="50%" />
-  <!-- Images are downsized intentionally to improve quality on retina displays -->
-</p>
-
-As a consequence, it is often necessary to be able to move between table and graph views.
-However, existing graph analytics pipelines must compose graph-parallel and data-
-parallel systems, leading to extensive data movement and duplication and a complicated programming
-model.
-
-<p style="text-align: center;">
-  <img src="img/graph_analytics_pipeline.png"
-       title="Graph Analytics Pipeline"
-       alt="Graph Analytics Pipeline"
-       width="50%" />
-  <!-- Images are downsized intentionally to improve quality on retina displays -->
-</p>
-
-The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one
-system with a single composable API. The GraphX API enables users to view data both as a graph and
-as collections (i.e., RDDs) without data movement or duplication. By incorporating recent advances
-in graph-parallel systems, GraphX is able to optimize the execution of graph operations.
-
-<!-- ## GraphX Replaces the Spark Bagel API
-
-Prior to the release of GraphX, graph computation in Spark was expressed using Bagel, an
-implementation of Pregel. GraphX improves upon Bagel by exposing a richer property graph API, a
-more streamlined version of the Pregel abstraction, and system optimizations to improve performance
-and reduce memory overhead. While we plan to eventually deprecate Bagel, we will continue to
-support the [Bagel API](api/scala/index.html#org.apache.spark.bagel.package) and
-[Bagel programming guide](bagel-programming-guide.html). However, we encourage Bagel users to
-explore the new GraphX API and comment on issues that may complicate the transition from Bagel.
--->
-
 ## Migrating from Spark 1.1
 
 GraphX in Spark {{site.SPARK_VERSION}} contains a few user facing API changes:
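
As a quick illustration of the `aggregateMessages` operator that the rewritten overview points to, here is a minimal sketch (not part of the commit) that computes each vertex's in-degree; it assumes a property graph `graph: Graph[VD, ED]` already exists, e.g. in a spark-shell session with GraphX on the classpath.

```scala
import org.apache.spark.graphx._

// `graph` is assumed to exist already (any Graph[VD, ED] will do).
val inDegrees: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1), // every edge sends a count of 1 to its destination vertex
  (a, b) => a + b)         // counts arriving at the same vertex are summed

inDegrees.take(5).foreach(println)
```
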
@@ -174,7 +112,7 @@ identifiers.
 The property graph is parameterized over the vertex (`VD`) and edge (`ED`) types. These
 are the types of the objects associated with each vertex and edge respectively.
 
-> GraphX optimizes the representation of vertex and edge types when they are plain old data types
+> GraphX optimizes the representation of vertex and edge types when they are primitive data types
 > (e.g., int, double, etc...) reducing the in memory footprint by storing them in specialized
 > arrays.
 
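
To make the `VD`/`ED` parameters above concrete, here is a minimal, hypothetical construction sketch (not part of the commit) with `String` vertex properties and `Int` edge properties; a SparkContext `sc` is assumed, as in a spark-shell session.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical toy data: VD = String, ED = Int.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val links: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 7), Edge(2L, 3L, 2)))

// Graph[VD, ED] = Graph[String, Int]; the third argument is the default
// vertex property used for edge endpoints that have no entry in `users`.
val graph: Graph[String, Int] = Graph(users, links, "unknown")

println(s"${graph.vertices.count()} vertices, ${graph.edges.count()} edges")
```
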
@@ -791,14 +729,13 @@ Graphs are inherently recursive data structures as properties of vertices depend
 their neighbors which in turn depend on properties of *their* neighbors. As a
 consequence many important graph algorithms iteratively recompute the properties of each vertex
 until a fixed-point condition is reached. A range of graph-parallel abstractions have been proposed
-to express these iterative algorithms. GraphX exposes a Pregel-like operator which is a fusion of
-the widely used Pregel and GraphLab abstractions.
+to express these iterative algorithms. GraphX exposes a variant of the Pregel API.
 
 At a high level the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction
 *constrained to the topology of the graph*. The Pregel operator executes in a series of super steps
 in which vertices receive the *sum* of their inbound messages from the previous super step, compute
 a new value for the vertex property, and then send messages to neighboring vertices in the next
-super step. Unlike Pregel and instead more like GraphLab messages are computed in parallel as a
+super step. Unlike Pregel, messages are computed in parallel as a
 function of the edge triplet and the message computation has access to both the source and
 destination vertex attributes. Vertices that do not receive a message are skipped within a super
 step. The Pregel operators terminates iteration and returns the final graph when there are no
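
The messaging model described in this hunk is easiest to see in the guide's single-source shortest paths example; the condensed sketch below (not part of the commit) assumes an existing graph with `Double` edge weights, `graph: Graph[_, Double]`, and a chosen source vertex id.

```scala
import org.apache.spark.graphx._

val sourceId: VertexId = 1L // placeholder source vertex

// Initialize distances: 0 at the source, infinity everywhere else.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program: keep the shorter distance
  triplet =>                                      // send improved distances along edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    },
  (a, b) => math.min(a, b))                       // merge messages: take the minimum

sssp.vertices.take(5).foreach(println)
```
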
(deleted binary image, -423 KB; file not shown)

(deleted binary image, -417 KB; file not shown)

docs/img/tables_and_graphs.png

(deleted binary image, -162 KB; file not shown)

examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostedTrees.java
renamed to examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostedTreesRunner.java

Lines changed: 9 additions & 9 deletions

@@ -27,18 +27,18 @@
 import org.apache.spark.api.java.function.Function2;
 import org.apache.spark.api.java.function.PairFunction;
 import org.apache.spark.mllib.regression.LabeledPoint;
-import org.apache.spark.mllib.tree.GradientBoosting;
+import org.apache.spark.mllib.tree.GradientBoostedTrees;
 import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
-import org.apache.spark.mllib.tree.model.WeightedEnsembleModel;
+import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;
 import org.apache.spark.mllib.util.MLUtils;
 
 /**
  * Classification and regression using gradient-boosted decision trees.
  */
-public final class JavaGradientBoostedTrees {
+public final class JavaGradientBoostedTreesRunner {
 
   private static void usage() {
-    System.err.println("Usage: JavaGradientBoostedTrees <libsvm format data file>" +
+    System.err.println("Usage: JavaGradientBoostedTreesRunner <libsvm format data file>" +
       " <Classification/Regression>");
     System.exit(-1);
   }

@@ -55,7 +55,7 @@ public static void main(String[] args) {
     if (args.length > 2) {
       usage();
     }
-    SparkConf sparkConf = new SparkConf().setAppName("JavaGradientBoostedTrees");
+    SparkConf sparkConf = new SparkConf().setAppName("JavaGradientBoostedTreesRunner");
     JavaSparkContext sc = new JavaSparkContext(sparkConf);
 
     JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD().cache();

@@ -64,7 +64,7 @@ public static void main(String[] args) {
     // Note: All features are treated as continuous.
     BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams(algo);
     boostingStrategy.setNumIterations(10);
-    boostingStrategy.weakLearnerParams().setMaxDepth(5);
+    boostingStrategy.treeStrategy().setMaxDepth(5);
 
     if (algo.equals("Classification")) {
       // Compute the number of classes from the data.

@@ -73,10 +73,10 @@ public static void main(String[] args) {
           return p.label();
         }
       }).countByValue().size();
-      boostingStrategy.setNumClassesForClassification(numClasses); // ignored for Regression
+      boostingStrategy.treeStrategy().setNumClassesForClassification(numClasses);
 
       // Train a GradientBoosting model for classification.
-      final WeightedEnsembleModel model = GradientBoosting.trainClassifier(data, boostingStrategy);
+      final GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
 
       // Evaluate model on training instances and compute training error
       JavaPairRDD<Double, Double> predictionAndLabel =

@@ -95,7 +95,7 @@ public static void main(String[] args) {
       System.out.println("Learned classification tree model:\n" + model);
     } else if (algo.equals("Regression")) {
       // Train a GradientBoosting model for classification.
-      final WeightedEnsembleModel model = GradientBoosting.trainRegressor(data, boostingStrategy);
+      final GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
 
       // Evaluate model on training instances and compute training error
       JavaPairRDD<Double, Double> predictionAndLabel =

examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala

Lines changed: 4 additions & 14 deletions

@@ -22,11 +22,11 @@ import scopt.OptionParser
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.SparkContext._
 import org.apache.spark.mllib.evaluation.MulticlassMetrics
+import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.regression.LabeledPoint
-import org.apache.spark.mllib.tree.{RandomForest, DecisionTree, impurity}
+import org.apache.spark.mllib.tree.{DecisionTree, RandomForest, impurity}
 import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.configuration.Algo._
-import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, DecisionTreeModel}
 import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.rdd.RDD
 import org.apache.spark.util.Utils

@@ -349,24 +349,14 @@ object DecisionTreeRunner {
     sc.stop()
   }
 
-  /**
-   * Calculates the mean squared error for regression.
-   */
-  private def meanSquaredError(tree: DecisionTreeModel, data: RDD[LabeledPoint]): Double = {
-    data.map { y =>
-      val err = tree.predict(y.features) - y.label
-      err * err
-    }.mean()
-  }
-
   /**
    * Calculates the mean squared error for regression.
    */
   private[mllib] def meanSquaredError(
-      tree: WeightedEnsembleModel,
+      model: { def predict(features: Vector): Double },
       data: RDD[LabeledPoint]): Double = {
     data.map { y =>
-      val err = tree.predict(y.features) - y.label
+      val err = model.predict(y.features) - y.label
       err * err
     }.mean()
   }
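
The signature change above swaps a concrete model class for a Scala structural type, so `meanSquaredError` now accepts anything with a matching `predict` method (a single tree or an ensemble). A small self-contained sketch of the same pattern, using a hypothetical toy model rather than the MLlib classes:

```scala
import scala.language.reflectiveCalls

object StructuralPredictSketch {
  // Any value exposing a matching `predict` method satisfies this structural type.
  type Predictor = { def predict(features: Array[Double]): Double }

  // Hypothetical toy model, for illustration only.
  class ConstantModel(value: Double) {
    def predict(features: Array[Double]): Double = value
  }

  def meanSquaredError(model: Predictor, data: Seq[(Array[Double], Double)]): Double = {
    val errors = data.map { case (features, label) =>
      val err = model.predict(features) - label
      err * err
    }
    errors.sum / errors.size
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((Array(1.0, 2.0), 3.0), (Array(0.0, 1.0), 1.0))
    println(meanSquaredError(new ConstantModel(2.0), data)) // prints 1.0
  }
}
```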

examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostedTrees.scala
renamed to examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostedTreesRunner.scala

Lines changed: 9 additions & 9 deletions

@@ -21,21 +21,21 @@ import scopt.OptionParser
 
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.mllib.evaluation.MulticlassMetrics
-import org.apache.spark.mllib.tree.GradientBoosting
+import org.apache.spark.mllib.tree.GradientBoostedTrees
 import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, Algo}
 import org.apache.spark.util.Utils
 
 /**
  * An example runner for Gradient Boosting using decision trees as weak learners. Run with
  * {{{
- * ./bin/run-example org.apache.spark.examples.mllib.GradientBoostedTrees [options]
+ * ./bin/run-example mllib.GradientBoostedTreesRunner [options]
  * }}}
  * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
  *
  * Note: This script treats all features as real-valued (not categorical).
  * To include categorical features, modify categoricalFeaturesInfo.
  */
-object GradientBoostedTrees {
+object GradientBoostedTreesRunner {
 
   case class Params(
       input: String = null,

@@ -93,24 +93,24 @@ object GradientBoostedTrees {
 
   def run(params: Params) {
 
-    val conf = new SparkConf().setAppName(s"GradientBoostedTrees with $params")
+    val conf = new SparkConf().setAppName(s"GradientBoostedTreesRunner with $params")
     val sc = new SparkContext(conf)
 
-    println(s"GradientBoostedTrees with parameters:\n$params")
+    println(s"GradientBoostedTreesRunner with parameters:\n$params")
 
     // Load training and test data and cache it.
     val (training, test, numClasses) = DecisionTreeRunner.loadDatasets(sc, params.input,
       params.dataFormat, params.testInput, Algo.withName(params.algo), params.fracTest)
 
     val boostingStrategy = BoostingStrategy.defaultParams(params.algo)
-    boostingStrategy.numClassesForClassification = numClasses
+    boostingStrategy.treeStrategy.numClassesForClassification = numClasses
     boostingStrategy.numIterations = params.numIterations
-    boostingStrategy.weakLearnerParams.maxDepth = params.maxDepth
+    boostingStrategy.treeStrategy.maxDepth = params.maxDepth
 
     val randomSeed = Utils.random.nextInt()
     if (params.algo == "Classification") {
       val startTime = System.nanoTime()
-      val model = GradientBoosting.trainClassifier(training, boostingStrategy)
+      val model = GradientBoostedTrees.train(training, boostingStrategy)
       val elapsedTime = (System.nanoTime() - startTime) / 1e9
       println(s"Training time: $elapsedTime seconds")
       if (model.totalNumNodes < 30) {

@@ -127,7 +127,7 @@ object GradientBoostedTrees {
       println(s"Test accuracy = $testAccuracy")
     } else if (params.algo == "Regression") {
       val startTime = System.nanoTime()
-      val model = GradientBoosting.trainRegressor(training, boostingStrategy)
+      val model = GradientBoostedTrees.train(training, boostingStrategy)
       val elapsedTime = (System.nanoTime() - startTime) / 1e9
       println(s"Training time: $elapsedTime seconds")
       if (model.totalNumNodes < 30) {
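
Pulling together the renames shown in the Java and Scala runners above, this is roughly how the updated API is driven end to end. It is a hedged sketch rather than part of the commit; the dataset path is a placeholder and the exact setters should be checked against the MLlib docs for the Spark version in use.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

object GradientBoostedTreesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GradientBoostedTreesSketch"))

    // Placeholder path to a LIBSVM-format dataset.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()

    // Tree-level parameters now live on treeStrategy (formerly weakLearnerParams).
    val boostingStrategy = BoostingStrategy.defaultParams("Classification")
    boostingStrategy.numIterations = 10
    boostingStrategy.treeStrategy.maxDepth = 5
    boostingStrategy.treeStrategy.numClassesForClassification = 2

    // A single train() entry point replaces trainClassifier/trainRegressor.
    val model: GradientBoostedTreesModel = GradientBoostedTrees.train(data, boostingStrategy)
    println(s"Learned GBT model with ${model.totalNumNodes} nodes")

    sc.stop()
  }
}
```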

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

Lines changed: 36 additions & 2 deletions

@@ -40,10 +40,10 @@ import org.apache.spark.mllib.regression._
 import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
 import org.apache.spark.mllib.stat.correlation.CorrelationNames
 import org.apache.spark.mllib.stat.test.ChiSqTestResult
-import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.{RandomForest, DecisionTree}
 import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.impurity._
-import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.mllib.tree.model.{RandomForestModel, DecisionTreeModel}
 import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.rdd.RDD
 import org.apache.spark.storage.StorageLevel

@@ -499,6 +499,40 @@ class PythonMLLibAPI extends Serializable {
     DecisionTree.train(data.rdd, strategy)
   }
 
+  /**
+   * Java stub for Python mllib RandomForest.train().
+   * This stub returns a handle to the Java object instead of the content of the Java object.
+   * Extra care needs to be taken in the Python code to ensure it gets freed on exit;
+   * see the Py4J documentation.
+   */
+  def trainRandomForestModel(
+      data: JavaRDD[LabeledPoint],
+      algoStr: String,
+      numClasses: Int,
+      categoricalFeaturesInfo: JMap[Int, Int],
+      numTrees: Int,
+      featureSubsetStrategy: String,
+      impurityStr: String,
+      maxDepth: Int,
+      maxBins: Int,
+      seed: Int): RandomForestModel = {
+
+    val algo = Algo.fromString(algoStr)
+    val impurity = Impurities.fromString(impurityStr)
+    val strategy = new Strategy(
+      algo = algo,
+      impurity = impurity,
+      maxDepth = maxDepth,
+      numClassesForClassification = numClasses,
+      maxBins = maxBins,
+      categoricalFeaturesInfo = categoricalFeaturesInfo.asScala.toMap)
+    if (algo == Algo.Classification) {
+      RandomForest.trainClassifier(data.rdd, strategy, numTrees, featureSubsetStrategy, seed)
+    } else {
+      RandomForest.trainRegressor(data.rdd, strategy, numTrees, featureSubsetStrategy, seed)
+    }
+  }
+
   /**
    * Java stub for mllib Statistics.colStats(X: RDD[Vector]).
    * TODO figure out return type.
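
The new stub above just builds a `Strategy` from the Python-side arguments and forwards to `RandomForest.trainClassifier` or `trainRegressor`. A hedged sketch of the equivalent direct Scala call (dataset path and parameter values are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini
import org.apache.spark.mllib.util.MLUtils

object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RandomForestSketch"))

    // Placeholder path to a LIBSVM-format dataset.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()

    // Mirrors the Strategy the Python stub constructs from its arguments.
    val strategy = new Strategy(
      algo = Algo.Classification,
      impurity = Gini,
      maxDepth = 4,
      numClassesForClassification = 2,
      maxBins = 32,
      categoricalFeaturesInfo = Map.empty[Int, Int])

    // numTrees = 10, featureSubsetStrategy = "auto", seed = 42
    val model = RandomForest.trainClassifier(data, strategy, 10, "auto", 42)
    println("Learned classification forest model:\n" + model)

    sc.stop()
  }
}
```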
