
Commit e32eb40

Merge branch 'master' into SPARK-6263
2 parents: 25d3c9d + 47c1d56

File tree: 9 files changed (+118, -36 lines)


docs/ml-guide.md

Lines changed: 19 additions & 13 deletions
```diff
@@ -3,17 +3,23 @@ layout: global
 title: Spark ML Programming Guide
 ---
 
-`spark.ml` is a new package introduced in Spark 1.2, which aims to provide a uniform set of
+Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
 high-level APIs that help users create and tune practical machine learning pipelines.
-It is currently an alpha component, and we would like to hear back from the community about
-how it fits real-world use cases and how it could be improved.
+
+*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.
 
 Note that we will keep supporting and adding features to `spark.mllib` along with the
 development of `spark.ml`.
 Users should be comfortable using `spark.mllib` features and expect more features coming.
 Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
 to `spark.ml`.
 
+Guides for sub-packages of `spark.ml` include:
+
+* [Feature Extraction, Transformation, and Selection](ml-features.html): Details on transformers supported in the Pipelines API, including a few not in the lower-level `spark.mllib` API
+* [Ensembles](ml-ensembles.html): Details on ensemble learning methods in the Pipelines API
+
+
 **Table of Contents**
 
 * This will become a table of contents (this text will be scraped).
@@ -148,16 +154,6 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s.
 For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
 This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
 
-# Algorithm Guides
-
-There are now several algorithms in the Pipelines API which are not in the lower-level MLlib API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in the Pipelines.
-
-**Pipelines API Algorithm Guides**
-
-* [Feature Extraction, Transformation, and Selection](ml-features.html)
-* [Ensembles](ml-ensembles.html)
-
-
 # Code Examples
 
 This section gives code examples illustrating the functionality discussed above.
@@ -783,6 +779,16 @@ Spark ML also depends upon Spark SQL, but the relevant parts of Spark SQL do not
 
 # Migration Guide
 
+## From 1.3 to 1.4
+
+Several major API changes occurred, including:
+* `Param` and other APIs for specifying parameters
+* `uid` unique IDs for Pipeline components
+* Reorganization of certain classes
+Since the `spark.ml` API was an Alpha Component in Spark 1.3, we do not list all changes here.
+
+However, now that `spark.ml` is no longer an Alpha Component, we will provide details on any API changes for future releases.
+
 ## From 1.2 to 1.3
 
 The main API changes are from Spark SQL. We list the most important changes here:
```
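The unchanged context in the second hunk describes building a `ParamMap` that carries different `maxIter` values for two `LogisticRegression` instances. A minimal Scala sketch of that usage, assuming a `training` DataFrame of labeled feature vectors already exists (the name is a placeholder, not from this commit):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()

// A single ParamMap can hold parameters for several pipeline components;
// each entry is keyed by the owning instance, so the two maxIter settings do not clash.
val paramMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)

// `training` is a hypothetical DataFrame of (label, features) rows.
val model1 = lr1.fit(training, paramMap)  // trained with maxIter = 10
val model2 = lr2.fit(training, paramMap)  // trained with maxIter = 20
```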

docs/mllib-feature-extraction.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -576,8 +576,9 @@ parsedData = data.map(lambda x: [float(t) for t in x.split(" ")])
 transformingVector = Vectors.dense([0.0, 1.0, 2.0])
 transformer = ElementwiseProduct(transformingVector)
 
-# Batch transform.
+# Batch transform
 transformedData = transformer.transform(parsedData)
+# Single-row transform
 transformedData2 = transformer.transform(parsedData.first())
 
 {% endhighlight %}
```
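The Python snippet above distinguishes a batch transform over an RDD from a single-row transform of one vector. The same distinction holds for the Scala `ElementwiseProduct`; a small sketch, assuming `sc` is an existing `SparkContext`:

```scala
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors

// Scaling vector, analogous to `transformingVector` in the Python example above.
val transformingVector = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct(transformingVector)

val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0, 3.0), Vectors.dense(4.0, 5.0, 6.0)))

// Batch transform: the element-wise product is applied to every vector in the RDD.
val transformedData = transformer.transform(data)
// Single-row transform: applied to one vector.
val transformedRow = transformer.transform(data.first())
```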

docs/mllib-guide.md

Lines changed: 28 additions & 19 deletions
```diff
@@ -7,7 +7,19 @@ description: MLlib machine learning library overview for Spark SPARK_VERSION_SHO
 
 MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
 including classification, regression, clustering, collaborative
-filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
+filtering, dimensionality reduction, as well as underlying optimization primitives.
+Guides for individual algorithms are listed below.
+
+The API is divided into 2 parts:
+
+* [The original `spark.mllib` API](mllib-guide.html#mllib-types-algorithms-and-utilities) is the primary API.
+* [The "Pipelines" `spark.ml` API](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) is a higher-level API for constructing ML workflows.
+
+We list major functionality from both below, with links to detailed guides.
+
+# MLlib types, algorithms and utilities
+
+This lists functionality included in `spark.mllib`, the main MLlib API.
 
 * [Data types](mllib-data-types.html)
 * [Basic statistics](mllib-statistics.html)
@@ -49,16 +61,20 @@ and the migration guide below will explain all changes between releases.
 
 Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
 high-level APIs that help users create and tune practical machine learning pipelines.
-It is currently an alpha component, and we would like to hear back from the community about
-how it fits real-world use cases and how it could be improved.
+
+*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.
 
 Note that we will keep supporting and adding features to `spark.mllib` along with the
 development of `spark.ml`.
 Users should be comfortable using `spark.mllib` features and expect more features coming.
 Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
 to `spark.ml`.
 
-See the **[spark.ml programming guide](ml-guide.html)** for more information on this package.
+More detailed guides for `spark.ml` include:
+
+* **[spark.ml programming guide](ml-guide.html)**: overview of the Pipelines API and major concepts
+* [Feature transformers](ml-features.html): Details on transformers supported in the Pipelines API, including a few not in the lower-level `spark.mllib` API
+* [Ensembles](ml-ensembles.html): Details on ensemble learning methods in the Pipelines API
 
 # Dependencies
 
@@ -90,21 +106,14 @@ version 1.4 or newer.
 
 For the `spark.ml` package, please see the [spark.ml Migration Guide](ml-guide.html#migration-guide).
 
-## From 1.2 to 1.3
-
-In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
-
-* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed.
-* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
-* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component. In it, there were two changes:
-  * The constructor taking arguments was removed in favor of a builder patten using the default constructor plus parameter setter methods.
-  * Variable `model` is no longer public.
-* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component. In it and its associated classes, there were several changes:
-  * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
-  * In `Strategy`, the `checkpointDir` parameter has been removed. Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
-* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`. This was never meant for external use.
-* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
-  So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
+## From 1.3 to 1.4
+
+In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
+
+* Gradient-Boosted Trees
+  * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
+  * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
+* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
 
 ## Previous Spark Versions
```
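For the `LDA.run` change noted above, callers that relied on the concrete result type can still downcast. A hedged sketch, assuming a hypothetical `corpus` RDD of (document id, term-count vector) pairs and the default EM optimizer:

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LDAModel}

// `corpus` is a hypothetical RDD[(Long, Vector)] of (document id, term counts).
val ldaModel: LDAModel = new LDA().setK(10).run(corpus)

// With the EM optimizer the returned model is distributed, so the cast below
// recovers the concrete type; for other optimizers the concrete class differs.
val distributedModel = ldaModel.asInstanceOf[DistributedLDAModel]
```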

docs/mllib-migration-guides.md

Lines changed: 16 additions & 0 deletions
```diff
@@ -7,6 +7,22 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
 
 The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).
 
+## From 1.2 to 1.3
+
+In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
+
+* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed.
+* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
+* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component. In it, there were two changes:
+  * The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods.
+  * Variable `model` is no longer public.
+* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component. In it and its associated classes, there were several changes:
+  * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
+  * In `Strategy`, the `checkpointDir` parameter has been removed. Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
+* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`. This was never meant for external use.
+* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
+  So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
+
 ## From 1.1 to 1.2
 
 The only API changes in MLlib v1.2 are in
```
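For the `StandardScalerModel` change above (`variance` replaced by `std`), the old values can be recovered by squaring, as the guide suggests. A tiny sketch, assuming `scalerModel` is a fitted `StandardScalerModel`:

```scala
// Squaring each per-column standard deviation reproduces the column variances
// that the removed `variance` method used to return.
val variances = scalerModel.std.toArray.map(s => s * s)
```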

mllib/src/main/scala/org/apache/spark/ml/attribute/attributes.scala

Lines changed: 2 additions & 2 deletions
```diff
@@ -20,7 +20,7 @@ package org.apache.spark.ml.attribute
 import scala.annotation.varargs
 
 import org.apache.spark.annotation.DeveloperApi
-import org.apache.spark.sql.types.{DoubleType, Metadata, MetadataBuilder, StructField}
+import org.apache.spark.sql.types.{DoubleType, NumericType, Metadata, MetadataBuilder, StructField}
 
 /**
  * :: DeveloperApi ::
@@ -127,7 +127,7 @@ private[attribute] trait AttributeFactory {
    * Creates an [[Attribute]] from a [[StructField]] instance.
    */
   def fromStructField(field: StructField): Attribute = {
-    require(field.dataType == DoubleType)
+    require(field.dataType.isInstanceOf[NumericType])
     val metadata = field.metadata
     val mlAttr = AttributeKeys.ML_ATTR
     if (metadata.contains(mlAttr)) {
```

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

Lines changed: 10 additions & 0 deletions
```diff
@@ -528,6 +528,16 @@ private[python] class PythonMLLibAPI extends Serializable {
     new ChiSqSelector(numTopFeatures).fit(data.rdd)
   }
 
+  /**
+   * Java stub for PCA.fit(). This stub returns a
+   * handle to the Java object instead of the content of the Java object.
+   * Extra care needs to be taken in the Python code to ensure it gets freed on
+   * exit; see the Py4J documentation.
+   */
+  def fitPCA(k: Int, data: JavaRDD[Vector]): PCAModel = {
+    new PCA(k).fit(data.rdd)
+  }
+
   /**
    * Java stub for IDF.fit(). This stub returns a
    * handle to the Java object instead of the content of the Java object.
```
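The `fitPCA` stub above simply forwards to the Scala `spark.mllib` PCA implementation. For reference, a direct Scala sketch of the same call, assuming a hypothetical `vectors` RDD of feature vectors already exists:

```scala
import org.apache.spark.mllib.feature.PCA

// `vectors` is a hypothetical RDD[Vector], e.g. built with sc.parallelize(...).
val pcaModel = new PCA(2).fit(vectors)                // same call the Python stub makes on data.rdd
val projected = pcaModel.transform(vectors.first())  // project one vector onto the top 2 principal components
```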

mllib/src/test/scala/org/apache/spark/ml/attribute/AttributeSuite.scala

Lines changed: 5 additions & 0 deletions
```diff
@@ -215,5 +215,10 @@ class AttributeSuite extends SparkFunSuite {
     assert(Attribute.fromStructField(fldWithoutMeta) == UnresolvedAttribute)
     val fldWithMeta = new StructField("x", DoubleType, false, metadata)
     assert(Attribute.fromStructField(fldWithMeta).isNumeric)
+    // Attribute.fromStructField should accept any NumericType, not just DoubleType
+    val longFldWithMeta = new StructField("x", LongType, false, metadata)
+    assert(Attribute.fromStructField(longFldWithMeta).isNumeric)
+    val decimalFldWithMeta = new StructField("x", DecimalType(None), false, metadata)
+    assert(Attribute.fromStructField(decimalFldWithMeta).isNumeric)
   }
 }
```

python/pyspark/mllib/feature.py

Lines changed: 35 additions & 0 deletions
```diff
@@ -252,6 +252,41 @@ def fit(self, data):
         return ChiSqSelectorModel(jmodel)
 
 
+class PCAModel(JavaVectorTransformer):
+    """
+    Model fitted by [[PCA]] that can project vectors to a low-dimensional space using PCA.
+    """
+
+
+class PCA(object):
+    """
+    A feature transformer that projects vectors to a low-dimensional space using PCA.
+
+    >>> data = [Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),
+    ...     Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),
+    ...     Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0])]
+    >>> model = PCA(2).fit(sc.parallelize(data))
+    >>> pcArray = model.transform(Vectors.sparse(5, [(1, 1.0), (3, 7.0)])).toArray()
+    >>> pcArray[0]
+    1.648...
+    >>> pcArray[1]
+    -4.013...
+    """
+    def __init__(self, k):
+        """
+        :param k: number of principal components.
+        """
+        self.k = int(k)
+
+    def fit(self, data):
+        """
+        Computes a [[PCAModel]] that contains the principal components of the input vectors.
+        :param data: source vectors
+        """
+        jmodel = callMLlibFunc("fitPCA", self.k, data)
+        return PCAModel(jmodel)
+
+
 class HashingTF(object):
     """
     .. note:: Experimental
```

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -653,7 +653,7 @@ class SQLQuerySuite extends QueryTest {
       .queryExecution.toRdd.count())
   }
 
-  test("test script transform for stderr") {
+  ignore("test script transform for stderr") {
     val data = (1 to 100000).map { i => (i, i, i) }
     data.toDF("d1", "d2", "d3").registerTempTable("script_trans")
     assert(0 ===
```
