[SPARK-8531] [ML] Update ML user guide for MinMaxScaler #7211
Changes from 2 commits
```
@@ -865,6 +865,7 @@ val scaledData = scalerModel.transform(dataFrame)
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.ml.feature.StandardScalerModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.DataFrame;
```
```
@@ -905,6 +906,74 @@ scaledData = scalerModel.transform(dataFrame)
</div>
</div>

## MinMaxScaler

`MinMaxScaler` transforms a dataset of `Vector` rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:

* `min`: 0.0 by default. Lower bound after transformation, shared by all features.
* `max`: 1.0 by default. Upper bound after transformation, shared by all features.

`MinMaxScaler` computes summary statistics on a data set and produces a `MinMaxScalerModel`. The model can then transform each feature individually such that it is in the given range.

The rescaled value for a feature E is calculated as,

Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
```
Member: Could you please make this render as LaTeX? You can follow the examples in mllib-linear-methods.md. Same for the other equations below.

Contributor (Author): Sure, do you mean adding
```
For the case E_{max} == E_{min}, Rescaled(e_i) = 0.5 * (max + min)

Note that since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
```
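The rescaling formula in the diff above, including the constant-feature special case, can be checked with a small standalone sketch that needs no Spark at all. The class and method names here are illustrative, not part of the Spark API:

```java
// Standalone sketch of the MinMaxScaler rescaling formula described above.
public class MinMaxRescale {

    // Rescale value e from its observed feature range [eMin, eMax]
    // to the target range [min, max].
    static double rescale(double e, double eMin, double eMax,
                          double min, double max) {
        if (eMax == eMin) {
            // Constant feature: the formula's denominator is zero,
            // so the value is mapped to the midpoint of the target range.
            return 0.5 * (max + min);
        }
        return (e - eMin) / (eMax - eMin) * (max - min) + min;
    }

    public static void main(String[] args) {
        // 2.0 sits halfway through [0, 4], so it maps to 0.5 in [0, 1].
        System.out.println(rescale(2.0, 0.0, 4.0, 0.0, 1.0)); // 0.5
        // A constant feature maps to the midpoint of the target range.
        System.out.println(rescale(3.0, 3.0, 3.0, 0.0, 1.0)); // 0.5
    }
}
```

With the defaults `min = 0.0` and `max = 1.0`, the expression reduces to the familiar `(e - eMin) / (eMax - eMin)`.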
```
More details can be found in the API docs for
[MinMaxScaler](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and
```
Member: Could you please put this in the Scala code tab like this: And then could you also please add a reference to the other APIs under those code tabs? Thanks! (I'm trying to follow this pattern nowadays.)
```
[MinMaxScalerModel](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel).

The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [0, 1].

<div class="codetabs">
<div data-lang="scala">
```
Member: The links are not generated correctly, but you can fix it by modifying this line:
```
{% highlight scala %}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val dataFrame = sqlContext.createDataFrame(data)
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Compute summary statistics by fitting the StandardScaler
```
Member: Please update the inline comments.
```
val scalerModel = scaler.fit(dataFrame)

// Normalize each feature to have unit standard deviation.
```
Member: Please update the inline comments.
```
val scaledData = scalerModel.transform(dataFrame)
{% endhighlight %}
</div>

<div data-lang="java">
```
Member: same here (add
```
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.MinMaxScaler;
import org.apache.spark.ml.feature.MinMaxScalerModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.DataFrame;

JavaRDD<LabeledPoint> data =
  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();
DataFrame dataFrame = jsql.createDataFrame(data, LabeledPoint.class);
MinMaxScaler scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures");

// Compute summary statistics by fitting the StandardScaler
```
Member: Please update the inline comments.
```
MinMaxScalerModel scalerModel = scaler.fit(dataFrame);

// Normalize each feature to have unit standard deviation.
```
Member: Please update the inline comments.
```
DataFrame scaledData = scalerModel.transform(dataFrame);
{% endhighlight %}
</div>
</div>

## Bucketizer

`Bucketizer` transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users. It takes a parameter:
```
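The `Bucketizer` parameter list is cut off in the diff above. Assuming the split-point style of bucketing described in the spark.ml docs (bucket i covers the half-open interval [splits[i], splits[i+1]), with the last bucket also including its upper bound), the mapping can be sketched standalone; the class and method names are illustrative, not Spark API:

```java
import java.util.Arrays;

// Standalone sketch of split-based bucketing, assuming the half-open-interval
// convention described above. Not the actual spark.ml implementation.
public class BucketizeSketch {

    // Map x to a bucket index given strictly ascending split points.
    static int bucketize(double x, double[] splits) {
        if (x < splits[0] || x > splits[splits.length - 1]) {
            throw new IllegalArgumentException("value " + x + " is outside the splits range");
        }
        if (x == splits[splits.length - 1]) {
            return splits.length - 2; // last bucket is closed on the right
        }
        int idx = Arrays.binarySearch(splits, x);
        // For an exact split hit, x starts its own bucket; otherwise
        // binarySearch returns -(insertionPoint) - 1, and the bucket
        // is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        double[] splits = {Double.NEGATIVE_INFINITY, -0.5, 0.0, 0.5,
                           Double.POSITIVE_INFINITY};
        System.out.println(bucketize(-0.8, splits)); // 0
        System.out.println(bucketize(0.2, splits));  // 2
    }
}
```

Using infinite outer splits, as in this sketch, avoids the out-of-range error for unbounded features.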
good catch!