Merged
110 commits
65fe902
[SPARK-19598][SQL] Remove the alias parameter in UnresolvedRelation
windpiger Feb 20, 2017
776b8f1
[SPARK-19563][SQL] avoid unnecessary sort in FileFormatWriter
cloud-fan Feb 20, 2017
d0ecca6
[SPARK-19646][CORE][STREAMING] binaryRecords replicates records in sc…
srowen Feb 20, 2017
ead4ba0
[SPARK-15453][SQL][FOLLOW-UP] FileSourceScanExec to extract `outputOr…
gatorsmile Feb 20, 2017
0733a54
[SPARK-19669][SQL] Open up visibility for sharedState, sessionState, …
rxin Feb 20, 2017
73f0655
[SPARK-19669][HOTFIX][SQL] sessionState access privileges compiled fa…
windpiger Feb 21, 2017
3394191
[SPARK-19508][CORE] Improve error message when binding service fails
viirya Feb 21, 2017
17b93b5
[SPARK-18922][TESTS] Fix new test failures on Windows due to path and…
HyukjinKwon Feb 21, 2017
280afe0
[SPARK-19337][ML][DOC] Documentation and examples for LinearSVC
YY-OnCall Feb 21, 2017
7363dde
[SPARK-19626][YARN] Using the correct config to set credentials updat…
yaooqinn Feb 21, 2017
17d83e1
[SPARK-19652][UI] Do auth checks for REST API access.
Feb 22, 2017
1a45d2b
[SPARK-19670][SQL][TEST] Enable Bucketed Table Reading and Writing Te…
gatorsmile Feb 22, 2017
ef3c735
[SPARK-19694][ML] Add missing 'setTopicDistributionCol' for LDAModel
zhengruifeng Feb 22, 2017
bf7bb49
[SPARK-19679][ML] Destroy broadcasted object without blocking
zhengruifeng Feb 22, 2017
10c566c
[SPARK-13721][SQL] Make GeneratorOuter unresolved.
bogdanrdc Feb 22, 2017
e406537
[SPARK-19405][STREAMING] Support for cross-account Kinesis reads via STS
Feb 22, 2017
1f86e79
[SPARK-19616][SPARKR] weightCol and aggregationDepth should be improv…
wangmiao1981 Feb 22, 2017
37112fc
[SPARK-19666][SQL] Skip a property without getter in Java schema infe…
HyukjinKwon Feb 22, 2017
4661d30
[SPARK-19554][UI,YARN] Allow SHS URL to be used for tracking in YARN RM.
Feb 22, 2017
dc005ed
[SPARK-19658][SQL] Set NumPartitions of RepartitionByExpression In Pa…
gatorsmile Feb 23, 2017
d314750
[SPARK-15615][SQL] Add an API to load DataFrame from Dataset[String] …
Feb 23, 2017
66c4b79
[SPARK-16122][CORE] Add rest api for job environment
uncleGen Feb 23, 2017
769aa0f
[SPARK-19695][SQL] Throw an exception if a `columnNameOfCorruptRecord…
maropu Feb 23, 2017
93aa427
[SPARK-19691][SQL] Fix ClassCastException when calculating percentile…
maropu Feb 23, 2017
78eae7e
[SPARK-19459] Support for nested char/varchar fields in ORC
hvanhovell Feb 23, 2017
7bf0943
[SPARK-19682][SPARKR] Issue warning (or error) when subset method "[[…
actuaryzhang Feb 23, 2017
9bf4e2b
[SPARK-19497][SS] Implement streaming deduplication
zsxwing Feb 23, 2017
09ed6e7
[SPARK-18699][SQL] Put malformed tokens into a new field when parsing…
maropu Feb 23, 2017
4fa4cf1
[SPARK-19706][PYSPARK] add Column.contains in pyspark
cloud-fan Feb 23, 2017
f87a6a5
[SPARK-19684][DOCS] Remove developer info from docs.
kayousterhout Feb 23, 2017
eff7b40
[SPARK-19674][SQL] Ignore driver accumulator updates don't belong to …
carsonwang Feb 23, 2017
d027624
[SPARK-16122][DOCS] application environment rest api
uncleGen Feb 24, 2017
2f69e3f
[SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala imp…
BryanCutler Feb 24, 2017
d7e43b6
[SPARK-17075][SQL] implemented filter estimation
ron8hu Feb 24, 2017
8f33731
[SPARK-19664][SQL] put hive.metastore.warehouse.dir in hadoopconf to …
windpiger Feb 24, 2017
4a5e38f
[SPARK-19161][PYTHON][SQL] Improving UDF Docstrings
zero323 Feb 24, 2017
b0a8c16
[SPARK-19707][CORE] Improve the invalid path check for sc.addJar
jerryshao Feb 24, 2017
a920a43
[SPARK-19038][YARN] Avoid overwriting keytab configuration in yarn-cl…
jerryshao Feb 24, 2017
3e40f6c
[SPARK-17495][SQL] Add more tests for hive hash
tejasapatil Feb 24, 2017
05954f3
[SPARK-17075][SQL] Follow up: fix file line ending and improve the tests
lins05 Feb 24, 2017
69d0da6
[SPARK-17078][SQL] Show stats when explain
Feb 24, 2017
5cbd3b5
[SPARK-19560] Improve DAGScheduler tests.
kayousterhout Feb 24, 2017
5f74148
[SPARK-19597][CORE] test case for task deserialization errors
squito Feb 24, 2017
330c3e3
[SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated to python worker
zjffdu Feb 24, 2017
fa7c582
[SPARK-15355][CORE] Proactive block replication
shubhamchopra Feb 24, 2017
1b9ba25
[MINOR][DOCS] Fix few typos in structured streaming doc
Feb 25, 2017
4cb025a
[SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs
gatorsmile Feb 25, 2017
8f0511e
[SPARK-19650] Commands should not trigger a Spark job
hvanhovell Feb 25, 2017
061bcfb
[MINOR][DOCS] Fixes two problems in the SQL programing guide page
boazmohar Feb 25, 2017
b639d71
Merge pull request #1 from apache/master
lxsmnv Feb 25, 2017
fe07de9
[SPARK-19673][SQL] "ThriftServer default app name is changed wrong"
lvdongr Feb 25, 2017
410392e
[SPARK-15288][MESOS] Mesos dispatcher should handle gracefully when a…
Feb 25, 2017
6ab6054
[MINOR][ML][DOC] Document default value for GeneralizedLinearRegressi…
jkbradley Feb 26, 2017
89608cf
[SPARK-17075][SQL][FOLLOWUP] fix some minor issues and clean up the code
cloud-fan Feb 26, 2017
68f2142
[SQL] Duplicate test exception in SQLQueryTestSuite due to meta files…
dilipbiswal Feb 26, 2017
9f8e392
[SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to h…
Feb 26, 2017
4ba9c6c
[MINOR][BUILD] Fix lint-java breaks in Java
HyukjinKwon Feb 27, 2017
8a5a585
[SPARK-15615][SQL][BUILD][FOLLOW-UP] Replace deprecated usage of json…
HyukjinKwon Feb 27, 2017
16d8472
[SPARK-19746][ML] Faster indexing for logistic aggregator
sethah Feb 28, 2017
7353038
[SPARK-19749][SS] Name socket source with a meaningful name
uncleGen Feb 28, 2017
a350bc1
[SPARK-19748][SQL] refresh function has a wrong order to do cache inv…
windpiger Feb 28, 2017
9b8eca6
[SPARK-19660][CORE][SQL] Replace the configuration property names tha…
wangyum Feb 28, 2017
b405466
[SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy
Feb 28, 2017
7c7fc30
[SPARK-19678][SQL] remove MetastoreRelation
cloud-fan Feb 28, 2017
9734a92
[SPARK-19677][SS] Committing a delta file atop an existing one should…
vitillo Feb 28, 2017
ce233f1
[SPARK-19463][SQL] refresh cache after the InsertIntoHadoopFsRelation…
windpiger Feb 28, 2017
7e5359b
[SPARK-19610][SQL] Support parsing multiline CSV files
HyukjinKwon Feb 28, 2017
d743ea4
[MINOR][DOC] Update GLM doc to include tweedie distribution
actuaryzhang Feb 28, 2017
bf5987c
[SPARK-19769][DOCS] Update quickstart instructions
elmiko Feb 28, 2017
ca3864d
[SPARK-19373][MESOS] Base spark.scheduler.minRegisteredResourceRatio …
Feb 28, 2017
0fe8020
[SPARK-14503][ML] spark.ml API for FPGrowth
YY-OnCall Feb 28, 2017
7315880
[SPARK-19572][SPARKR] Allow to disable hive in sparkR shell
zjffdu Mar 1, 2017
89cd384
[SPARK-19460][SPARKR] Update dataset used in R documentation, example…
wangmiao1981 Mar 1, 2017
4913c92
[SPARK-19633][SS] FileSource read from FileSink
lw-lin Mar 1, 2017
38e7835
[SPARK-19736][SQL] refreshByPath should clear all cached plans with t…
viirya Mar 1, 2017
5502a9c
[SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be…
stanzhai Mar 1, 2017
8aa560b
[SPARK-19761][SQL] create InMemoryFileIndex with an empty rootPaths w…
windpiger Mar 1, 2017
417140e
[SPARK-19787][ML] Changing the default parameter of regParam.
datumbox Mar 1, 2017
2ff1467
[DOC][MINOR][SPARKR] Update SparkR doc for names, columns and colnames
actuaryzhang Mar 1, 2017
db0ddce
[SPARK-19775][SQL] Remove an obsolete `partitionBy().insertInto()` te…
dongjoon-hyun Mar 1, 2017
51be633
[SPARK-19777] Scan runningTasksSet when check speculatable tasks in T…
Mar 2, 2017
89990a0
[SPARK-13931] Stage can hang if an executor fails while speculated ta…
Mar 2, 2017
de2b53d
[SPARK-19583][SQL] CTAS for data source table with a created location…
windpiger Mar 2, 2017
3bd8ddf
[MINOR][ML] Fix comments in LSH Examples and Python API
Mar 2, 2017
d2a8797
[SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dro…
markgrover Mar 2, 2017
8d6ef89
[SPARK-18352][DOCS] wholeFile JSON update doc and programming guide
felixcheung Mar 2, 2017
625cfe0
[SPARK-19733][ML] Removed unnecessary castings and refactored checked…
datumbox Mar 2, 2017
50c08e8
[SPARK-19704][ML] AFTSurvivalRegression should support numeric censorCol
zhengruifeng Mar 2, 2017
9cca3db
[SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS
Mar 2, 2017
5ae3516
[SPARK-19720][CORE] Redact sensitive information from SparkSubmit con…
markgrover Mar 2, 2017
433d9eb
[SPARK-19631][CORE] OutputCommitCoordinator should not allow commits …
Mar 2, 2017
8417a7a
[SPARK-19276][CORE] Fetch Failure handling robust to user error handling
squito Mar 3, 2017
93ae176
[SPARK-19745][ML] SVCAggregator captures coefficients in its closure
sethah Mar 3, 2017
f37bb14
[SPARK-19602][SQL][TESTS] Add tests for qualified column names
skambha Mar 3, 2017
e24f21b
[SPARK-19779][SS] Delete needless tmp file after restart structured s…
gf53520 Mar 3, 2017
982f322
[SPARK-18726][SQL] resolveRelation for FileFormat DataSource don't ne…
windpiger Mar 3, 2017
d556b31
[SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor …
HyukjinKwon Mar 3, 2017
fa50143
[SPARK-19739][CORE] propagate S3 session token to cluser
uncleGen Mar 3, 2017
0bac3e4
[SPARK-19797][DOC] ML pipeline document correction
ymwdalex Mar 3, 2017
776fac3
[SPARK-19801][BUILD] Remove JDK7 from Travis CI
dongjoon-hyun Mar 3, 2017
98bcc18
[SPARK-19758][SQL] Resolving timezone aware expressions with time zon…
viirya Mar 3, 2017
37a1c0e
[SPARK-19710][SQL][TESTS] Fix ordering of rows in query results
robbinspg Mar 3, 2017
9314c08
[SPARK-19774] StreamExecution should call stop() on sources when a st…
brkyvz Mar 3, 2017
ba186a8
[MINOR][DOC] Fix doc for web UI https configuration
jerryshao Mar 3, 2017
2a7921a
[SPARK-18939][SQL] Timezone support in partition values.
ueshin Mar 4, 2017
44281ca
[SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe
BryanCutler Mar 4, 2017
f5fdbe0
[SPARK-13446][SQL] Support reading data from Hive 2.0.1 metastore
gatorsmile Mar 4, 2017
a6a7a95
[SPARK-19718][SS] Handle more interrupt cases properly for Hadoop
zsxwing Mar 4, 2017
8234183
Merge pull request #2 from apache/master
lxsmnv Mar 4, 2017
615d9f0
Merge branch 'SPARK-19340' into master
lxsmnv Mar 4, 2017
1 change: 0 additions & 1 deletion .travis.yml
@@ -28,7 +28,6 @@ dist: trusty
# 2. Choose language and target JDKs for parallel builds.
language: java
jdk:
- oraclejdk7
- oraclejdk8

# 3. Setup cache directory for SBT and Maven.
2 changes: 1 addition & 1 deletion R/WINDOWS.md
@@ -38,6 +38,6 @@ To run the SparkR unit tests on Windows, the following steps are required —ass

```
R -e "install.packages('testthat', repos='http://cran.us.r-project.org')"
.\bin\spark-submit2.cmd --conf spark.hadoop.fs.default.name="file:///" R\pkg\tests\run-all.R
.\bin\spark-submit2.cmd --conf spark.hadoop.fs.defaultFS="file:///" R\pkg\tests\run-all.R
```

12 changes: 10 additions & 2 deletions R/pkg/R/DataFrame.R
@@ -280,7 +280,7 @@ setMethod("dtypes",

#' Column Names of SparkDataFrame
#'
#' Return all column names as a list.
#' Return a vector of column names.
#'
#' @param x a SparkDataFrame.
#'
@@ -338,7 +338,7 @@ setMethod("colnames",
})

#' @param value a character vector. Must have the same length as the number
#' of columns in the SparkDataFrame.
#' of columns to be renamed.
#' @rdname columns
#' @aliases colnames<-,SparkDataFrame-method
#' @name colnames<-
@@ -1804,6 +1804,10 @@ setClassUnion("numericOrcharacter", c("numeric", "character"))
#' @note [[ since 1.4.0
setMethod("[[", signature(x = "SparkDataFrame", i = "numericOrcharacter"),
function(x, i) {
if (length(i) > 1) {
warning("Subset index has length > 1. Only the first index is used.")
i <- i[1]
}
if (is.numeric(i)) {
cols <- columns(x)
i <- cols[[i]]
@@ -1817,6 +1821,10 @@ setMethod("[[", signature(x = "SparkDataFrame", i = "numericOrcharacter"),
#' @note [[<- since 2.1.1
setMethod("[[<-", signature(x = "SparkDataFrame", i = "numericOrcharacter"),
function(x, i, value) {
if (length(i) > 1) {
warning("Subset index has length > 1. Only the first index is used.")
i <- i[1]
}
if (is.numeric(i)) {
cols <- columns(x)
i <- cols[[i]]
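The hunks above add a length guard to the `[[` and `[[<-` methods. A minimal sketch of the resulting behavior, assuming an active SparkR session; the tiny data frame below is illustrative only, not part of this change:

```
# Illustrative data only: a small SparkDataFrame to exercise the new guard.
df <- createDataFrame(data.frame(name = c("a", "b"), age = c(30, 25)))

col <- df[["age"]]   # unchanged: a single index still returns a Column

# A length-2 index now emits the warning
# "Subset index has length > 1. Only the first index is used."
# and then behaves like df[[1]].
col2 <- df[[1:2]]
```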
10 changes: 7 additions & 3 deletions R/pkg/R/SQLContext.R
@@ -332,8 +332,10 @@ setMethod("toDF", signature(x = "RDD"),

#' Create a SparkDataFrame from a JSON file.
#'
#' Loads a JSON file (\href{http://jsonlines.org/}{JSON Lines text format or newline-delimited JSON}
#' ), returning the result as a SparkDataFrame
#' Loads a JSON file, returning the result as a SparkDataFrame
#' By default, (\href{http://jsonlines.org/}{JSON Lines text format or newline-delimited JSON}
#' ) is supported. For JSON (one record per file), set a named property \code{wholeFile} to
#' \code{TRUE}.
#' It goes through the entire dataset once to determine the schema.
#'
#' @param path Path of file to read. A vector of multiple paths is allowed.
@@ -346,6 +348,7 @@ setMethod("toDF", signature(x = "RDD"),
#' sparkR.session()
#' path <- "path/to/file.json"
#' df <- read.json(path)
#' df <- read.json(path, wholeFile = TRUE)
#' df <- jsonFile(path)
#' }
#' @name read.json
@@ -778,14 +781,15 @@ dropTempView <- function(viewName) {
#' @return SparkDataFrame
#' @rdname read.df
#' @name read.df
#' @seealso \link{read.json}
#' @export
#' @examples
#'\dontrun{
#' sparkR.session()
#' df1 <- read.df("path/to/file.json", source = "json")
#' schema <- structType(structField("name", "string"),
#' structField("info", "map<string,double>"))
#' df2 <- read.df(mapTypeJsonPath, "json", schema)
#' df2 <- read.df(mapTypeJsonPath, "json", schema, wholeFile = TRUE)
#' df3 <- loadDF("data/test_table", "parquet", mergeSchema = "true")
#' }
#' @name read.df
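The reader documentation above introduces the `wholeFile` option. A short usage sketch, assuming an active SparkR session; the file paths are hypothetical:

```
# Default: JSON Lines / newline-delimited JSON, one record per line.
df1 <- read.json("data/people.jsonl")

# wholeFile = TRUE lets a file carry a single, possibly multi-line, JSON record.
df2 <- read.json("data/people.json", wholeFile = TRUE)

# The same named property can be passed through the generic reader.
df3 <- read.df("data/people.json", source = "json", wholeFile = TRUE)
printSchema(df2)
```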
2 changes: 1 addition & 1 deletion R/pkg/R/generics.R
@@ -1406,7 +1406,7 @@ setGeneric("spark.randomForest",

#' @rdname spark.survreg
#' @export
setGeneric("spark.survreg", function(data, formula) { standardGeneric("spark.survreg") })
setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })

#' @rdname spark.svmLinear
#' @export
28 changes: 16 additions & 12 deletions R/pkg/R/mllib_classification.R
@@ -75,9 +75,9 @@ setClass("NaiveBayesModel", representation(jobj = "jobj"))
#' @examples
#' \dontrun{
#' sparkR.session()
#' df <- createDataFrame(iris)
#' training <- df[df$Species %in% c("versicolor", "virginica"), ]
#' model <- spark.svmLinear(training, Species ~ ., regParam = 0.5)
#' t <- as.data.frame(Titanic)
#' training <- createDataFrame(t)
#' model <- spark.svmLinear(training, Survived ~ ., regParam = 0.5)
#' summary <- summary(model)
#'
#' # fitted values on training data
@@ -207,6 +207,9 @@ function(object, path, overwrite = FALSE) {
#' excepting that at most one value may be 0. The class with largest value p/t is predicted, where p
#' is the original probability of that class and t is the class's threshold.
#' @param weightCol The weight column name.
#' @param aggregationDepth The depth for treeAggregate (greater than or equal to 2). If the dimensions of features
#' or the number of partitions are large, this param could be adjusted to a larger size.
#' This is an expert parameter. Default value should be good for most cases.
#' @param ... additional arguments passed to the method.
#' @return \code{spark.logit} returns a fitted logistic regression model.
#' @rdname spark.logit
@@ -217,9 +220,9 @@ function(object, path, overwrite = FALSE) {
#' \dontrun{
#' sparkR.session()
#' # binary logistic regression
#' df <- createDataFrame(iris)
#' training <- df[df$Species %in% c("versicolor", "virginica"), ]
#' model <- spark.logit(training, Species ~ ., regParam = 0.5)
#' t <- as.data.frame(Titanic)
#' training <- createDataFrame(t)
#' model <- spark.logit(training, Survived ~ ., regParam = 0.5)
#' summary <- summary(model)
#'
#' # fitted values on training data
@@ -236,28 +239,29 @@ function(object, path, overwrite = FALSE) {
#'
#' # multinomial logistic regression
#'
#' df <- createDataFrame(iris)
#' model <- spark.logit(df, Species ~ ., regParam = 0.5)
#' model <- spark.logit(training, Class ~ ., regParam = 0.5)
#' summary <- summary(model)
#'
#' }
#' @note spark.logit since 2.1.0
setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula, regParam = 0.0, elasticNetParam = 0.0, maxIter = 100,
tol = 1E-6, family = "auto", standardization = TRUE,
thresholds = 0.5, weightCol = NULL) {
thresholds = 0.5, weightCol = NULL, aggregationDepth = 2) {
formula <- paste(deparse(formula), collapse = "")

if (is.null(weightCol)) {
weightCol <- ""
if (!is.null(weightCol) && weightCol == "") {
weightCol <- NULL
} else if (!is.null(weightCol)) {
weightCol <- as.character(weightCol)
}

jobj <- callJStatic("org.apache.spark.ml.r.LogisticRegressionWrapper", "fit",
data@sdf, formula, as.numeric(regParam),
as.numeric(elasticNetParam), as.integer(maxIter),
as.numeric(tol), as.character(family),
as.logical(standardization), as.array(thresholds),
as.character(weightCol))
weightCol, as.integer(aggregationDepth))
new("LogisticRegressionModel", jobj = jobj)
})

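The `spark.logit` hunks above add an `aggregationDepth` argument and forward `weightCol` as NULL rather than an empty string. A minimal sketch of the widened call, assuming an active SparkR session; the `weight` column is hypothetical:

```
t <- as.data.frame(Titanic)
t$weight <- ifelse(t$Sex == "Male", 2.0, 1.0)  # hypothetical per-row weights
training <- createDataFrame(t)

model <- spark.logit(training, Survived ~ Class + Sex + Age,
                     regParam = 0.5,
                     weightCol = "weight",    # NULL or "" now both mean "no weight column"
                     aggregationDepth = 2)    # new expert parameter for treeAggregate (>= 2)
head(select(predict(model, training), "prediction"))
```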
15 changes: 8 additions & 7 deletions R/pkg/R/mllib_clustering.R
@@ -72,8 +72,9 @@ setClass("LDAModel", representation(jobj = "jobj"))
#' @examples
#' \dontrun{
#' sparkR.session()
#' df <- createDataFrame(iris)
#' model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
#' t <- as.data.frame(Titanic)
#' df <- createDataFrame(t)
#' model <- spark.bisectingKmeans(df, Class ~ Survived, k = 4)
#' summary(model)
#'
#' # get fitted result from a bisecting k-means model
@@ -82,7 +83,7 @@ setClass("LDAModel", representation(jobj = "jobj"))
#'
#' # fitted values on training data
#' fitted <- predict(model, df)
#' head(select(fitted, "Sepal_Length", "prediction"))
#' head(select(fitted, "Class", "prediction"))
#'
#' # save fitted model to input path
#' path <- "path/to/model"
@@ -338,14 +339,14 @@ setMethod("write.ml", signature(object = "GaussianMixtureModel", path = "charact
#' @examples
#' \dontrun{
#' sparkR.session()
#' data(iris)
#' df <- createDataFrame(iris)
#' model <- spark.kmeans(df, Sepal_Length ~ Sepal_Width, k = 4, initMode = "random")
#' t <- as.data.frame(Titanic)
#' df <- createDataFrame(t)
#' model <- spark.kmeans(df, Class ~ Survived, k = 4, initMode = "random")
#' summary(model)
#'
#' # fitted values on training data
#' fitted <- predict(model, df)
#' head(select(fitted, "Sepal_Length", "prediction"))
#' head(select(fitted, "Class", "prediction"))
#'
#' # save fitted model to input path
#' path <- "path/to/model"
38 changes: 23 additions & 15 deletions R/pkg/R/mllib_regression.R
@@ -68,14 +68,14 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
#' @examples
#' \dontrun{
#' sparkR.session()
#' data(iris)
#' df <- createDataFrame(iris)
#' model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
#' t <- as.data.frame(Titanic)
#' df <- createDataFrame(t)
#' model <- spark.glm(df, Freq ~ Sex + Age, family = "gaussian")
#' summary(model)
#'
#' # fitted values on training data
#' fitted <- predict(model, df)
#' head(select(fitted, "Sepal_Length", "prediction"))
#' head(select(fitted, "Freq", "prediction"))
#'
#' # save fitted model to input path
#' path <- "path/to/model"
@@ -102,14 +102,16 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
}

formula <- paste(deparse(formula), collapse = "")
if (is.null(weightCol)) {
weightCol <- ""
if (!is.null(weightCol) && weightCol == "") {
weightCol <- NULL
} else if (!is.null(weightCol)) {
weightCol <- as.character(weightCol)
}

# For known families, Gamma is upper-cased
jobj <- callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper",
"fit", formula, data@sdf, tolower(family$family), family$link,
tol, as.integer(maxIter), as.character(weightCol), regParam)
tol, as.integer(maxIter), weightCol, regParam)
new("GeneralizedLinearRegressionModel", jobj = jobj)
})

@@ -135,9 +137,9 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
#' @examples
#' \dontrun{
#' sparkR.session()
#' data(iris)
#' df <- createDataFrame(iris)
#' model <- glm(Sepal_Length ~ Sepal_Width, df, family = "gaussian")
#' t <- as.data.frame(Titanic)
#' df <- createDataFrame(t)
#' model <- glm(Freq ~ Sex + Age, df, family = "gaussian")
#' summary(model)
#' }
#' @note glm since 1.5.0
@@ -305,13 +307,15 @@ setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = "formula"
function(data, formula, isotonic = TRUE, featureIndex = 0, weightCol = NULL) {
formula <- paste(deparse(formula), collapse = "")

if (is.null(weightCol)) {
weightCol <- ""
if (!is.null(weightCol) && weightCol == "") {
weightCol <- NULL
} else if (!is.null(weightCol)) {
weightCol <- as.character(weightCol)
}

jobj <- callJStatic("org.apache.spark.ml.r.IsotonicRegressionWrapper", "fit",
data@sdf, formula, as.logical(isotonic), as.integer(featureIndex),
as.character(weightCol))
weightCol)
new("IsotonicRegressionModel", jobj = jobj)
})

@@ -372,6 +376,10 @@ setMethod("write.ml", signature(object = "IsotonicRegressionModel", path = "char
#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
#' operators are supported, including '~', ':', '+', and '-'.
#' Note that operator '.' is not supported currently.
#' @param aggregationDepth The depth for treeAggregate (greater than or equal to 2). If the dimensions of features
#' or the number of partitions are large, this param could be adjusted to a larger size.
#' This is an expert parameter. Default value should be good for most cases.
#' @param ... additional arguments passed to the method.
#' @return \code{spark.survreg} returns a fitted AFT survival regression model.
#' @rdname spark.survreg
#' @seealso survival: \url{https://cran.r-project.org/package=survival}
@@ -396,10 +404,10 @@ setMethod("write.ml", signature(object = "IsotonicRegressionModel", path = "char
#' }
#' @note spark.survreg since 2.0.0
setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula"),
function(data, formula) {
function(data, formula, aggregationDepth = 2) {
formula <- paste(deparse(formula), collapse = "")
jobj <- callJStatic("org.apache.spark.ml.r.AFTSurvivalRegressionWrapper",
"fit", formula, data@sdf)
"fit", formula, data@sdf, as.integer(aggregationDepth))
new("AFTSurvivalRegressionModel", jobj = jobj)
})

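The regression hunks above apply the same `weightCol` normalization and expose `aggregationDepth` on `spark.survreg`. A minimal sketch, assuming an active SparkR session and the `ovarian` data from the survival package (not part of this diff):

```
library(survival)  # only to supply the ovarian data used here for illustration
ovarianDF <- createDataFrame(ovarian)

model <- spark.survreg(ovarianDF, Surv(futime, fustat) ~ ecog_ps + rx,
                       aggregationDepth = 2)  # newly exposed treeAggregate depth (>= 2)
summary(model)
```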
18 changes: 10 additions & 8 deletions R/pkg/R/mllib_tree.R
@@ -143,14 +143,15 @@ print.summary.treeEnsemble <- function(x) {
#'
#' # fit a Gradient Boosted Tree Classification Model
#' # label must be binary - Only binary classification is supported for GBT.
#' df <- createDataFrame(iris[iris$Species != "virginica", ])
#' model <- spark.gbt(df, Species ~ Petal_Length + Petal_Width, "classification")
#' t <- as.data.frame(Titanic)
#' df <- createDataFrame(t)
#' model <- spark.gbt(df, Survived ~ Age + Freq, "classification")
#'
#' # numeric label is also supported
#' iris2 <- iris[iris$Species != "virginica", ]
#' iris2$NumericSpecies <- ifelse(iris2$Species == "setosa", 0, 1)
#' df <- createDataFrame(iris2)
#' model <- spark.gbt(df, NumericSpecies ~ ., type = "classification")
#' t2 <- as.data.frame(Titanic)
#' t2$NumericGender <- ifelse(t2$Sex == "Male", 0, 1)
#' df <- createDataFrame(t2)
#' model <- spark.gbt(df, NumericGender ~ ., type = "classification")
#' }
#' @note spark.gbt since 2.1.0
setMethod("spark.gbt", signature(data = "SparkDataFrame", formula = "formula"),
@@ -351,8 +352,9 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
#' summary(savedModel)
#'
#' # fit a Random Forest Classification Model
#' df <- createDataFrame(iris)
#' model <- spark.randomForest(df, Species ~ Petal_Length + Petal_Width, "classification")
#' t <- as.data.frame(Titanic)
#' df <- createDataFrame(t)
#' model <- spark.randomForest(df, Survived ~ Freq + Age, "classification")
#' }
#' @note spark.randomForest since 2.1.0
setMethod("spark.randomForest", signature(data = "SparkDataFrame", formula = "formula"),
10 changes: 9 additions & 1 deletion R/pkg/inst/tests/testthat/test_mllib_classification.R
@@ -211,7 +211,15 @@ test_that("spark.logit", {
df <- createDataFrame(data)
model <- spark.logit(df, label ~ feature)
prediction <- collect(select(predict(model, df), "prediction"))
expect_equal(prediction$prediction, c("0.0", "0.0", "1.0", "1.0", "0.0"))
expect_equal(sort(prediction$prediction), c("0.0", "0.0", "0.0", "1.0", "1.0"))

# Test prediction with weightCol
weight <- c(2.0, 2.0, 2.0, 1.0, 1.0)
data2 <- as.data.frame(cbind(label, feature, weight))
df2 <- createDataFrame(data2)
model2 <- spark.logit(df2, label ~ feature, weightCol = "weight")
prediction2 <- collect(select(predict(model2, df2), "prediction"))
expect_equal(sort(prediction2$prediction), c("0.0", "0.0", "0.0", "0.0", "0.0"))
})

test_that("spark.mlp", {
18 changes: 18 additions & 0 deletions R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -898,6 +898,12 @@ test_that("names() colnames() set the column names", {
expect_equal(names(z)[3], "c")
names(z)[3] <- "c2"
expect_equal(names(z)[3], "c2")

# Test subset assignment
colnames(df)[1] <- "col5"
expect_equal(colnames(df)[1], "col5")
names(df)[2] <- "col6"
expect_equal(names(df)[2], "col6")
})

test_that("head() and first() return the correct data", {
@@ -1015,6 +1021,18 @@ test_that("select operators", {
expect_is(df[[2]], "Column")
expect_is(df[["age"]], "Column")

expect_warning(df[[1:2]],
"Subset index has length > 1. Only the first index is used.")
expect_is(suppressWarnings(df[[1:2]]), "Column")
expect_warning(df[[c("name", "age")]],
"Subset index has length > 1. Only the first index is used.")
expect_is(suppressWarnings(df[[c("name", "age")]]), "Column")

expect_warning(df[[1:2]] <- df[[1]],
"Subset index has length > 1. Only the first index is used.")
expect_warning(df[[c("name", "age")]] <- df[[1]],
"Subset index has length > 1. Only the first index is used.")

expect_is(df[, 1, drop = F], "SparkDataFrame")
expect_equal(columns(df[, 1, drop = F]), c("name"))
expect_equal(columns(df[, "age", drop = F]), c("age"))