[SPARK-13925] [ML] [SparkR] Expose R-like summary statistics in SparkR::glm for more family and link functions #12393
R/pkg/R/mllib.R
Outdated
#' @export
print.summary.GeneralizedLinearRegressionModel <- function(x, ...) {
  x$deviance.resid <- setNames(unlist(approxQuantile(x$deviance.resid, "devianceResiduals",
    c(0.0, 0.25, 0.5, 0.75, 1.0), 0.0)), c("Min", "1Q", "Median", "3Q", "Max"))
Here we set the relativeError of approxQuantile to 0.0, which may be very expensive to compute. Should we change it to a looser value and document the difference between SparkR and R?
This is quite expensive. If we just need to show the quartiles, a relative error of 0.01 should work well. I'm not sure whether the min and max would still be correct, so that is worth testing. In the output, we can mention that the relative error is <= 0.01.
Updated. The min and max values are correct. See the updated summary output in the PR description above.
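To make the trade-off in this thread concrete, here is a minimal base R sketch of what a rank-based relative error means for a quantile. It assumes the bound described in the Spark documentation for approxQuantile, namely that the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceiling((p + err) * N). The helper `within_rank_bound` and the simulated residuals are hypothetical and are not part of SparkR.

```r
# Illustrative helper (not a SparkR function): checks whether `value` is an
# acceptable answer for the quantile at probability `p` with relative error
# `err`, assuming the rank bound
#   floor((p - err) * N) <= rank(value) <= ceiling((p + err) * N)
within_rank_bound <- function(x, value, p, err) {
  n <- length(x)
  rank_of_value <- sum(x <= value)   # 1-based rank of `value` within x
  lower <- floor((p - err) * n)
  upper <- ceiling((p + err) * n)
  rank_of_value >= lower && rank_of_value <= upper
}

set.seed(42)
resid <- rnorm(1000)      # stand-in for deviance residuals
sorted <- sort(resid)

exact_median <- sorted[500]
nearby <- sorted[508]     # 8 ranks away from the exact median
far <- sorted[520]        # 20 ranks away from the exact median

within_rank_bound(resid, exact_median, 0.5, 0.01)  # TRUE
within_rank_bound(resid, nearby, 0.5, 0.01)        # TRUE: within 0.01 * 1000 ranks
within_rank_bound(resid, far, 0.5, 0.01)           # FALSE
within_rank_bound(resid, far, 0.5, 0.0)            # FALSE: err = 0 demands the exact rank

# The bound alone would also accept a value a few ranks above the true minimum,
# which is why checking the reported min and max separately is worthwhile.
within_rank_bound(resid, sorted[5], 0.0, 0.01)     # TRUE
```

With err = 0.01 the quartiles may shift by up to 1% of the row count in rank, which is consistent with the small differences between the SparkR and R quartiles reported in this PR.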
Test build #55819 has finished for PR 12393 at commit
#' @rdname print
#' @name print.summary.GeneralizedLinearRegressionModel
#' @export
print.summary.GeneralizedLinearRegressionModel <- function(x, ...) {
Add this S3 function for formatted output of summary(GeneralizedLinearRegressionModel).
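For readers less familiar with S3 dispatch, the following base R sketch shows how a `print.summary.<class>` method gets picked up when a summary object is printed. The class name `summary.ToyModel`, its fields, and the values are hypothetical and chosen only for illustration; this is not the SparkR implementation.

```r
# Hypothetical summary object: a plain list tagged with an S3 class.
toy_summary <- structure(
  list(
    deviance.resid = c(Min = -0.95, `1Q` = -0.17, Median = 0.00,
                       `3Q` = 0.17, Max = 0.73),
    aic = 59.22
  ),
  class = "summary.ToyModel"
)

# S3 print method: print() dispatches here because of the class attribute.
print.summary.ToyModel <- function(x, ...) {
  cat("\nDeviance Residuals:\n")
  print(x$deviance.resid, digits = 5)
  cat("\nAIC: ", format(x$aic, digits = 4), "\n", sep = "")
  invisible(x)   # print methods conventionally return their argument invisibly
}

# Capture the formatted output to show that dispatch reached the method.
out <- capture.output(print(toy_summary))
```

Typing `toy_summary` at the console would trigger the same method, which is how a formatted summary appears without an explicit print call.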
LGTM except
Test build #55920 has finished for PR 12393 at commit
LGTM. Merged into master. Thanks!
…:glm for more family and link functions

## What changes were proposed in this pull request?

Expose R-like summary statistics in SparkR::glm for more family and link functions.

Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; this PR provides only the most commonly used statistics. More statistics can be added in follow-up work.

## How was this patch tested?

Unit tests.

SparkR Output:

```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```

R output:

```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length        0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor  -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica   -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```

cc mengxr

Author: Yanbo Liang <[email protected]>

Closes apache#12393 from yanboliang/spark-13925.
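As a rough way to reproduce the native R side of this comparison, the sketch below fits a gaussian GLM on the built-in `iris` data. The formula `Sepal.Width ~ Sepal.Length + Species` is an assumption inferred from the coefficient names and degrees of freedom in the output; the PR text does not state it. Only base R and the `stats` package are used.

```r
# Assumed model: response and predictors are inferred from the reported
# coefficient names, not stated in the PR.
fit <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris, family = gaussian)
fit_summary <- summary(fit)

# Exact quantiles of the deviance residuals, labelled as in summary.glm output.
dev_resid <- residuals(fit, type = "deviance")
resid_quantiles <- setNames(
  quantile(dev_resid, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE),
  c("Min", "1Q", "Median", "3Q", "Max")
)

print(round(resid_quantiles, 5))
print(round(coef(fit_summary), 5))   # Estimate, Std. Error, t value, Pr(>|t|)
cat("Null deviance:", fit$null.deviance, "on", fit$df.null, "degrees of freedom\n")
cat("Residual deviance:", deviance(fit), "on", fit$df.residual, "degrees of freedom\n")
cat("AIC:", AIC(fit), "\n")
```

Because these quantiles are exact, they should line up with the R output rather than with the approximate SparkR quartiles.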