@yanboliang yanboliang commented Apr 14, 2016

What changes were proposed in this pull request?

Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R's summary.glm are exposed; this PR provides only the most commonly used statistics. More statistics can be added in follow-up work.

How was this patch tested?

Unit tests.

SparkR Output:

Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1

R output:

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.95096  -0.16522   0.00171   0.18416   0.72918  

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
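
For readers who want to regenerate the R side of this comparison: the description does not state the model formula, but the coefficient names and values above are consistent with a Gaussian GLM of Sepal.Width on Sepal.Length and Species over the built-in iris data. A minimal sketch under that assumption, using only base R:

```r
# Assumption: the formula is inferred from the coefficient names in the output
# above; it is not stated in the PR description.
fit <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris, family = gaussian())

# summary() on a glm object collects the coefficient table, dispersion
# parameter, null/residual deviance and AIC; print() formats them.
fit_summary <- summary(fit)
print(fit_summary)

# Individual statistics can also be pulled out directly.
coef(fit_summary)                               # estimate, std. error, t value, p-value
fit_summary$dispersion                          # dispersion parameter
deviance(fit)                                   # residual deviance
AIC(fit)                                        # Akaike information criterion
quantile(residuals(fit, type = "deviance"))     # exact deviance residual quantiles
```

The quantile() call is included because some newer R releases no longer print the deviance residual summary by default, so computing it explicitly is the more portable way to compare against the SparkR numbers.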

cc @mengxr

R/pkg/R/mllib.R Outdated
#' @export
print.summary.GeneralizedLinearRegressionModel <- function(x, ...) {
  x$deviance.resid <- setNames(unlist(approxQuantile(x$deviance.resid, "devianceResiduals",
    c(0.0, 0.25, 0.5, 0.75, 1.0), 0.0)), c("Min", "1Q", "Median", "3Q", "Max"))
Contributor Author

Here we set the relativeError of approxQuantile to 0.0, which may be very expensive to compute. Should we change it to a looser value and document the difference between SparkR and R?

Contributor

This is quite expensive. If we just need to show the quartiles, a relative error of 0.01 should work well. I'm not sure whether the min and max would be correct or not; that is worth testing. In the output, we can mention relative error <= 0.01.

Contributor Author

Updated. The min and max values are correct. See the updated summary output in the PR description above.
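
To make the trade-off in this thread concrete: Spark documents the relative error of approxQuantile as a bound on rank, so that for N elements, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below checks that bound in plain R; `within_rank_error` is a hypothetical helper written for illustration, not a SparkR function.

```r
# Hypothetical helper (not part of SparkR): TRUE if `value` is an acceptable
# answer for the quantile of `x` at probability `p` under rank error `eps`.
within_rank_error <- function(x, value, p, eps) {
  n <- length(x)
  rank <- sum(x <= value)  # number of elements at or below `value`
  floor((p - eps) * n) <= rank && rank <= ceiling((p + eps) * n)
}

x <- 1:1000
within_rank_error(x, 505, p = 0.5, eps = 0.01)  # rank 505 lies inside the allowed band
within_rank_error(x, 520, p = 0.5, eps = 0.01)  # rank 520 lies outside it

# With eps = 0 only the exact rank is accepted, which is the costly setting
# discussed in this thread.
within_rank_error(x, 505, p = 0.5, eps = 0)
within_rank_error(x, 500, p = 0.5, eps = 0)
```

With the 150 observations implied by the summary output, an error of 0.01 corresponds to roughly 1.5 ranks of slack, which is in line with the small differences between the SparkR and R quartiles in the description. The bound on its own does not pin the min and max to the exact extremes, which is presumably why they were checked separately here.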

SparkQA commented Apr 14, 2016

Test build #55819 has finished for PR 12393 at commit 51ce286.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @rdname print
#' @name print.summary.GeneralizedLinearRegressionModel
#' @export
print.summary.GeneralizedLinearRegressionModel <- function(x, ...) {
Contributor Author

Add this S3 function for formatted output of summary(GeneralizedLinearRegressionModel).
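
As background on the mechanism, an S3 print method is an ordinary function named print.<class>, and print() dispatches to it based on the object's class attribute. The following is a minimal base R sketch of that pattern; the class name `summary.ToyModel` and its fields are hypothetical and are not the structure SparkR actually returns.

```r
# Toy stand-in for a model summary; the class name and fields are hypothetical.
toy_summary <- structure(
  list(coefficients = c("(Intercept)" = 1.5, x = 0.25), aic = 10),
  class = "summary.ToyModel"
)

# S3 dispatch: print(obj) looks for print.<class>, so this method is used
# whenever an object of class "summary.ToyModel" is printed.
print.summary.ToyModel <- function(x, ...) {
  cat("Coefficients:\n")
  print(x$coefficients)
  cat("\nAIC: ", format(x$aic), "\n", sep = "")
  invisible(x)  # print methods conventionally return their argument invisibly
}

print(toy_summary)
```

Keeping the formatting in a print method leaves the summary object itself as plain data, so callers can still read individual fields programmatically.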

mengxr commented Apr 15, 2016

LGTM except relativeError and a magic number in test. This PR looks great!!

SparkQA commented Apr 15, 2016

Test build #55920 has finished for PR 12393 at commit 5424615.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr commented Apr 15, 2016

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 83af297 Apr 15, 2016
@yanboliang yanboliang deleted the spark-13925 branch April 16, 2016 04:21
lw-lin pushed a commit to lw-lin/spark that referenced this pull request Apr 20, 2016
…:glm for more family and link functions

## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R's [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; this PR provides only the most commonly used statistics. More statistics can be added in follow-up work.

## How was this patch tested?
Unit tests.

SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```

cc mengxr

Author: Yanbo Liang <[email protected]>

Closes apache#12393 from yanboliang/spark-13925.