
Conversation

@titicaca
Contributor

What changes were proposed in this pull request?

Fix a bug in the collect method when collecting a timestamp column. The bug can be reproduced with the following code and output:

```
library(SparkR)
sparkR.session(master = "local")
df <- data.frame(col1 = c(0, 1, 2),
                 col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))

sdf1 <- createDataFrame(df)
print(dtypes(sdf1))
df1 <- collect(sdf1)
print(lapply(df1, class))

sdf2 <- filter(sdf1, "col1 > 0")
print(dtypes(sdf2))
df2 <- collect(sdf2)
print(lapply(df2, class))
```

As we can see from the printed output, the column type of col2 in df2 is unexpectedly converted to numeric when an NA is at the top of the column.

This is caused by `do.call(c, list)`: when the list starts with NA, e.g. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.

Therefore, we need to cast the data type of the vector explicitly.
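The class drop can be seen directly in plain R, independently of SparkR (a minimal sketch of the failure and the explicit cast that fixes it):

```r
# c() dispatches on its first argument; a leading logical NA means the
# POSIXct method is never used and the class attribute is stripped.
vec <- do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))
class(vec)   # "numeric": the timestamp degraded to seconds since the epoch

# Restoring the class explicitly recovers the timestamps; the NA is preserved.
class(vec) <- c("POSIXct", "POSIXt")
class(vec)   # "POSIXct" "POSIXt"
```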

How was this patch tested?

The patch can be tested manually with the same code above.

@titicaca titicaca changed the title SPARK-19342 bug fixed in collect method for collecting timestamp column [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column Jan 24, 2017
@felixcheung
Member

Thanks! I can verify this case and the fix.
Could you please add some tests for this?

@HyukjinKwon
Member

HyukjinKwon commented Jan 24, 2017

(Oh. it was all written in the PR description... I removed my useless comments..)

vec <- do.call(c, colTail)
classVal <- class(vec)
vec <- c(rep(NA, valueIndex[1] - 1), vec)
class(vec) <- classVal
Member

Hmm, what happened here?
If you want to drop the NAs and use the rest to infer the class, you can do col[!is.na(col)].
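The suggested pattern, sketched on the list representation used during deserialization (hypothetical variable names, not the actual SparkR code):

```r
# col is a list of deserialized column values, possibly containing NAs
col <- list(NA, as.POSIXct("2017-01-01 12:00:01"))

# Drop the NA entries, then infer the class from what remains
nonNA <- col[!is.na(col)]
class(do.call(c, nonNA))   # "POSIXct" "POSIXt"
```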

@titicaca
Contributor Author

Sure. Shall I add the tests in pkg/inst/tests/testthat/test_sparkSQL.R?

@felixcheung
Member

Yes, but please see my other comment.

…mn, if column values are all NAs, the type is logical
@titicaca
Contributor Author

Sorry for the late reply. I figured out that the tests failed because if a vector contains only NAs, its type is logical, so we cannot cast the type in that case. I have updated the code and added some tests for that. Thank you for the advice.
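The all-NA case in plain R (the behavior the updated code has to special-case):

```r
# A vector of bare NAs carries no type information: it defaults to logical,
# so the column's intended class cannot be inferred from the values alone.
class(c(NA, NA))   # "logical"
```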

@felixcheung
Member

great! @shivaram could you get Jenkins to test this fix please? I don't seem to have the power to command it :)

@shivaram
Contributor

Jenkins, retest this please

if (!is.null(PRIMITIVE_TYPES[[colType]]) && colType != "binary") {
  vec <- do.call(c, col)
  stopifnot(class(vec) != "list")
  # If vec is a vector with only NAs, the type is logical
Member

If the DataFrame column is of type string, shouldn't it convert to R as character (which can be all NA), even though the column only has NULLs (which map to NA in R)?

It seems with this change it would become logical in R instead of character.

Contributor Author

Yes. My first commit tried to cast the column to its corresponding R data type explicitly, even if it is a vector of all NAs. However, some existing tests failed because they expected a logical NA. For example:

3. Failure: column functions (@test_sparkSQL.R#1280) ---------------------------
collect(select(df, first(df$age)))[[1]] not equal to NA.
Types not compatible: double vs logical
4. Failure: column functions (@test_sparkSQL.R#1282) ---------------------------
collect(select(df, first("age")))[[1]] not equal to NA.
Types not compatible: double vs logical

Contributor Author

In local R, if we try

df <- data.frame(x = c(0,1,2), y = c(NA, NA, 1))
class(head(df, 1)$y)

The output is still numeric instead of logical. But the existing test expects a logical NA instead of a numeric NA.

So is it necessary to correct the existing tests? For example, changing @test_sparkSQL.R#1280
from expect_equal(collect(select(df, first(df$age)))[[1]], NA) to
expect_equal(collect(select(df, first(df$age)))[[1]], NA_real_).
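The distinction behind the proposed test change is R's typed NAs, shown here in plain R without SparkR:

```r
class(NA)          # "logical": the untyped default NA
class(NA_real_)    # "numeric": the double-typed NA

# expect_equal compares types too, which is why the failure reads
# "Types not compatible: double vs logical"
identical(NA, NA_real_)   # FALSE
```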

@shivaram
Contributor

Jenkins, ok to test

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71969 has finished for PR 16689 at commit 6a0eb3f.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…to PRIMITIVE_TYPE, in addition two existing tests (@test_sparkSQL.R#1280 and @test_sparkSQL.R#1282) are modified
@SparkQA

SparkQA commented Jan 25, 2017

Test build #71982 has finished for PR 16689 at commit 43e334a.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@titicaca
Contributor Author

I have modified the code and tests, including the existing tests @test_sparkSQL.R#1280 and @test_sparkSQL.R#1282.

As in local R, an NA column of a SparkDataFrame will now be collected as its corresponding type instead of as a logical NA.

@felixcheung
Member

Just to make sure you see this: #16689 (comment)

@SparkQA

SparkQA commented Jan 31, 2017

Test build #72182 has finished for PR 16689 at commit 7903bb3.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 31, 2017

Test build #72186 has finished for PR 16689 at commit 8379c38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

stopifnot(class(vec) != "list")
class(vec) <-
  if (colType == "timestamp")
    c("POSIXct", "POSIXt")
Member

Why should the class be c("POSIXct", "POSIXt") in this case?

Contributor Author

Because PRIMITIVE_TYPES[["timestamp"]] is POSIXct, and POSIXct usually comes together with POSIXt. POSIXt is a virtual class used to allow operations such as subtraction to mix the two classes POSIXct and POSIXlt.
The previous conversion also converted timestamp to c("POSIXct", "POSIXt").
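This matches what base R itself produces for timestamp values:

```r
class(Sys.time())               # "POSIXct" "POSIXt"
class(as.POSIXlt(Sys.time()))   # "POSIXlt" "POSIXt"
# Both concrete classes share the virtual POSIXt class, which is what
# lets operations such as subtraction mix POSIXct and POSIXlt values.
```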

Member

Contributor Author

It looks better if it won't affect other methods. I will try it. Thanks for the advice.

if (colType == "timestamp")
  c("POSIXct", "POSIXt")
else
  PRIMITIVE_TYPES[[colType]]
Member

By setting these instead of having it inferred - does this break any existing behavior? Does any type differ because of this line of change?

Contributor Author

Currently all tests pass, except for the two modified tests with NA types as discussed before. The following are all the type conversions from SparkDataFrame to R data.frame, which are covered by the existing tests in test_sparkSQL.R.

PRIMITIVE_TYPES <- as.environment(list(
  "tinyint" = "integer",
  "smallint" = "integer",
  "int" = "integer",
  "bigint" = "numeric",
  "float" = "numeric",
  "double" = "numeric",
  "decimal" = "numeric",
  "string" = "character",
  "binary" = "raw",
  "boolean" = "logical",
  "timestamp" = "POSIXct",
  "date" = "Date",
  # following types are not SQL types returned by dtypes(). They are listed here for usage
  # by checkType() in schema.R.
  # TODO: refactor checkType() in schema.R.
  "byte" = "integer",
  "integer" = "integer"
  ))

@SparkQA

SparkQA commented Feb 1, 2017

Test build #72250 has finished for PR 16689 at commit 407c625.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@titicaca
Contributor Author

titicaca commented Feb 2, 2017

I tried to modify PRIMITIVE_TYPES for timestamp, but it had a side effect on the coltypes method.

In test_sparkSQL.R#2262, expect_equal(coltypes(DF), c("integer", "logical", "POSIXct")), coltypes returns a list instead of a vector because of the conversion from timestamp to c("POSIXct", "POSIXt").
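The simplification failure can be sketched independently of SparkR (hypothetical data, not the actual coltypes implementation): sapply only simplifies its result to a vector when every element has length 1.

```r
# Mapping each SQL type to its R class; "timestamp" now yields a
# length-2 vector, so sapply cannot simplify to a character vector.
types <- list("int" = "integer",
              "boolean" = "logical",
              "timestamp" = c("POSIXct", "POSIXt"))
result <- sapply(types, function(t) t)
is.list(result)   # TRUE: a list, not the expected character vector
```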

@felixcheung
Member

felixcheung commented Feb 2, 2017

Hmm, that's not a super big issue since vectors and lists are more or less the same in R.
I think it might be better if we treat the type consistently, although it might be concerning if this changes in a non-backward-compatible manner.

Let me try to find some time to test this out. Thanks!

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72378 has finished for PR 16689 at commit d6d454e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@titicaca
Contributor Author

titicaca commented Feb 4, 2017

Thanks. I fixed the coltypes method to accommodate the timestamp change, and it passes all the tests now.

@felixcheung
Member

hmm, this seems like a reasonable approach. With these changes:

  • collect on timestamp would get c("POSIXct", "POSIXt")
  • coltypes output will not change

@shivaram what do you think?

@shivaram
Contributor

shivaram commented Feb 8, 2017

@felixcheung @titicaca Just to make sure I understand: collect on timestamp was getting c("POSIXct", "POSIXt") even before this change?

@titicaca
Contributor Author

titicaca commented Feb 9, 2017

Yes, collect on timestamp was getting c("POSIXct", "POSIXt"). But when an NA was at the top of the timestamp column, it was getting numeric, as I described in the PR description.

@shivaram
Contributor

shivaram commented Feb 9, 2017

Ok - I think this sounds good then! @felixcheung Let me know if you want me to take a look at the code as well; if not, feel free to merge when you think it's ready.

@felixcheung
Member

Great, this is a good catch, and thank you for fixing this @titicaca.
merging to master, branch-2.1

asfgit pushed a commit that referenced this pull request Feb 12, 2017
…stamp column


Author: titicaca <[email protected]>

Closes #16689 from titicaca/sparkr-dev.

(cherry picked from commit bc0a0e6)
Signed-off-by: Felix Cheung <[email protected]>
@asfgit asfgit closed this in bc0a0e6 Feb 12, 2017
@felixcheung
Member

@titicaca do you have a JIRA id on https://issues.apache.org? We would like to credit the bug resolution to you.

@titicaca
Contributor Author

titicaca commented Feb 13, 2017

Yes. The JIRA id is SPARK-19342. This is my first commit to the Spark project. Thank you for the help and advice :)

@srowen
Member

srowen commented Feb 13, 2017

@titicaca he means: what is your user ID on JIRA, so we can credit you? It's clear which JIRA it is.

@titicaca
Contributor Author

Thanks for the reminder. I may have forgotten to mention that I am the reporter of this JIRA bug. My JIRA ID is also titicaca. Thank you!

@felixcheung
Member

Great, done!
Looking forward to more contributions from you :)

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…stamp column


Author: titicaca <[email protected]>

Closes apache#16689 from titicaca/sparkr-dev.