[SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column #16689
Thanks! I can verify this case and the fix.

(Oh. it was all written in the PR description... I removed my useless comments..)
R/pkg/R/DataFrame.R
Outdated
vec <- do.call(c, colTail)
classVal <- class(vec)
vec <- c(rep(NA, valueIndex[1] - 1), vec)
class(vec) <- classVal
Hmm, what happened here?
If you want to drop the NAs and use the rest to infer the class, you can use col[!is.na(col)].
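The suggestion above can be sketched in plain R without Spark. The list `col` below is a hypothetical stand-in for a collected column that starts with an NA; it is an assumption for illustration, not data from this PR.

```r
# Hypothetical column as it might arrive row by row: a list with a leading NA.
col <- list(NA,
            as.POSIXct("2017-01-01 00:00:01", tz = "UTC"),
            as.POSIXct("2017-01-01 12:00:01", tz = "UTC"))

# is.na() on a list flags elements that are a single atomic NA.
nonNA <- col[!is.na(col)]

# With no NA in front, c() dispatches on a POSIXct value and keeps the class.
inferred <- class(do.call(c, nonNA))
print(inferred)
```

The inferred class would then have to be reapplied to the full vector, which is what the outdated diff above does with `classVal`.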
Sure. Shall I add the tests in pkg/inst/tests/testthat/test_sparkSQL.R?

yes. but please see my other comment
…mn, if column values are all NAs, the type is logical
Sorry for the late reply. I figured out that the tests failed because a vector that contains only NAs has type logical, so we cannot cast the type in that case. I have updated the code and added some tests for that. Thank you for the advice.
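The all-NA case mentioned above is easy to check in base R. This is a minimal sketch; the variable names are illustrative only.

```r
# Combining only NAs leaves nothing to infer a type from, so R uses logical.
allNA <- do.call(c, list(NA, NA, NA))
print(class(allNA))

# An explicit conversion gives typed NAs instead.
asNumeric <- as.numeric(allNA)
asCharacter <- as.character(allNA)
print(class(asNumeric))
print(class(asCharacter))
```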
great! @shivaram could you get Jenkins to test this fix please? I don't seem to have the power to command it :)

Jenkins, retest this please
R/pkg/R/DataFrame.R
Outdated
if (!is.null(PRIMITIVE_TYPES[[colType]]) && colType != "binary") {
  vec <- do.call(c, col)
  stopifnot(class(vec) != "list")
  # If vec is an vector with only NAs, the type is logical
If the DataFrame column is of type string, shouldn't it convert to character in R (which can be all NA), even when the column only has NULLs (which map to NA in R)?
It seems that with this change it would become logical in R instead of character.
Yes. My first commit tried to cast the column to its corresponding R data type explicitly, even if it is a vector with all NAs. However, some existing tests failed because they expect a logical NA. For example:
3. Failure: column functions (@test_sparkSQL.R#1280) ---------------------------
collect(select(df, first(df$age)))[[1]] not equal to NA.
Types not compatible: double vs logical
4. Failure: column functions (@test_sparkSQL.R#1282) ---------------------------
collect(select(df, first("age")))[[1]] not equal to NA.
Types not compatible: double vs logical
In local R, if we try
df <- data.frame(x = c(0,1,2), y = c(NA, NA, 1))
class(head(df, 1)$y)
the output is still numeric instead of logical. But the existing test expects a logical NA instead of a numeric NA.
So is it necessary to correct the existing tests, for example changing @test_sparkSQL.R#1280
from expect_equal(collect(select(df, first(df$age)))[[1]], NA) to
expect_equal(collect(select(df, first(df$age)))[[1]], NA_real_)?
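The type difference behind the "Types not compatible: double vs logical" failures can be seen without testthat. The sketch below uses only base R and the same data.frame as the comment above.

```r
df <- data.frame(x = c(0, 1, 2), y = c(NA, NA, 1))
firstY <- head(df, 1)$y

# The value is NA, but it keeps the column's double type.
print(typeof(firstY))

# A bare NA is logical, so it is not identical to the typed NA.
sameAsLogicalNA <- identical(firstY, NA)
sameAsRealNA <- identical(firstY, NA_real_)
print(c(sameAsLogicalNA, sameAsRealNA))
```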
Jenkins, ok to test

Test build #71969 has finished for PR 16689 at commit
…to PRIMITIVE_TYPE, in addition two existing tests (@test_sparkSQL.R#1280 and @test_sparkSQL.R#1282) are modified
Test build #71982 has finished for PR 16689 at commit

I have modified the code and tests, including the existing tests @test_sparkSQL.R#1280 and @test_sparkSQL.R#1282. As in local R, an all-NA column of a SparkDataFrame is now collected as its corresponding type instead of as logical NA.
Just to make sure you see this: #16689 (comment)

Test build #72182 has finished for PR 16689 at commit

Test build #72186 has finished for PR 16689 at commit
R/pkg/R/DataFrame.R
Outdated
stopifnot(class(vec) != "list")
class(vec) <-
  if (colType == "timestamp")
    c("POSIXct", "POSIXt")
why should the class be c("POSIXct", "POSIXt") in this case?
Because PRIMITIVE_TYPES[["timestamp"]] is POSIXct, and POSIXct usually comes together with POSIXt. POSIXt is a virtual class used to allow operations such as subtraction to mix the two classes POSIXct and POSIXlt.
The previous conversion also converted timestamp to c("POSIXct", "POSIXt").
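The role of POSIXt described above can be illustrated in base R. The two timestamps are arbitrary values chosen for this sketch.

```r
ct <- as.POSIXct("2017-01-01 12:00:01", tz = "UTC")
lt <- as.POSIXlt("2017-01-01 00:00:01", tz = "UTC")

# Both representations carry POSIXt as a shared second class.
print(class(ct))
print(class(lt))

# Subtraction is defined on POSIXt, so the two classes can be mixed.
gap <- ct - lt
gapHours <- as.numeric(gap, units = "hours")
print(gapHours)
```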
Should PRIMITIVE_TYPES[["timestamp"]] be changed then?
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L32
That looks better, as long as it doesn't affect other methods. I will try it. Thanks for the advice.
R/pkg/R/DataFrame.R
Outdated
  if (colType == "timestamp")
    c("POSIXct", "POSIXt")
  else
    PRIMITIVE_TYPES[[colType]]
By setting these instead of having them inferred, does this break any existing behavior? Does any type differ because of this change?
Currently all tests pass, except for the two modified tests with NA types discussed earlier. The following are all the type conversions from SparkDataFrame to R data.frame, which are covered by the existing tests in test_sparkSQL.R.
PRIMITIVE_TYPES <- as.environment(list(
  "tinyint" = "integer",
  "smallint" = "integer",
  "int" = "integer",
  "bigint" = "numeric",
  "float" = "numeric",
  "double" = "numeric",
  "decimal" = "numeric",
  "string" = "character",
  "binary" = "raw",
  "boolean" = "logical",
  "timestamp" = "POSIXct",
  "date" = "Date",
  # following types are not SQL types returned by dtypes(). They are listed here for usage
  # by checkType() in schema.R.
  # TODO: refactor checkType() in schema.R.
  "byte" = "integer",
  "integer" = "integer"
))
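The idea of casting by declared column type rather than by inference could be sketched as below. This is not SparkR's implementation: `converters` and `collectColumn` are hypothetical names invented here, only a few types are covered, and the timestamp converter drops any time zone attribute.

```r
# Hypothetical converters keyed by Spark SQL type name (assumption, not SparkR code).
converters <- list(
  "int"       = as.integer,
  "double"    = as.numeric,
  "string"    = as.character,
  "boolean"   = as.logical,
  "timestamp" = function(x) structure(as.numeric(x), class = c("POSIXct", "POSIXt")),
  "date"      = function(x) structure(as.numeric(x), class = "Date")
)

# Hypothetical helper: flatten a list column, then cast by its declared type.
collectColumn <- function(col, colType) {
  vec <- do.call(c, col)      # the class may be lost here if col starts with NA
  stopifnot(!is.list(vec))
  converters[[colType]](vec)  # the explicit cast no longer depends on the first value
}

ts <- as.POSIXct("2017-01-01 12:00:01", tz = "UTC")
tsCol <- collectColumn(list(NA, ts), "timestamp")
strCol <- collectColumn(list(NA, NA), "string")
print(class(tsCol))
print(class(strCol))
```

Under these assumptions, a timestamp column with a leading NA keeps its POSIXct class, and an all-NA string column comes back as character rather than logical.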
Test build #72250 has finished for PR 16689 at commit

I tried to modify the PRIMITIVE_TYPES for timestamp, but it had a side effect on the coltypes method. In test_sparkSQL.R#2262,

Hmm, that's not a super big issue since vectors and lists are more or less the same in R. Let me try to find some time to test this out. Thanks!
Test build #72378 has finished for PR 16689 at commit

Thanks. I tried to fix the method

hmm, this seems like a reasonable approach. With these changes:
@shivaram what do you think?

@felixcheung @titicaca Just to make sure I understand, collect on timestamp was getting

Yes, collect on timestamp was getting
Ok, I think this sounds good then! @felixcheung Let me know if you want me to take a look at the code as well; if not, feel free to merge when you think it's ready.

Great, this is a good catch, and thank you for fixing this @titicaca
…stamp column
## What changes were proposed in this pull request?
Fix a bug in the collect method when collecting a timestamp column. The bug can be reproduced with the following code:
```
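library(SparkR)
sparkR.session(master = "local")

# Local data.frame with a timestamp column that contains an NA
df <- data.frame(col1 = c(0, 1, 2),
                 col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))

# Full SparkDataFrame: col2 starts with a non-NA value
sdf1 <- createDataFrame(df)
print(dtypes(sdf1))
df1 <- collect(sdf1)
print(lapply(df1, class))

# After filtering, the NA row comes first in col2, which triggers the bug
sdf2 <- filter(sdf1, "col1 > 0")
print(dtypes(sdf2))
df2 <- collect(sdf2)
print(lapply(df2, class))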
```
As the printed output shows, col2 in df2 is unexpectedly converted to numeric when an NA appears at the top of the column.
This is caused by `do.call(c, list)`: if we convert a list such as `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.
Therefore, we need to cast the data type of the vector explicitly.
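The root cause can be reproduced in base R without a Spark session. The following is a minimal sketch; the variable names are illustrative, and the class is restored by hand only to show the effect of an explicit cast.

```r
ts <- as.POSIXct("2017-01-01 12:00:01", tz = "UTC")

# NA first: c() uses its default method and drops the POSIXct class.
naFirst <- do.call(c, list(NA, ts))
classBefore <- class(naFirst)
print(classBefore)

# Timestamp first: c() dispatches to the POSIXct method and keeps the class.
tsFirst <- do.call(c, list(ts, NA))
print(class(tsFirst))

# Setting the class explicitly restores the timestamp type.
class(naFirst) <- c("POSIXct", "POSIXt")
print(class(naFirst))
```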
## How was this patch tested?
The patch can be tested manually with the same code above.
Author: titicaca <[email protected]>
Closes #16689 from titicaca/sparkr-dev.
(cherry picked from commit bc0a0e6)
Signed-off-by: Felix Cheung <[email protected]>
@titicaca do you have a JIRA id on https://issues.apache.org? We would resolve the bug to you.

Yes. The JIRA id is SPARK-19342. This is my first commit to the Spark project. Thank you for the help and advice :)

@titicaca he means, what is your user ID on JIRA? So we can credit you. It's clear what the JIRA is.

Thanks for the reminder. I may have forgotten to mention that I am the reporter of this JIRA bug. My JIRA ID is also titicaca. Thank you!

Great, done!