Skip to content
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions R/pkg/R/DataFrame.R
Original file line number Diff line number Diff line change
Expand Up @@ -1191,6 +1191,9 @@ setMethod("collect",
vec <- do.call(c, col)
stopifnot(class(vec) != "list")
class(vec) <- PRIMITIVE_TYPES[[colType]]
if (stringsAsFactors && is.character(vec)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance maybe it is better to reverse the order of checks: is.character(vec) && stringsAsFactors

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thanks.

vec <- as.factor(vec)
}
df[[colIndex]] <- vec
} else {
df[[colIndex]] <- col
Expand Down
6 changes: 6 additions & 0 deletions R/pkg/tests/fulltests/test_sparkSQL.R
Original file line number Diff line number Diff line change
Expand Up @@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
})

test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please verify that factor orders are identical. I wonder if expect_equal really verifies order of values in a factor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> # Ordered vs unordered
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=TRUE)
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"), ordered=FALSE)
> expect_equal(or, or1)
error: `or` not equal to `or1`.
Attributes: < Componentclass: Lengths (2, 1) differ (string compare on first 1) >
Attributes: < Componentclass: 1 string mismatch >
> # level order mismatch
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Lo", "Med", "Hi"))
> expect_equal(or, or1)
error: `or` not equal to `or1`.
Attributes: < Componentlevels: 3 string mismatches >
# Data order mismatch
> or <- factor(c("Lo", "Hi", "Med", "Med", "Hi"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> expect_equal(or, or1)
error: `or` not equal to `or1`.
4 string mismatches
> or <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> or1 <- factor(c("Hi", "Med", "Med", "Hi", "Lo"), levels=c("Hi", "Lo", "Med"))
> expect_equal(or, or1)

Would this test address your concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: I think iris data frame all Species values are clustered together. That is why the test is passing (the new factor order ends up being identical to the existing order).

df <- suppressWarnings(collect(createDataFrame(iris), stringsAsFactors = TRUE))
expect_equal(class(iris$Species), class(df$Species))
expect_equal(iris$Species, df$Species)
})

test_that("SPARK-17811: can create DataFrame containing NA as date and time", {
df <- data.frame(
id = 1:2,
Expand Down