Commit 7fb9f68

zero323 authored and HyukjinKwon committed
[SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName
### What changes were proposed in this pull request?

Add optional `allowMissingColumns` argument to SparkR `unionByName`.

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

`unionByName` supports `allowMissingColumns`.

### How was this patch tested?

Existing unit tests. New unit tests targeting this feature.

Closes apache#29813 from zero323/SPARK-32799.

Authored-by: zero323 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent f893a19 commit 7fb9f68
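
To make the documented semantics concrete, here is a minimal plain-Python sketch of what `allowMissingColumns` means, based only on the docstrings in this commit. It is not Spark's implementation: the function `union_by_name`, its parameters, and the list-of-dicts row representation are hypothetical names chosen for illustration, and `None` stands in for null.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def union_by_name(
    left_cols: List[str],
    left_rows: List[Row],
    right_cols: List[str],
    right_rows: List[Row],
    allow_missing_columns: bool = False,
) -> Tuple[List[str], List[Row]]:
    """Illustrative model of a by-name union (hypothetical helper, not a Spark API)."""
    # Without the flag, both sides must expose the same set of column names.
    if not allow_missing_columns and set(left_cols) != set(right_cols):
        raise ValueError("column names differ; pass allow_missing_columns=True")

    # The left side's columns keep their order; columns that the left side
    # is missing (present only on the right) are appended at the end.
    out_cols = list(left_cols) + [c for c in right_cols if c not in left_cols]

    # Columns are matched by name, and any column a row lacks becomes None.
    # Duplicate rows are kept, as in the documented behavior.
    out_rows = [{c: row.get(c) for c in out_cols} for row in left_rows + right_rows]
    return out_cols, out_rows


cols, rows = union_by_name(
    ["col0", "col1", "col2"], [{"col0": 1, "col1": 2, "col2": 3}],
    ["col1", "col2", "col3"], [{"col1": 4, "col2": 5, "col3": 6}],
    allow_missing_columns=True,
)
print(cols)
print(rows)
```

The inputs mirror the `df1`/`df2` example in the PySpark docstring touched by this commit: `col3` is missing from the left side, so it lands last in the result, and each side is padded with `None` for the column it lacks.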

5 files changed: +34 -12 lines changed

R/pkg/R/DataFrame.R

+12 -2
@@ -2863,11 +2863,18 @@ setMethod("unionAll",
 #' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
 #' into account. Input SparkDataFrames can have different data types in the schema.
 #'
+#' When the parameter allowMissingColumns is `TRUE`, the set of column names
+#' in x and y can differ; missing columns will be filled as null.
+#' Further, the missing columns of x will be added at the end
+#' in the schema of the union result.
+#'
 #' Note: This does not remove duplicate rows across the two SparkDataFrames.
 #' This function resolves columns by name (not by position).
 #'
 #' @param x A SparkDataFrame
 #' @param y A SparkDataFrame
+#' @param allowMissingColumns logical
+#' @param ... further arguments to be passed to or from other methods.
 #' @return A SparkDataFrame containing the result of the union.
 #' @family SparkDataFrame functions
 #' @rdname unionByName
@@ -2880,12 +2887,15 @@ setMethod("unionAll",
 #' df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
 #' df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
 #' head(unionByName(df1, df2))
+#'
+#' df3 <- select(createDataFrame(mtcars), "carb")
+#' head(unionByName(df1, df3, allowMissingColumns = TRUE))
 #' }
 #' @note unionByName since 2.3.0
 setMethod("unionByName",
           signature(x = "SparkDataFrame", y = "SparkDataFrame"),
-          function(x, y) {
-            unioned <- callJMethod(x@sdf, "unionByName", y@sdf)
+          function(x, y, allowMissingColumns=FALSE) {
+            unioned <- callJMethod(x@sdf, "unionByName", y@sdf, allowMissingColumns)
             dataFrame(unioned)
           })

R/pkg/R/generics.R

+1 -1
@@ -638,7 +638,7 @@ setGeneric("union", function(x, y) { standardGeneric("union") })
 setGeneric("unionAll", function(x, y) { standardGeneric("unionAll") })

 #' @rdname unionByName
-setGeneric("unionByName", function(x, y) { standardGeneric("unionByName") })
+setGeneric("unionByName", function(x, y, ...) { standardGeneric("unionByName") })

 #' @rdname unpersist
 setGeneric("unpersist", function(x, ...) { standardGeneric("unpersist") })

R/pkg/tests/fulltests/test_sparkSQL.R

+13
@@ -2696,6 +2696,19 @@ test_that("union(), unionByName(), rbind(), except(), and intersect() on a DataF
   expect_error(rbind(df, df2, df3),
                "Names of input data frames are different.")

+
+  df4 <- unionByName(df2, select(df2, "age"), TRUE)
+
+  expect_equal(
+    sum(collect(
+      select(df4, alias(isNull(df4$name), "missing_name")
+    ))$missing_name),
+    3
+  )
+
+  testthat::expect_error(unionByName(df2, select(df2, "age"), FALSE))
+  testthat::expect_error(unionByName(df2, select(df2, "age")))
+
   excepted <- arrange(except(df, df2), desc(df$age))
   expect_is(unioned, "SparkDataFrame")
   expect_equal(count(excepted), 2)

python/pyspark/sql/dataframe.py

+4 -5
@@ -1569,11 +1569,10 @@ def unionByName(self, other, allowMissingColumns=False):
         |   6|   4|   5|
         +----+----+----+

-        When the parameter `allowMissingColumns` is ``True``,
-        this function allows different set of column names between two :class:`DataFrame`\\s.
-        Missing columns at each side, will be filled with null values.
-        The missing columns at left :class:`DataFrame` will be added at the end in the schema
-        of the union result:
+        When the parameter `allowMissingColumns` is ``True``, the set of column names
+        in this and other :class:`DataFrame` can differ; missing columns will be filled with null.
+        Further, the missing columns of this :class:`DataFrame` will be added at the end
+        in the schema of the union result:

         >>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
         >>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+4 -4
@@ -2038,10 +2038,10 @@ class Dataset[T] private[sql](
    * The difference between this function and [[union]] is that this function
    * resolves columns by name (not by position).
    *
-   * When the parameter `allowMissingColumns` is true, this function allows different set
-   * of column names between two Datasets. Missing columns at each side, will be filled with
-   * null values. The missing columns at left Dataset will be added at the end in the schema
-   * of the union result:
+   * When the parameter `allowMissingColumns` is `true`, the set of column names
+   * in this and other `Dataset` can differ; missing columns will be filled with null.
+   * Further, the missing columns of this `Dataset` will be added at the end
+   * in the schema of the union result:
    *
    * {{{
    *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
