Commit a848d55

HyukjinKwon authored and ueshin committed

[SPARK-21264][PYTHON] Call cross join path in join without 'on' and with 'how'
## What changes were proposed in this pull request?

Currently, PySpark throws an NPE when `on` is missing but a join type is specified in `join`, as below:

**Before**

```python
spark.conf.set("spark.sql.crossJoin.enabled", "false")
spark.range(1).join(spark.range(1), how="inner").show()
```

```
Traceback (most recent call last):
  ...
py4j.protocol.Py4JJavaError: An error occurred while calling o66.join.
: java.lang.NullPointerException
	at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	...
```

```python
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.range(1).join(spark.range(1), how="inner").show()
```

```
...
py4j.protocol.Py4JJavaError: An error occurred while calling o84.join.
: java.lang.NullPointerException
	at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	...
```

This PR makes PySpark follow Scala's behaviour, as below:

```scala
scala> spark.conf.set("spark.sql.crossJoin.enabled", "false")
scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
```

```
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Range (0, 1, step=1, splits=Some(8))
and
Range (0, 1, step=1, splits=Some(8))
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
...
```

```scala
scala> spark.conf.set("spark.sql.crossJoin.enabled", "true")
scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
```

```
+---+---+
| id| id|
+---+---+
|  0|  0|
+---+---+
```

**After**

```python
spark.conf.set("spark.sql.crossJoin.enabled", "false")
spark.range(1).join(spark.range(1), how="inner").show()
```

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: u'Detected cartesian product for INNER join between logical plans\nRange (0, 1, step=1, splits=Some(8))\nand\nRange (0, 1, step=1, splits=Some(8))\nJoin condition is missing or trivial.\nUse the CROSS JOIN syntax to allow cartesian products between these relations.;'
```

```python
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.range(1).join(spark.range(1), how="inner").show()
```

```
+---+---+
| id| id|
+---+---+
|  0|  0|
+---+---+
```

## How was this patch tested?

Added tests in `python/pyspark/sql/tests.py`.

Author: hyukjinkwon <[email protected]>

Closes #18484 from HyukjinKwon/SPARK-21264.
1 parent: 6657e00

File tree

2 files changed: +18 −0 lines

python/pyspark/sql/dataframe.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -833,6 +833,8 @@ def join(self, other, on=None, how=None):
         else:
             if how is None:
                 how = "inner"
+            if on is None:
+                on = self._jseq([])
             assert isinstance(how, basestring), "how should be basestring"
             jdf = self._jdf.join(other._jdf, on, how)
         return DataFrame(jdf, self.sql_ctx)
```
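The two added lines change which `Dataset.join` overload the Py4J call resolves to: a `None` join expression used to reach the `Column`-condition overload and blow up with an NPE, while an empty column sequence reaches the `Seq[String]` overload, where Catalyst handles the missing condition as a (possibly blocked) cross join. A minimal sketch of that dispatch in plain Python, with a hypothetical `ScalaDataset` stub standing in for the real JVM `Dataset` and a plain list standing in for `self._jseq([])`:

```python
class ScalaDataset:
    """Hypothetical stub mimicking the two Dataset.join overloads involved."""

    def join(self, other, on, how):
        if isinstance(on, list):
            # Seq[String] overload: an empty list means "no using-columns",
            # which the analyzer later treats as a cross join.
            return ("using-columns", tuple(on), how)
        if on is None:
            # Column-condition overload: a null condition caused the NPE.
            raise TypeError("NullPointerException: join condition is null")
        return ("condition", on, how)


def join(jdf, other, on=None, how=None):
    """Sketch of DataFrame.join's argument normalisation after the patch."""
    if how is None:
        how = "inner"
    if on is None:
        on = []  # the fix: route to the Seq[String] overload
    return jdf.join(other, on, how)
```

Before the patch, `on=None` flowed straight through to the condition overload; after it, the empty sequence makes the behaviour match Scala's `df.join(other, Seq.empty[String], "inner")`.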

python/pyspark/sql/tests.py

Lines changed: 16 additions & 0 deletions
```diff
@@ -2021,6 +2021,22 @@ def test_toDF_with_schema_string(self):
         self.assertEqual(df.schema.simpleString(), "struct<value:int>")
         self.assertEqual(df.collect(), [Row(key=i) for i in range(100)])
 
+    def test_join_without_on(self):
+        df1 = self.spark.range(1).toDF("a")
+        df2 = self.spark.range(1).toDF("b")
+
+        try:
+            self.spark.conf.set("spark.sql.crossJoin.enabled", "false")
+            self.assertRaises(AnalysisException, lambda: df1.join(df2, how="inner").collect())
+
+            self.spark.conf.set("spark.sql.crossJoin.enabled", "true")
+            actual = df1.join(df2, how="inner").collect()
+            expected = [Row(a=0, b=0)]
+            self.assertEqual(actual, expected)
+        finally:
+            # We should unset this. Otherwise, other tests are affected.
+            self.spark.conf.unset("spark.sql.crossJoin.enabled")
+
     # Regression test for invalid join methods when on is None, Spark-14761
     def test_invalid_join_method(self):
         df1 = self.spark.createDataFrame([("Alice", 5), ("Bob", 8)], ["name", "age"])
```
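The test's try/finally block exists so that `spark.sql.crossJoin.enabled` never leaks into later tests. The same hygiene can be packaged as a context manager; here is a sketch in plain Python, where `DictConf` is a toy stand-in (not the real `spark.conf` object) with the same `set`/`unset` surface:

```python
from contextlib import contextmanager


class DictConf:
    """Toy stand-in for spark.conf, for illustration only."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)

    def get(self, key, default=None):
        return self._values.get(key, default)


@contextmanager
def temp_conf(conf, key, value):
    """Set `key` for the duration of the block, then always unset it,
    mirroring the try/finally pattern used in the test above."""
    conf.set(key, value)
    try:
        yield
    finally:
        conf.unset(key)
```

The conf object is restored on exit even if the body raises, which is exactly why the original test wraps its assertions in try/finally.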
