
Conversation

@imback82 (Contributor) commented Aug 25, 2020

What changes were proposed in this pull request?

This is a follow-up PR to #29328 to apply the same constraint (the `path` option cannot coexist with the path parameter) to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

Why are the changes needed?

The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` or `DataStreamWriter.start()`.

For example,

```scala
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```

will write the result to /tmp/path2.
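
The streaming entry points have the same shape of problem. For instance, a sketch using the built-in rate source (paths are illustrative):

```scala
// Pre-patch behavior sketch: start("/tmp/path2") silently wins over the "path" option.
val query = spark.readStream.format("rate").load()
  .writeStream
  .format("parquet")
  .option("path", "/tmp/path1")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start("/tmp/path2")  // sink files land under /tmp/path2, not /tmp/path1
```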

Does this PR introduce any user-facing change?

Yes. If the `path` option coexists with the path parameter in any of the above methods, an `AnalysisException` is thrown:

```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
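
For instance (assuming the legacy flag can be set as a runtime SQL conf, as legacy flags typically can):

```scala
// Opt back into the old behavior: the path parameter silently overrides the option again.
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", "true")
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")  // writes to /tmp/path2
```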

How was this patch tested?

Added new tests.

@SparkQA commented Aug 26, 2020

Test build #127898 has finished for PR 29543 at commit f3b3c98.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82 (Contributor, Author):

retest this please

@HyukjinKwon (Member):

cc @cloud-fan

```scala
if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
    extraOptions.contains("path") && path.nonEmpty) {
  throw new AnalysisException("There is a 'path' option set and save() is called with a path " +
    "parameter. Either remove the path option, or call save() without the parameter.")
```

@cloud-fan (Contributor) commented on this diff:

shall we also mention the legacy config in the error message?
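
The merged version does include it; the thrown message ends up roughly as follows (a sketch matching the exception quoted in the PR description):

```scala
throw new AnalysisException(
  "There is a 'path' option set and save() is called with a path parameter. " +
    "Either remove the path option, or call save() without the parameter. " +
    "To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.")
```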

@cloud-fan (Contributor) left a review:

LGTM except one comment.

@SparkQA commented Aug 26, 2020

Test build #127910 has finished for PR 29543 at commit f3b3c98.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Aug 26, 2020

Test build #127916 has finished for PR 29543 at commit f3b3c98.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 26, 2020

Test build #127932 has finished for PR 29543 at commit 261b609.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan closed this in baaa756 on Aug 27, 2020
```scala
def save(path: String): Unit = {
  if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
      extraOptions.contains("path") && path.nonEmpty) {
```

A contributor commented on this diff:

The `path` here is a `String`; do we really need to check `path.nonEmpty`?

@imback82 (Contributor, Author) replied:

Good catch. I think we need to remove the check. The output is even more confusing, since the `path` option is also set:

```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
```

I will create a PR for this.

@imback82 (Contributor, Author):

Created #29697
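
For context, #29697 drops the emptiness check, so the guard becomes roughly the following (a sketch reusing the names from the snippet above):

```scala
// Without `path.nonEmpty`, an empty path argument now raises the AnalysisException
// instead of falling through to Hadoop's "Can not create a Path from an empty string".
if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
    extraOptions.contains("path")) {
  throw new AnalysisException("There is a 'path' option set and save() is called with a path " +
    "parameter. Either remove the path option, or call save() without the parameter.")
}
```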

```scala
spark.readStream
  .format("org.apache.spark.sql.streaming.test")
  .option("path", "tmp1")
  .load("tmp2")
```

A member commented on this diff:

This could be flaky. The tmp2 directory could be non-empty and contain illegal data.

@imback82 (Contributor, Author) replied:

This is using `org.apache.spark.sql.streaming.test` as the format, which has a no-op implementation:

```scala
/** Dummy provider: returns no-op source/sink and records options in [[LastOptions]]. */
class DefaultSource extends StreamSourceProvider with StreamSinkProvider {
```
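
For reference, a no-op provider of this shape can look roughly like the following (a hypothetical sketch, not the actual test helper; the real one also records options in `LastOptions`):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Sink, Source}
import org.apache.spark.sql.sources.{StreamSinkProvider, StreamSourceProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StructType

class DefaultSource extends StreamSourceProvider with StreamSinkProvider {
  private val fakeSchema = new StructType().add("a", "int")

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (providerName, schema.getOrElse(fakeSchema))

  // The source never reports an offset, so no path is ever read.
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = new Source {
    override def schema: StructType = fakeSchema
    override def getOffset: Option[Offset] = None
    override def getBatch(start: Option[Offset], end: Offset): DataFrame =
      sqlContext.emptyDataFrame
    override def stop(): Unit = ()
  }

  // The sink discards every batch, so nothing is ever written.
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new Sink {
    override def addBatch(batchId: Long, data: DataFrame): Unit = ()
  }
}
```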

This datasource is being used throughout this test suite with non-existent dirs:

test("stream paths") {
val df = spark.readStream
.format("org.apache.spark.sql.streaming.test")
.option("checkpointLocation", newMetadataDir)
.load("/test")

Am I missing something? Thanks!

A contributor replied:

You are right, I think this is not an issue here.

@cloud-fan pushed a commit that referenced this pull request on Sep 10, 2020:
…is empty for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()

### What changes were proposed in this pull request?

This PR is a follow-up to #29543 (comment), which correctly points out that the check for the empty string is not necessary.

### Why are the changes needed?

The unnecessary check can actually cause more confusion.

For example,
```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
```
even when the `path` option is set. This PR fixes this confusion.

### Does this PR introduce _any_ user-facing change?

Yes, the above example now throws a consistent exception whether or not the path parameter value is empty.
```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:290)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:856)
  ... 47 elided
```

### How was this patch tested?

Added unit tests.

Closes #29697 from imback82/SPARK-32516-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ueshin added a commit to databricks/koalas that referenced this pull request on Nov 13, 2020:

Spark doesn't allow a duplicated `path` option when calling `DataFrameWriter.save()` since Spark 3.1. (apache/spark#29543)