
Conversation

@imback82 (Contributor) commented Aug 25, 2020

What changes were proposed in this pull request?

This is a follow-up PR to #29328 to apply the same constraint (the `path` option cannot coexist with the path parameter) to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

Why are the changes needed?

The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` or `DataStreamWriter.start()`.

For example,

```scala
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```

will write the result to /tmp/path2.
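
The streaming entry points have the same shape of problem. For instance, a sketch using the built-in rate source (paths are illustrative):

```scala
// Pre-patch behavior sketch: start("/tmp/path2") silently wins over the "path" option.
val query = spark.readStream.format("rate").load()
  .writeStream
  .format("parquet")
  .option("path", "/tmp/path1")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start("/tmp/path2")  // sink files land under /tmp/path2, not /tmp/path1
```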

Does this PR introduce any user-facing change?

Yes. If the `path` option coexists with the path parameter in any of the above methods, an `AnalysisException` is thrown:

```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
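
For instance (assuming the legacy flag can be set as a runtime SQL conf, as legacy flags typically can):

```scala
// Opt back into the old behavior: the path parameter silently overrides the option again.
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", "true")
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")  // writes to /tmp/path2
```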

How was this patch tested?

Added new tests.

@SparkQA commented Aug 26, 2020

Test build #127898 has finished for PR 29543 at commit f3b3c98.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82 (Contributor, Author):

retest this please

@HyukjinKwon (Member):

cc @cloud-fan

```scala
if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
    extraOptions.contains("path") && path.nonEmpty) {
  throw new AnalysisException("There is a 'path' option set and save() is called with a path " +
    "parameter. Either remove the path option, or call save() without the parameter.")
```

@cloud-fan (Contributor) commented on this diff:

shall we also mention the legacy config in the error message?
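
The merged version does include it; the thrown message ends up roughly as follows (a sketch matching the exception quoted in the PR description):

```scala
throw new AnalysisException(
  "There is a 'path' option set and save() is called with a path parameter. " +
    "Either remove the path option, or call save() without the parameter. " +
    "To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.")
```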

@cloud-fan (Contributor) left a review:

LGTM except one comment.

@SparkQA commented Aug 26, 2020

Test build #127910 has finished for PR 29543 at commit f3b3c98.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Aug 26, 2020

Test build #127916 has finished for PR 29543 at commit f3b3c98.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 26, 2020

Test build #127932 has finished for PR 29543 at commit 261b609.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan closed this in baaa756 on Aug 27, 2020
```scala
def save(path: String): Unit = {
  if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
      extraOptions.contains("path") && path.nonEmpty) {
```

A contributor commented on this diff:

The `path` here is a `String`; do we really need to check `path.nonEmpty`?

@imback82 (Contributor, Author) replied:

Good catch. I think we need to remove the check. The output is even more confusing, since the `path` option is also set:

```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
```

I will create a PR for this.

@imback82 (Contributor, Author):

Created #29697
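
For context, #29697 drops the emptiness check, so the guard becomes roughly the following (a sketch reusing the names from the snippet above):

```scala
// Without `path.nonEmpty`, an empty path argument now raises the AnalysisException
// instead of falling through to Hadoop's "Can not create a Path from an empty string".
if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
    extraOptions.contains("path")) {
  throw new AnalysisException("There is a 'path' option set and save() is called with a path " +
    "parameter. Either remove the path option, or call save() without the parameter.")
}
```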

```scala
spark.readStream
  .format("org.apache.spark.sql.streaming.test")
  .option("path", "tmp1")
  .load("tmp2")
```

A member commented on this diff:

This could be flaky. The tmp2 directory could be non-empty and contain illegal data.

@imback82 (Contributor, Author) replied:

This is using `org.apache.spark.sql.streaming.test` as the format, which has a no-op implementation:

```scala
/** Dummy provider: returns no-op source/sink and records options in [[LastOptions]]. */
class DefaultSource extends StreamSourceProvider with StreamSinkProvider {
```
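
For reference, a no-op provider of this shape can look roughly like the following (a hypothetical sketch, not the actual test helper; the real one also records options in `LastOptions`):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Sink, Source}
import org.apache.spark.sql.sources.{StreamSinkProvider, StreamSourceProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StructType

class DefaultSource extends StreamSourceProvider with StreamSinkProvider {
  private val fakeSchema = new StructType().add("a", "int")

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (providerName, schema.getOrElse(fakeSchema))

  // The source never reports an offset, so no path is ever read.
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = new Source {
    override def schema: StructType = fakeSchema
    override def getOffset: Option[Offset] = None
    override def getBatch(start: Option[Offset], end: Offset): DataFrame =
      sqlContext.emptyDataFrame
    override def stop(): Unit = ()
  }

  // The sink discards every batch, so nothing is ever written.
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new Sink {
    override def addBatch(batchId: Long, data: DataFrame): Unit = ()
  }
}
```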

This datasource is being used throughout this test suite with non-existent dirs:

test("stream paths") {
val df = spark.readStream
.format("org.apache.spark.sql.streaming.test")
.option("checkpointLocation", newMetadataDir)
.load("/test")

Am I missing something? Thanks!

A contributor replied:

You are right, I think this is not an issue here.

@cloud-fan pushed a commit that referenced this pull request on Sep 10, 2020:
…is empty for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()

### What changes were proposed in this pull request?

This PR is a follow-up to #29543 (comment), which correctly points out that the check for the empty string is not necessary.

### Why are the changes needed?

The unnecessary check can actually cause more confusion.

For example,
```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
```
even when the `path` option is set. This PR fixes this confusion.

### Does this PR introduce _any_ user-facing change?

Yes, the above example now throws a consistent exception whether or not the path parameter value is empty.
```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:290)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:856)
  ... 47 elided
```

### How was this patch tested?

Added unit tests.

Closes #29697 from imback82/SPARK-32516-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ueshin added a commit to databricks/koalas that referenced this pull request on Nov 13, 2020:

Spark doesn't allow a duplicated `path` option when calling `DataFrameWriter.save()` since Spark 3.1. (apache/spark#29543)