[SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path parameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start() #29543
Conversation
Test build #127898 has finished for PR 29543 at commit

retest this please

cc @cloud-fan
```scala
if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
    extraOptions.contains("path") && path.nonEmpty) {
  throw new AnalysisException("There is a 'path' option set and save() is called with a path " +
    "parameter. Either remove the path option, or call save() without the parameter.")
```
shall we also mention the legacy config in the error message?
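For reference, the hint that ended up in the merged message looks like this sketch (wording taken from the follow-up's example output later in this thread):

```scala
throw new AnalysisException("There is a 'path' option set and save() is called with a path " +
  "parameter. Either remove the path option, or call save() without the parameter. " +
  "To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.")
```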
cloud-fan left a comment
LGTM except one comment.
Test build #127910 has finished for PR 29543 at commit

retest this please

Test build #127916 has finished for PR 29543 at commit

Test build #127932 has finished for PR 29543 at commit

thanks, merging to master!
```scala
 */
def save(path: String): Unit = {
  if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
      extraOptions.contains("path") && path.nonEmpty) {
```
The path here is a String; do we really need to check `path.nonEmpty`?
Good catch. I think we need to remove the check. The output is more confusing since the `path` option is also set:

```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
```

I will create a PR for this.
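A sketch of what removing the check would look like (the actual change landed in the follow-up PR):

```scala
// Drop the path.nonEmpty guard so an empty path string hits the same
// AnalysisException instead of Hadoop's "empty string" error.
if (!df.sparkSession.sessionState.conf.legacyPathOptionBehavior &&
    extraOptions.contains("path")) {
  throw new AnalysisException("There is a 'path' option set and save() is called with a path " +
    "parameter. Either remove the path option, or call save() without the parameter.")
}
```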
Created #29697
```scala
spark.readStream
  .format("org.apache.spark.sql.streaming.test")
  .option("path", "tmp1")
  .load("tmp2")
```
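For context, a test asserting the new behavior here might look like this sketch (assumed shape, not the exact suite code):

```scala
val e = intercept[AnalysisException] {
  spark.readStream
    .format("org.apache.spark.sql.streaming.test")
    .option("path", "tmp1")
    .load("tmp2")
}
assert(e.getMessage.contains("There is a 'path' option set"))
```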
This could be flaky. The tmp2 directory could be non-empty and contain illegal data.
This is using `org.apache.spark.sql.streaming.test` as a format, which has a no-op implementation:

spark/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala, lines 58 to 59 in f7995c5:

```scala
/** Dummy provider: returns no-op source/sink and records options in [[LastOptions]]. */
class DefaultSource extends StreamSourceProvider with StreamSinkProvider {
```
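To make the "no-op" point concrete, a minimal provider of this shape could look like the following sketch (hypothetical `NoOpSource`; the real `DefaultSource` also records options in `LastOptions`):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Sink, Source}
import org.apache.spark.sql.sources.{StreamSinkProvider, StreamSourceProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StructType

// Hypothetical no-op provider: it never reads or writes the 'path' it is
// handed, which is why non-existent directories are safe in these tests.
class NoOpSource extends StreamSourceProvider with StreamSinkProvider {
  private val fakeSchema = new StructType().add("a", "int")

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    ("noop", schema.getOrElse(fakeSchema))

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = new Source {
    override def schema: StructType = fakeSchema
    override def getOffset: Option[Offset] = None
    override def getBatch(start: Option[Offset], end: Offset): DataFrame =
      sqlContext.emptyDataFrame
    override def stop(): Unit = {}
  }

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new Sink {
    override def addBatch(batchId: Long, data: DataFrame): Unit = {}
  }
}
```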
This data source is used throughout this test suite with non-existent dirs:

spark/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala, lines 227 to 231 in f7995c5:

```scala
test("stream paths") {
  val df = spark.readStream
    .format("org.apache.spark.sql.streaming.test")
    .option("checkpointLocation", newMetadataDir)
    .load("/test")
```
Am I missing something? Thanks!
You are right, I think this is not an issue here.
…is empty for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()

### What changes were proposed in this pull request?

This PR is a follow-up to #29543 (comment), which correctly points out that the check for the empty string is not necessary.

### Why are the changes needed?

The unnecessary check could actually cause more confusion. For example,

```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
```

even when the `path` option is available. This PR fixes this confusion.

### Does this PR introduce _any_ user-facing change?

Yes, the above example now prints a consistent exception message whether the path parameter value is empty or not.

```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:290)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:856)
  ... 47 elided
```

### How was this patch tested?

Added unit tests.

Closes #29697 from imback82/SPARK-32516-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This is a follow-up PR to #29328 to apply the same constraint, where the `path` option cannot coexist with the path parameter, to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

### Why are the changes needed?

The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` or `DataStreamWriter.start()`. For example,
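a call like the following sketch (a minimal reconstruction; the paths are illustrative)

```scala
Seq(1).toDF.write
  .option("path", "/tmp/path1")
  .parquet("/tmp/path2")
```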
will write the result to `/tmp/path2`.

### Does this PR introduce _any_ user-facing change?
Yes, if the `path` option coexists with a path parameter to any of the above methods, it will throw an `AnalysisException`:
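The message looks like this (as quoted from the merged follow-up above):

```
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```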
The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?
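With the flag on, the pre-3.1 behavior applies and the path parameter silently wins, as in this sketch:

```scala
// Sketch: restore the legacy behavior, then mix the option and the parameter.
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", "true")
Seq(1).toDF.write
  .option("path", "/tmp/path1")
  .parquet("/tmp/path2") // writes to /tmp/path2; the 'path' option is ignored
```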
Added new tests.