[SPARK-32516][SQL] 'path' option cannot coexist with load()'s path parameters #29328
Conversation
I also see that … Should I just fix this inconsistency instead of this PR? If so, what should be the expected behavior (just ignore …)?

cc: @cloud-fan @dongjoon-hyun @viirya (Thanks in advance!)
Test build #126945 has finished for PR 29328 at commit
FYI, the existing test to … We expect the same behavior between these two APIs, right?
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
LGTM except one question: do we still need this fix? https://github.com/apache/spark/pull/29328/files#r468330253
Yes, without it, the following tests fail: … Btw, this is an existing bug. For example, for …
@cloud-fan, should we also make the behavior the same on the …?
Can we have a separate PR for this bug? Then we can backport the bug fix.
This is a good point. We never documented the overwrite behavior anywhere, and people may not realize the path option is overwritten by the …
docs/sql-migration-guide.md
- In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.
- In Spark 3.1, when loading a dataframe, the `path` option cannot coexist with `load()`'s path parameters. For example, `spark.read.format("csv").option("path", "/tmp").load("/tmp2")` or `spark.read.option("path", "/tmp").csv("/tmp2")` will throw `org.apache.spark.sql.AnalysisException`. In Spark version 3.0 and below, the `path` option is overwritten if one path parameter is passed to `load()`, or the `path` option is added to the overall paths if multiple path parameters are passed to `load()`.
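To make the migration entry concrete, here is a hedged sketch of the before/after behavior in spark-shell (the paths are placeholders, and a Spark 3.1 session is assumed):

```
// Spark 3.1: a 'path' option together with a load() path parameter fails fast.
spark.read.format("csv").option("path", "/tmp").load("/tmp2")
// => org.apache.spark.sql.AnalysisException

// Spark 3.0 and below: the option was silently overwritten by the parameter,
// so the read above only touched /tmp2.
```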
Should paths be mentioned too?
I mentioned paths as well.
    if ((extraOptions.contains("path") || extraOptions.contains("paths")) && paths.nonEmpty) {
      throw new AnalysisException("There is a 'path' or 'paths' option set and load() is called " +
        "with path parameters. Either remove the option or put it into the load() parameters.")
    }
Either remove the option or not put it into the load() parameters?
I followed the suggestion from #29328 (comment), but let me know if the current message is still vague. Thanks!
To make it clearer: "Either remove the path option if it's the same as the path parameter, or add it to the load() parameter if you do want to read multiple paths."
      paths = paths,
      className = classOf[TextFileFormat].getName,
    - options = options.parameters
    + options = options.parameters - "path"
Can you add some comments here at least? I feel it will confuse code readers later on as to why it is removed...
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
Sorry, I forgot about one thing: we need a legacy config to restore the previous behavior, in case users don't want to change their queries when upgrading Spark. @imback82 can you add it?
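For reference, such legacy flags are usually declared in `SQLConf`. A minimal sketch, assuming the flag name that later appears in the error message (`spark.sql.legacy.pathOptionBehavior.enabled`); the doc text and placement here are illustrative, not the merged code:

```
val LEGACY_PATH_OPTION_BEHAVIOR =
  buildConf("spark.sql.legacy.pathOptionBehavior.enabled")
    .internal()
    .doc("When true, the 'path' option may coexist with (and be overwritten by) " +
      "path parameters passed to load()/save(), restoring the pre-3.1 behavior.")
    .version("3.1.0")
    .booleanConf
    .createWithDefault(false)
```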
      // force invocation of `load(...varargs...)`
    - option("path", path).load(Seq.empty: _*)
    + if (sparkSession.sessionState.conf.legacyPathOptionBehavior) {
    +   option("path", path).load(Seq.empty: _*)
This is needed to restore the legacy behavior. Note that the legacy behavior is different for the following:
- `spark.read.option("path", path).parquet(path)` => path is read twice
- `spark.read.format("parquet").option("path", path).load(path)` => path is read once (option is overwritten)
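The two legacy code paths can be sketched as follows (a hedged illustration, assuming a `SparkSession` named `spark`, a parquet directory at `path`, and the legacy flag enabled):

```
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", "true")

// Shorthand API: the "path" option and the parquet(path) parameter are both kept,
// so the same files are scanned twice and rows are duplicated.
val twice = spark.read.option("path", path).parquet(path)

// Generic API: load(path) overwrites the "path" option,
// so the files are scanned only once.
val once = spark.read.format("parquet").option("path", path).load(path)
```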
Thanks, merging to master!
@imback82 please open a PR to do the same check for …
Yes, will do. Thanks for the review!
…arameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()

### What changes were proposed in this pull request?

This is a follow-up PR to #29328 to apply the same constraint, where the `path` option cannot coexist with a path parameter, to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

### Why are the changes needed?

The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` or `DataStreamWriter.start()`. For example,

```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```

will write the result to `/tmp/path2`.

### Does this PR introduce _any_ user-facing change?

Yes, if the `path` option coexists with a path parameter to any of the above methods, it will throw `AnalysisException`:

```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added new tests.

Closes #29543 from imback82/path_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…der/Writer should not fail

### What changes were proposed in this pull request?

This is a followup of #29328. In #29328, we forbid the use case where the path option and a path parameter are both specified. However, it breaks some use cases:

```
val dfr = spark.read.format(...).option(...)
dfr.load(path1).xxx
dfr.load(path2).xxx
```

The reason is that `load` has side effects: it sets the path option on the `DataFrameReader` instance. The next time you call `load`, Spark fails because both the path option and a path parameter are specified. This PR removes the side effect of `save`/`load`/`start` so that they no longer set the path option.

### Why are the changes needed?

Recover some use cases.

### Does this PR introduce _any_ user-facing change?

Yes, some use cases that fail before this PR run successfully after it.

### How was this patch tested?

New tests.

Closes #29723 from cloud-fan/df.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

This PR proposes to make the behavior consistent for the `path` option when loading dataframes with a single path (e.g., `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing the `path` option to coexist with `load`'s path parameters.

### Why are the changes needed?

The current behavior is inconsistent: …

### Does this PR introduce _any_ user-facing change?

Yes, now if the `path` option is specified along with `load`'s path parameters, it would fail: …

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added a test.
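A test along these lines could look like the following sketch (it assumes a `SharedSparkSession`-style suite with the `withTempPath`/`withSQLConf` helpers and ScalaTest's `intercept`; this is illustrative, not the PR's actual test):

```
test("SPARK-32516: 'path' option cannot coexist with load()'s path parameters") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq(1).toDF.write.parquet(path)

    // Specifying both the option and the parameter should now fail fast.
    val e = intercept[AnalysisException] {
      spark.read.format("parquet").option("path", path).load(path)
    }
    assert(e.getMessage.contains("path"))

    // The legacy flag restores the old overwrite behavior.
    withSQLConf("spark.sql.legacy.pathOptionBehavior.enabled" -> "true") {
      assert(spark.read.format("parquet").option("path", path).load(path).count() == 1)
    }
  }
}
```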