Conversation

@MaxGekk (Member) commented Oct 1, 2019

What changes were proposed in this pull request?

In the PR, I propose to specify the save mode explicitly while writing to the noop datasource in benchmarks. I set the Overwrite mode in the following benchmarks (a minimal sketch of the write pattern follows the list):

  • JsonBenchmark
  • CSVBenchmark
  • UDFBenchmark
  • MakeDateTimeBenchmark
  • ExtractBenchmark
  • DateTimeBenchmark
  • NestedSchemaPruningBenchmark
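
For illustration, a minimal sketch of the write pattern with an explicit save mode, assuming an existing SparkSession named spark; the DataFrame here is illustrative, not the exact benchmark code:

import org.apache.spark.sql.SaveMode

// Explicitly set the save mode so the noop DSv2 sink accepts the write;
// without .mode(...), DataFrameWriter defaults to SaveMode.ErrorIfExists.
spark.range(100 * 1000 * 1000)
  .write
  .mode(SaveMode.Overwrite)
  .format("noop")
  .save()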

Why are the changes needed?

Otherwise, writing to noop fails with:

[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.;
[error] 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284)

This is most likely due to #25876.
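
For reference, any write to noop that leaves the mode at its default reproduces this; a minimal hypothetical example, assuming an existing SparkSession named spark:

// The default mode is SaveMode.ErrorIfExists, which the noop DSv2 source
// rejects with the AnalysisException quoted above.
spark.range(10).write.format("noop").save()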

Does this PR introduce any user-facing change?

No

How was this patch tested?

I generated the results of ExtractBenchmark via the command:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"

@MaxGekk (Member, Author) commented Oct 1, 2019

@cloud-fan @brkyvz @rdblue Please take a look at the PR.

@srowen (Member) left a comment

Seems OK to me, but I wonder why noop fails in any mode; you could argue the mode is just irrelevant. In particular, the 'output' never 'exists' for noop, conceptually.

@MaxGekk (Member, Author) commented Oct 1, 2019

I agree that the save mode doesn't matter for noop, but the default mode

private var mode: SaveMode = SaveMode.ErrorIfExists

currently falls into the error case of the mode match in DataFrameWriter.save:

mode match {
  case SaveMode.Append =>
    runCommand(df.sparkSession, "save") {
      AppendData.byName(relation, df.logicalPlan, extraOptions.toMap)
    }

  case SaveMode.Overwrite if table.supportsAny(TRUNCATE, OVERWRITE_BY_FILTER) =>
    // truncate the table
    runCommand(df.sparkSession, "save") {
      OverwriteByExpression.byName(
        relation, df.logicalPlan, Literal(true), extraOptions.toMap)
    }

  case other =>
    throw new AnalysisException(s"TableProvider implementation $source cannot be " +
      s"written with $other mode, please use Append or Overwrite " +
      "modes instead.")
}

Maybe it should be implemented in a more generic way.
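
A purely hypothetical sketch of one "more generic" direction: letting the default mode behave like Append for sources where existence checks are meaningless. ACCEPT_ANY_SAVE_MODE is an invented name for illustration, not a real TableCapability:

// Hypothetical extra case for the match above. ACCEPT_ANY_SAVE_MODE does not
// exist in Spark; it stands for "this source (e.g. noop) has no notion of
// existing data, so the default ErrorIfExists mode is safe to treat as Append".
case SaveMode.ErrorIfExists if table.supportsAny(ACCEPT_ANY_SAVE_MODE) =>
  runCommand(df.sparkSession, "save") {
    AppendData.byName(relation, df.logicalPlan, extraOptions.toMap)
  }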

@SparkQA commented Oct 1, 2019

Test build #111643 has finished for PR 25988 at commit ec104fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:
I'm not sure about this. In my view, we need to fix the root cause inside DSv2.

@dongjoon-hyun (Member) commented:
@brkyvz and @cloud-fan. Is this change intentional?

@cloud-fan (Contributor) commented:
We are going to support all save modes in DataFrameWriter. For now, this change is OK with me to unblock the benchmark changes.

@dongjoon-hyun (Member) commented:
Thanks. Then I'll merge this; it will unblock the benchmarks.

@dongjoon-hyun (Member) commented:
Merged to master.

@dongjoon-hyun (Member) commented:
Thank you, @MaxGekk, @cloud-fan, @srowen.

@rdblue (Contributor) commented Oct 2, 2019

Sorry, I'm just now getting to look at this. @cloud-fan, any idea why v2 was used here for these sources? Built-in v2 implementations should be disabled, so it is concerning that they are being used inadvertently, right?

@cloud-fan (Contributor) commented:
We only disable file source v2, but the noop source is not a file source.
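
For context, a hedged illustration: file sources can be kept on their v1 paths via a SQL conf (assuming the Spark 3.0-era name spark.sql.sources.useV1SourceList; check your version), while noop is not a file source and always takes the DSv2 write path:

// Assumption: conf name from around Spark 3.0; noop never appears on this
// list, so writes to noop always go through the DataSource V2 code path.
spark.conf.get("spark.sql.sources.useV1SourceList")
// e.g. "avro,csv,json,kafka,orc,parquet,text"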

@MaxGekk deleted the noop-overwrite-mode branch on October 5, 2019.