Conversation

@MaxGekk (Member) commented Oct 1, 2019

What changes were proposed in this pull request?

In the PR, I propose to specify the save mode explicitly while writing to the noop datasource in benchmarks. I set the Overwrite mode in the following benchmarks (a minimal sketch of the write pattern follows the list):

  • JsonBenchmark
  • CSVBenchmark
  • UDFBenchmark
  • MakeDateTimeBenchmark
  • ExtractBenchmark
  • DateTimeBenchmark
  • NestedSchemaPruningBenchmark
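
For illustration, a minimal sketch of the write pattern with an explicit save mode, assuming an existing SparkSession named spark; the DataFrame here is illustrative, not the exact benchmark code:

import org.apache.spark.sql.SaveMode

// Explicitly set the save mode so the noop DSv2 sink accepts the write;
// without .mode(...), DataFrameWriter defaults to SaveMode.ErrorIfExists.
spark.range(100 * 1000 * 1000)
  .write
  .mode(SaveMode.Overwrite)
  .format("noop")
  .save()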

Why are the changes needed?

Otherwise, writing to noop fails with:

[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.;
[error] 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284)

This is most likely due to #25876.
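
For reference, any write to noop that leaves the mode at its default reproduces this; a minimal hypothetical example, assuming an existing SparkSession named spark:

// The default mode is SaveMode.ErrorIfExists, which the noop DSv2 source
// rejects with the AnalysisException quoted above.
spark.range(10).write.format("noop").save()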

Does this PR introduce any user-facing change?

No

How was this patch tested?

I generated the results of ExtractBenchmark via the command:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"

@MaxGekk (Member, Author) commented Oct 1, 2019

@cloud-fan @brkyvz @rdblue Please take a look at the PR.

@srowen (Member) left a comment

Seems OK to me, but I wonder why noop fails in any mode; you could argue the mode is just irrelevant. In particular, the 'output' never 'exists' for noop, conceptually.

@MaxGekk (Member, Author) commented Oct 1, 2019

I agree that the save mode doesn't matter for noop, but the default mode

private var mode: SaveMode = SaveMode.ErrorIfExists

currently falls into the error case of the mode match in DataFrameWriter.save:

mode match {
  case SaveMode.Append =>
    runCommand(df.sparkSession, "save") {
      AppendData.byName(relation, df.logicalPlan, extraOptions.toMap)
    }

  case SaveMode.Overwrite if table.supportsAny(TRUNCATE, OVERWRITE_BY_FILTER) =>
    // truncate the table
    runCommand(df.sparkSession, "save") {
      OverwriteByExpression.byName(
        relation, df.logicalPlan, Literal(true), extraOptions.toMap)
    }

  case other =>
    throw new AnalysisException(s"TableProvider implementation $source cannot be " +
      s"written with $other mode, please use Append or Overwrite " +
      "modes instead.")
}

Maybe it should be implemented in a more generic way.
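
A purely hypothetical sketch of one "more generic" direction: letting the default mode behave like Append for sources where existence checks are meaningless. ACCEPT_ANY_SAVE_MODE is an invented name for illustration, not a real TableCapability:

// Hypothetical extra case for the match above. ACCEPT_ANY_SAVE_MODE does not
// exist in Spark; it stands for "this source (e.g. noop) has no notion of
// existing data, so the default ErrorIfExists mode is safe to treat as Append".
case SaveMode.ErrorIfExists if table.supportsAny(ACCEPT_ANY_SAVE_MODE) =>
  runCommand(df.sparkSession, "save") {
    AppendData.byName(relation, df.logicalPlan, extraOptions.toMap)
  }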

@SparkQA commented Oct 1, 2019

Test build #111643 has finished for PR 25988 at commit ec104fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:
I'm not sure about this. In my view, we need to fix the root cause inside DSv2.

@dongjoon-hyun (Member) commented:
@brkyvz and @cloud-fan. Is this change intentional?

@cloud-fan (Contributor) commented:
We are going to support all save modes in DataFrameWriter. For now, this change is OK with me to unblock the benchmark changes.

@dongjoon-hyun (Member) commented:
Thanks. Then I'll merge this; it will unblock the benchmarks.

@dongjoon-hyun (Member) commented:
Merged to master.

@dongjoon-hyun (Member) commented:
Thank you, @MaxGekk, @cloud-fan, @srowen.

@rdblue (Contributor) commented Oct 2, 2019

Sorry, I'm just now getting to look at this. @cloud-fan, any idea why v2 was used here for these sources? Built-in v2 implementations should be disabled, so it is concerning that they are being used inadvertently, right?

@cloud-fan (Contributor) commented:
We only disable file source v2, but the noop source is not a file source.
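
For context, a hedged illustration: file sources can be kept on their v1 paths via a SQL conf (assuming the Spark 3.0-era name spark.sql.sources.useV1SourceList; check your version), while noop is not a file source and always takes the DSv2 write path:

// Assumption: conf name from around Spark 3.0; noop never appears on this
// list, so writes to noop always go through the DataSource V2 code path.
spark.conf.get("spark.sql.sources.useV1SourceList")
// e.g. "avro,csv,json,kafka,orc,parquet,text"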

@MaxGekk deleted the noop-overwrite-mode branch on October 5, 2019.