[SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error #22123

koertkuipers · 2018-08-16T15:58:25Z

What changes were proposed in this pull request?

When column pruning is turned on, the CSV header check should only cover the fields in requiredSchema, not dataSchema, because column pruning means only requiredSchema is read.
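To illustrate the mismatch, here is a minimal, Spark-free sketch. It assumes that with pruning on the tokenizer returns only the selected columns, and that every required field exists in the data schema. `HeaderCheckSketch`, `selectTokens`, `headerMismatch` and `HeaderCheckExample` are hypothetical names for illustration, not Spark APIs, and the messages are simplified.

```scala
object HeaderCheckSketch {
  // Hypothetical model of a pruning tokenizer: keeps only the header tokens
  // at the positions of the required fields within the data schema.
  def selectTokens(
      header: Seq[String],
      dataFields: Seq[String],
      requiredFields: Seq[String]): Seq[String] =
    requiredFields.map(f => header(dataFields.indexOf(f)))

  // Simplified, case-sensitive header check: returns a description of the
  // first mismatch, or None when the tokens line up with the field names.
  def headerMismatch(fieldNames: Seq[String], tokens: Seq[String]): Option[String] =
    if (fieldNames.length != tokens.length) {
      Some(s"header has ${tokens.length} columns, schema has ${fieldNames.length} fields")
    } else {
      fieldNames.zip(tokens).collectFirst {
        case (field, token) if field != token => s"expected $field but found $token"
      }
    }
}

object HeaderCheckExample {
  val dataFields = Seq("columnA", "columnB")
  val requiredFields = Seq("columnA") // e.g. after .select("columnA")
  val header = Seq("columnA", "columnB") // header line of a valid file

  val tokens = HeaderCheckSketch.selectTokens(header, dataFields, requiredFields)

  // Checking the pruned tokens against the full data schema reports a mismatch
  // even though the file is valid.
  val againstDataSchema = HeaderCheckSketch.headerMismatch(dataFields, tokens)

  // Checking them against the required schema reports nothing.
  val againstRequiredSchema = HeaderCheckSketch.headerMismatch(requiredFields, tokens)
}
```

In this model the check against `dataFields` fails on the column count alone, which is the kind of incorrect error the title refers to.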

How was this patch tested?

Added 2 unit tests where column pruning is turned on/off and CSV headers are checked against the schema.

SparkQA · 2018-08-16T19:30:10Z

Test build #94854 has finished for PR 22123 at commit c4179a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-08-17T02:06:47Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

+  }
+
+  test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {


I think the test for the false case can be removed.

ok will remove

SparkQA · 2018-08-17T07:05:01Z

Test build #94875 has finished for PR 22123 at commit 09c986c.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

koertkuipers · 2018-08-17T17:19:30Z

Test Result (1 failure / +1)
    org.apache.spark.sql.streaming.FlatMapGroupsWithStateSuite.flatMapGroupsWithState - streaming with processing time timeout - state format version 1

failure seems unrelated to this pullreq

gatorsmile · 2018-08-18T16:30:47Z

cc @MaxGekk

MaxGekk · 2018-08-18T18:01:13Z

Could you also check the multiLine mode, since we use different methods of the uniVocity parser there? When multiLine is disabled, the parseLine method is used, but in multiLine mode:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

Lines 303 to 307 in a8a1ac0

    
           if (shouldDropHeader) { 
        
             val firstRecord = tokenizer.parseNext() 
        
             checkHeader(firstRecord) 
        
           } 
        
           tokenizer.parseNext()

MaxGekk

LGTM, it would be nice to add a couple of tests.

MaxGekk · 2018-08-18T17:35:23Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

+          .option("header", true)
+          .option("enforceSchema", false)
+          .load(dir)
+          .select("columnA"),


Could you check the corner case where requiredSchema is empty? For example, .option("enforceSchema", false) + count().

SparkQA · 2018-08-19T17:19:08Z

Test build #94935 has finished for PR 22123 at commit cd18ed2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-08-20T03:08:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

      // Note: if there are only comments in the first block, the header would probably
      // be not extracted.
      CSVUtils.extractHeader(lines, parser.options).foreach { header =>
        CSVDataSource.checkHeader(


Can we remove CSVDataSource.checkHeader and switch to CSVDataSource.checkHeaderColumnNames? CSVDataSource.checkHeader looks like overkill and makes the code hard to read.

gatorsmile · 2018-08-20T14:32:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

-          dataSchema,
+        CSVDataSource.checkHeaderColumnNames(
+          if (columnPruning) requiredSchema else dataSchema,
+          parser.tokenizer.parseLine(header),


Nit: the following code style is preferred.

    val schema = if (columnPruning) requiredSchema else dataSchema
    val columnNames = parser.tokenizer.parseLine(header)
    CSVDataSource.checkHeaderColumnNames(
      schema,
      columnNames,
      ...

gatorsmile · 2018-08-20T14:43:04Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

      .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
  }

+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {


We also need a test case that checks enforceSchema works when column pruning is on.

it seems enforceSchema always kind of "works" because it simply means it ignores the headers.
what do we expect to verify in the test?
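For context, the existing test quoted above looks for a logged "CSV header does not conform to the schema" message, which suggests that with enforceSchema enabled a header mismatch is reported as a warning rather than failing the read. A minimal sketch of that split, under that assumption; `EnforceSchemaSketch` and `reportMismatch` are hypothetical names, and the sketch returns the warnings as a value instead of logging them:

```scala
object EnforceSchemaSketch {
  // Hypothetical model of the outcome of a header check:
  // Right(warnings) when the read proceeds, Left(error) when it fails.
  def reportMismatch(
      mismatch: Option[String],
      enforceSchema: Boolean): Either[IllegalArgumentException, Seq[String]] =
    mismatch match {
      // Header matches the schema: nothing to report in either mode.
      case None => Right(Seq.empty)
      // enforceSchema on: the schema wins, the mismatch is only a warning.
      case Some(msg) if enforceSchema => Right(Seq(msg))
      // enforceSchema off: the mismatch is an error.
      case Some(msg) => Left(new IllegalArgumentException(msg))
    }
}
```

Under this model, a test with enforceSchema and column pruning both on could assert that a valid file produces no such warning.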

SparkQA · 2018-08-20T18:12:51Z

Test build #94959 has finished for PR 22123 at commit f2eb1df.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-20T20:41:04Z

Test build #94965 has finished for PR 22123 at commit 667db3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

LGTM

HyukjinKwon · 2018-08-21T02:23:39Z

Merged to master.

## What changes were proposed in this pull request?

1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
    - Historically and at the first place of refactoring (which I did), I intended to put all CSV specific handling (like options), filtering, extracting header, etc.
    - See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we might better match them as possible as we can.
2. Create `CSVHeaderChecker` and put `enforceSchema` logics into that.
    - The checking header and column pruning stuff were added (per apache#20894 and apache#21296) but some of codes such as apache#22123 are duplicated
    - Also, checking header code is basically here and there. We better put them in a single place, which was quite error-prone. See (apache#22656).
3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
    - Similar reasons above with 1.

## How was this patch tested?

Existing tests should cover this.

Closes apache#22676 from HyukjinKwon/refactoring-csv.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

koertkuipers added 2 commits August 16, 2018 11:35

if csv column-pruning is turned on header should be checked with requiredSchema not dataSchema

dcd9ac4

update jira reference in unit test

c4179a9

HyukjinKwon reviewed Aug 17, 2018

remove test for check header and projection with column pruning disabled

09c986c

MaxGekk reviewed Aug 18, 2018

also check multiLine codepath and selecting empty schema with count

cd18ed2

HyukjinKwon reviewed Aug 20, 2018

remove checkHeader and use checkHeaderColumnNames directly

f2eb1df

gatorsmile reviewed Aug 20, 2018

use style convention of explicit vals

667db3c

HyukjinKwon approved these changes Aug 21, 2018

asfgit closed this in b461acb Aug 21, 2018

HyukjinKwon mentioned this pull request Oct 9, 2018

[SPARK-25684][SQL] Organize header related codes in CSV datasource #22676

Closed
