[SPARK-25669][SQL] Check CSV header only when it exists #22656

MaxGekk · 2018-10-06T13:09:02Z

What changes were proposed in this pull request?

Currently, the first row of a dataset of CSV strings is compared to the field names of the user-specified or inferred schema regardless of whether a CSV header is present. This causes false-positive error messages. For example, parsing "1,2" produces the error:

java.lang.IllegalArgumentException: CSV header does not conform to the schema.
 Header: 1, 2
 Schema: _c0, _c1
Expected: _c0 but found: 1

In the PR, I propose:

- Checking the CSV header only when it exists
- Filtering the header from the input dataset only if it exists
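
The two points above can be sketched without Spark as follows. This is only an illustration of the intended behaviour, not the code in `DataFrameReader`: the object `HeaderCheckSketch`, its methods, and the naive comma splitting are all hypothetical names and simplifications introduced here.

```scala
// Hypothetical, Spark-free sketch of the proposed behaviour (not the actual patch).
object HeaderCheckSketch {

  /** Naive comma split, no quoting or escaping support; for illustration only. */
  def columns(line: String): Seq[String] =
    line.split(",", -1).map(_.trim).toSeq

  /**
   * Returns the data lines of the input.
   * The first line is checked against the schema field names and dropped
   * only when `header` is true and the input is non-empty.
   * Otherwise every line, including the first, is treated as data.
   */
  def linesWithoutHeader(
      lines: Seq[String],
      schemaFieldNames: Seq[String],
      header: Boolean): Seq[String] = {
    if (header && lines.nonEmpty) {
      val found = columns(lines.head)
      // `require` throws IllegalArgumentException on a mismatch.
      require(
        found == schemaFieldNames,
        "CSV header does not conform to the schema. " +
          s"Header: ${found.mkString(", ")} Schema: ${schemaFieldNames.mkString(", ")}")
      lines.tail
    } else {
      lines
    }
  }
}
```

With `header = false`, the input `Seq("1,2")` passes through unchanged and no comparison against `_c0, _c1` takes place, which is the case that previously raised the false-positive error.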

How was this patch tested?

Added a test to CSVSuite which reproduces the issue.

SparkQA · 2018-10-06T14:50:42Z

Test build #97050 has finished for PR 22656 at commit 676e558.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-10-06T15:28:16Z

jenkins, retest this, please

SparkQA · 2018-10-06T19:11:51Z

Test build #97053 has finished for PR 22656 at commit 676e558.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-10-08T22:02:17Z

@HyukjinKwon Could you look at the PR, please.

HyukjinKwon · 2018-10-09T05:50:30Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

      StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))

-    val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
+    val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) {


LGTM, but it really needs some refactoring. Let me give it a shot.

## What changes were proposed in this pull request?

Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error:

```java
java.lang.IllegalArgumentException: CSV header does not conform to the schema.
 Header: 1, 2
 Schema: _c0, _c1
Expected: _c0 but found: 1
```

In the PR, I propose:

- Checking CSV header only when it exists
- Filter header from the input dataset only if it exists

## How was this patch tested?

Added a test to `CSVSuite` which reproduces the issue.

Closes #22656 from MaxGekk/inferred-header-check.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit 46fe408)
Signed-off-by: hyukjinkwon <[email protected]>

HyukjinKwon · 2018-10-09T08:39:07Z

Merged to master and branch-2.4.

## What changes were proposed in this pull request?

1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
   - Historically and at the first place of refactoring (which I did), I intended to put all CSV specific handling (like options), filtering, extracting header, etc.
   - See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we might better match them as possible as we can.
2. Create `CSVHeaderChecker` and put `enforceSchema` logics into that.
   - The checking header and column pruning stuff were added (per apache#20894 and apache#21296) but some of codes such as apache#22123 are duplicated
   - Also, checking header code is basically here and there. We better put them in a single place, which was quite error-prone. See (apache#22656).
3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
   - Similar reasons above with 1.

## How was this patch tested?

Existing tests should cover this.

Closes apache#22676 from HyukjinKwon/refactoring-csv.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

Don't need to check inferred field names to the first row

676e558

HyukjinKwon reviewed Oct 9, 2018

asfgit closed this in 46fe408 Oct 9, 2018

HyukjinKwon mentioned this pull request Oct 9, 2018

[SPARK-25684][SQL] Organize header related codes in CSV datasource #22676

Closed

MaxGekk deleted the inferred-header-check branch August 17, 2019 13:35
