
Conversation

@koertkuipers
Contributor

What changes were proposed in this pull request?

When column pruning is turned on, the checking of headers in the CSV should cover only the fields in the requiredSchema, not the dataSchema, because column pruning means only the requiredSchema is read.
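The described change can be sketched in plain Scala as follows (a simplified, illustrative model only, not the actual Spark code; names mirror the fields discussed in this PR):

```scala
// Illustrative model of the fix: under column pruning only the fields
// that are actually read (requiredSchema) are validated against the
// CSV header; without pruning the full dataSchema is used.
object HeaderCheck {
  def schemaToCheck(
      columnPruning: Boolean,
      requiredSchema: Seq[String],
      dataSchema: Seq[String]): Seq[String] =
    if (columnPruning) requiredSchema else dataSchema
}
```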

How was this patch tested?

Added two unit tests where column pruning is turned on/off and CSV headers are checked against the schema.


@SparkQA

SparkQA commented Aug 16, 2018

Test build #94854 has finished for PR 22123 at commit c4179a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") {
withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {
Member

I think false case test can be removed.

Contributor Author

ok will remove

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94875 has finished for PR 22123 at commit 09c986c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koertkuipers
Contributor Author

Test Result (1 failure / +1)
    org.apache.spark.sql.streaming.FlatMapGroupsWithStateSuite.flatMapGroupsWithState - streaming with processing time timeout - state format version 1

The failure seems unrelated to this pull request.

@gatorsmile
Member

cc @MaxGekk

@MaxGekk
Member

MaxGekk commented Aug 18, 2018

May I ask you to additionally check the multiLine mode, since we use different methods of the uniVocity parser. When multiLine is disabled, the parseLine method is used, but in multiLine mode:

if (shouldDropHeader) {
  val firstRecord = tokenizer.parseNext()
  checkHeader(firstRecord)
}
tokenizer.parseNext()
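The multiLine path quoted above can be modeled in plain Scala like this (illustrative only; checkHeader stands in for the real header validation in CSVDataSource):

```scala
// Sketch of the multiLine header path: the first record is parsed and
// checked before being dropped, and the remaining records are returned.
def dropHeader(
    records: Iterator[Array[String]],
    checkHeader: Array[String] => Unit,
    shouldDropHeader: Boolean): Iterator[Array[String]] = {
  if (shouldDropHeader && records.hasNext) {
    checkHeader(records.next())
  }
  records
}
```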

Member

@MaxGekk MaxGekk left a comment

LGTM, it would be nice to add a couple of tests.

.option("header", true)
.option("enforceSchema", false)
.load(dir)
.select("columnA"),
Member

Could you check a corner case when requiredSchema is empty? For example, .option("enforceSchema", false) + count().
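The corner case can be modeled in plain Scala (a simplified positional comparison, not Spark's actual check): with count(), no columns are required, so under pruning the schema to validate against is empty and the check should be a no-op rather than rejecting the header.

```scala
// Illustrative: an empty required schema (e.g. from count()) should
// trivially conform, instead of flagging every header column as extra.
def headerConforms(schema: Seq[String], header: Seq[String]): Boolean =
  schema.isEmpty || schema == header.take(schema.length)
```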

@SparkQA

SparkQA commented Aug 19, 2018

Test build #94935 has finished for PR 22123 at commit cd18ed2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Note: if there are only comments in the first block, the header would probably
// not be extracted.
CSVUtils.extractHeader(lines, parser.options).foreach { header =>
CSVDataSource.checkHeader(
Member

@HyukjinKwon HyukjinKwon Aug 20, 2018

Can we remove CSVDataSource.checkHeader and switch to CSVDataSource.checkHeaderColumnNames? CSVDataSource.checkHeader looks like overkill and makes the code hard to read.

dataSchema,
CSVDataSource.checkHeaderColumnNames(
if (columnPruning) requiredSchema else dataSchema,
parser.tokenizer.parseLine(header),
Member

@gatorsmile gatorsmile Aug 20, 2018

Nit: the following code style is preferred.

val schema = if (columnPruning) requiredSchema else dataSchema
val columnNames = parser.tokenizer.parseLine(header)
CSVDataSource.checkHeaderColumnNames(
  schema,
  columnNames,
  ...

.exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
}

test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
Member

Also need a test case for checking enforceSchema works well when column pruning is on.

Contributor Author

It seems enforceSchema always "works" in a sense, because it simply means the headers are ignored.
What do we expect to verify in the test?

@SparkQA

SparkQA commented Aug 20, 2018

Test build #94959 has finished for PR 22123 at commit f2eb1df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 20, 2018

Test build #94965 has finished for PR 22123 at commit 667db3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in b461acb Aug 21, 2018
Halo9Pan pushed a commit to Halo9Pan/dive-spark that referenced this pull request Oct 12, 2018
## What changes were proposed in this pull request?

1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).

    - Historically, in the first round of refactoring (which I did), I intended to put all CSV-specific handling (like options, filtering, extracting the header, etc.) there.

    - See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we had better match them as much as we can.

2. Create `CSVHeaderChecker` and put the `enforceSchema` logic into it.

    - The header checking and column pruning code was added (per apache#20894 and apache#21296), but some of it, such as apache#22123, is duplicated.

    - Also, the header-checking code is scattered here and there, which was quite error-prone. We had better put it in a single place. See (apache#22656).

3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).

    - Similar reasons as in 1. above.

## How was this patch tested?

Existing tests should cover this.
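The extraction described above can be sketched roughly as follows (heavily simplified and assumed from this description only; the real CSVHeaderChecker also carries CSVOptions, a logger, and the source name):

```scala
// Rough shape of consolidating the header checks in one place:
// enforceSchema = true only warns on a mismatch, while false fails.
class CSVHeaderChecker(schema: Seq[String], enforceSchema: Boolean) {
  def checkHeaderColumnNames(columnNames: Seq[String]): Unit = {
    if (columnNames.map(_.trim) != schema) {
      val msg = s"CSV header does not conform to the schema: ${columnNames.mkString(", ")}"
      if (enforceSchema) println(s"WARN: $msg") // the real code logs a warning
      else throw new IllegalArgumentException(msg)
    }
  }
}
```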

Closes apache#22676 from HyukjinKwon/refactoring-csv.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019