-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error #22123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error #22123
Conversation
…iredSchema not dataSchema
|
Test build #94854 has finished for PR 22123 at commit
|
| } | ||
|
|
||
| test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") { | ||
| withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think false case test can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok will remove
|
Test build #94875 has finished for PR 22123 at commit
|
failure seems unrelated to this pullreq |
|
cc @MaxGekk |
|
May I ask you check the spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala Lines 303 to 307 in a8a1ac0
|
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, it would be nice add a couple tests
| .option("header", true) | ||
| .option("enforceSchema", false) | ||
| .load(dir) | ||
| .select("columnA"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check a corner case when required Schema is empty. For example, .option("enforceSchema", false) + count().
|
Test build #94935 has finished for PR 22123 at commit
|
| // Note: if there are only comments in the first block, the header would probably | ||
| // be not extracted. | ||
| CSVUtils.extractHeader(lines, parser.options).foreach { header => | ||
| CSVDataSource.checkHeader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove CSVDataSource.checkHeader and switch to CSVDataSource.checkHeaderColumnNames? Looks CSVDataSource.checkHeader is an overkill and makes hard to read the code.
| dataSchema, | ||
| CSVDataSource.checkHeaderColumnNames( | ||
| if (columnPruning) requiredSchema else dataSchema, | ||
| parser.tokenizer.parseLine(header), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the following code style is preferred.
val schema = if (columnPruning) requiredSchema else dataSchema
val columnNames = parser.tokenizer.parseLine(header)
CSVDataSource.checkHeaderColumnNames(
schema,
columnNames,
...
| .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema"))) | ||
| } | ||
|
|
||
| test("SPARK-25134: check header on parsing of dataset with projection and column pruning") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need a test case for checking enforceSchema works well when column pruning is on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems enforceSchema always kind of "works" because it simply means it ignores the headers.
what do we expect to verify in the test?
|
Test build #94959 has finished for PR 22123 at commit
|
|
Test build #94965 has finished for PR 22123 at commit
|
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Merged to master. |
## What changes were proposed in this pull request?
1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
- Historically and at the first place of refactoring (which I did), I intended to put all CSV specific handling (like options), filtering, extracting header, etc.
- See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we might better match them as possible as we can.
2. Create `CSVHeaderChecker` and put `enforceSchema` logics into that.
- The checking header and column pruning stuff were added (per apache#20894 and apache#21296) but some of codes such as apache#22123 are duplicated
- Also, checking header code is basically here and there. We better put them in a single place, which was quite error-prone. See (apache#22656).
3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
- Similar reasons above with 1.
## How was this patch tested?
Existing tests should cover this.
Closes apache#22676 from HyukjinKwon/refactoring-csv.
Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request?
1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
- Historically and at the first place of refactoring (which I did), I intended to put all CSV specific handling (like options), filtering, extracting header, etc.
- See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we might better match them as possible as we can.
2. Create `CSVHeaderChecker` and put `enforceSchema` logics into that.
- The checking header and column pruning stuff were added (per apache#20894 and apache#21296) but some of codes such as apache#22123 are duplicated
- Also, checking header code is basically here and there. We better put them in a single place, which was quite error-prone. See (apache#22656).
3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
- Similar reasons above with 1.
## How was this patch tested?
Existing tests should cover this.
Closes apache#22676 from HyukjinKwon/refactoring-csv.
Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?
When column pruning is turned on the checking of headers in the csv should only be for the fields in the requiredSchema, not the dataSchema, because column pruning means only requiredSchema is read.
How was this patch tested?
Added 2 unit tests where column pruning is turned on/off and csv headers are checked againt schema
Please review http://spark.apache.org/contributing.html before opening a pull request.