-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25669][SQL] Check CSV header only when it exists #22656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #97050 has finished for PR 22656 at commit
|
|
jenkins, retest this, please |
|
Test build #97053 has finished for PR 22656 at commit
|
|
@HyukjinKwon Could you look at the PR, please. |
| StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord)) | ||
|
|
||
| val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine => | ||
| val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but it really needs some refactoring. Let me give a shot
## What changes were proposed in this pull request? Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error: ```java java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: 1, 2 Schema: _c0, _c1 Expected: _c0 but found: 1 ``` In the PR, I propose: - Checking CSV header only when it exists - Filter header from the input dataset only if it exists ## How was this patch tested? Added a test to `CSVSuite` which reproduces the issue. Closes #22656 from MaxGekk/inferred-header-check. Authored-by: Maxim Gekk <[email protected]> Signed-off-by: hyukjinkwon <[email protected]> (cherry picked from commit 46fe408) Signed-off-by: hyukjinkwon <[email protected]>
|
Merged to master and branch-2.4. |
## What changes were proposed in this pull request?
1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
- Historically and at the first place of refactoring (which I did), I intended to put all CSV specific handling (like options), filtering, extracting header, etc.
- See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we might better match them as possible as we can.
2. Create `CSVHeaderChecker` and put `enforceSchema` logics into that.
- The checking header and column pruning stuff were added (per apache#20894 and apache#21296) but some of codes such as apache#22123 are duplicated
- Also, checking header code is basically here and there. We better put them in a single place, which was quite error-prone. See (apache#22656).
3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
- Similar reasons above with 1.
## How was this patch tested?
Existing tests should cover this.
Closes apache#22676 from HyukjinKwon/refactoring-csv.
Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request? Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error: ```java java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: 1, 2 Schema: _c0, _c1 Expected: _c0 but found: 1 ``` In the PR, I propose: - Checking CSV header only when it exists - Filter header from the input dataset only if it exists ## How was this patch tested? Added a test to `CSVSuite` which reproduces the issue. Closes apache#22656 from MaxGekk/inferred-header-check. Authored-by: Maxim Gekk <[email protected]> Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request?
1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
- Historically and at the first place of refactoring (which I did), I intended to put all CSV specific handling (like options), filtering, extracting header, etc.
- See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we might better match them as possible as we can.
2. Create `CSVHeaderChecker` and put `enforceSchema` logics into that.
- The checking header and column pruning stuff were added (per apache#20894 and apache#21296) but some of codes such as apache#22123 are duplicated
- Also, checking header code is basically here and there. We better put them in a single place, which was quite error-prone. See (apache#22656).
3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
- Similar reasons above with 1.
## How was this patch tested?
Existing tests should cover this.
Closes apache#22676 from HyukjinKwon/refactoring-csv.
Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?
Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing
"1,2"outputs the error:In the PR, I propose:
How was this patch tested?
Added a test to
CSVSuitewhich reproduces the issue.