[SPARK-19783][SQL] Treat longer lengths of tokens as malformed records in CSV parser #17136
Conversation
Oh, @maropu, I have been looking into R's read.csv with the data below:

> read.csv("test.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :

and, with another data set:

> read.csv("test.csv")
  a b c  d  e d.1 d.2
1 a b c NA NA  NA  NA

So, IMHO, we might better follow R's behaviour. I am actually a bit worried about the behaviour change. Another concern is that it seems we should produce columnNameOfCorruptRecord.

Thanks for your comment! Aha, I see. Yes, the shorter lengths of tokens are somewhat arguable, but I think we need to treat longer lengths of tokens as malformed because dropping tokens leads to loss of information. Thoughts?
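To make the loss-of-information concern concrete, here is a minimal sketch of the length check being discussed. It is not the actual UnivocityParser code; RowResult, Ok, Malformed, and checkTokens are hypothetical names, and padding shorter rows with nulls is just one possible treatment (mirroring R's NA-filling), since the shorter case is still under debate in this thread.

```scala
// Hypothetical sketch: classify a parsed CSV row against the schema width.
// Longer rows are flagged as malformed (dropping tokens would lose data);
// shorter rows are padded with nulls, similar to R's read.csv NA-filling.
sealed trait RowResult
case class Ok(tokens: Seq[String]) extends RowResult
case class Malformed(raw: Seq[String]) extends RowResult

def checkTokens(tokens: Seq[String], schemaWidth: Int): RowResult = {
  if (tokens.length > schemaWidth) {
    Malformed(tokens)                                            // extra tokens: malformed
  } else if (tokens.length < schemaWidth) {
    Ok(tokens ++ Seq.fill(schemaWidth - tokens.length)(null))    // pad with nulls
  } else {
    Ok(tokens)
  }
}
```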
Test build #73763 has finished for PR 17136 at commit
Test build #73768 has finished for PR 17136 at commit
Jenkins, retest this please.
Test build #73816 has finished for PR 17136 at commit
Test build #73836 has finished for PR 17136 at commit
Is there not a way to support a variable number of values (and commas) in a CSV row?
Yeah, I think it'd be better to handle the longer cases in this PR, and then discuss the shorter cases in more depth in another PR.
We might need a way to do that after we clean up and define the behaviour of the parse modes.
yea, I agree.
I think this is a behaviour regression that could affect users. If we are going to add a parse-mode flag, that flag should default to the backward-compatible behaviour.
I think that dropping the extra tokens in the longer case is incorrect behaviour, judging from the JSON behaviour. But I know this change could affect current users, so we might need to do something for that, e.g., adding a new option to keep the current behaviour. WDYT? cc: @HyukjinKwon
@maropu, I left some opinions on the code. I think we should produce columnNameOfCorruptRecord during schema inference, as we do in the JSON datasource, if we are going to treat those tokens as malformed in PERMISSIVE mode.
Another thought: the behaviour should at least resemble R's read.csv in terms of malformed records (let's de-duplicate the effort of judging the right behaviour). So, it seems only the longer ones are considered malformed? FWIW, I am okay with it if it follows R's behaviour and if the change is noted in the release notes.
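To illustrate what producing a corrupt-record column in PERMISSIVE mode could look like, here is a hedged sketch. ParsedRow, parsePermissive, and the two-column schema are all hypothetical, not Spark's actual implementation: a malformed (too-long) line keeps null data columns and carries its raw text in the corrupt-record column, as the JSON datasource does.

```scala
// Hypothetical sketch of PERMISSIVE-mode handling for a two-column schema (a, b):
// a line with more tokens than the schema keeps its raw text in corruptRecord.
case class ParsedRow(a: Option[String], b: Option[String], corruptRecord: Option[String])

def parsePermissive(line: String, schemaWidth: Int): ParsedRow = {
  val tokens = line.split(",", -1).toSeq
  if (tokens.length > schemaWidth) {
    ParsedRow(None, None, Some(line))              // malformed: raw line preserved
  } else {
    val padded = tokens ++ Seq.fill(schemaWidth - tokens.length)("")
    ParsedRow(Some(padded(0)), Some(padded(1)), None)
  }
}
```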
Could we make this

else {
  if (...) {
    ...
  } else {
    ...
  }
}

to

else if (...) {
  ...
} else {
  ...
}
Oh, sorry, my latest commit seems to have crossed with your review. This issue is fixed in the latest commit.
Maybe SPARK-19783 :).
fixed!
@HyukjinKwon Thanks for your comment! Yeah, I agree with your opinion; we'd better treat the longer ones as malformed and make the behaviour for the shorter ones the same as R's.
Test build #73929 has finished for PR 17136 at commit
Test build #73931 has finished for PR 17136 at commit
Hi,
@alexz00 Thanks for your comment. Since the longer case is not very arguable, I think we could fix it in this PR. However, the shorter case is quite debatable, so IMHO we need to collect others' opinions and discuss it further in a follow-up JIRA ticket. Anyway, this decision is up to the committers.
Test build #74541 has finished for PR 17136 at commit
This reverts commit d88a96684f50ad8674d0f1c6ad4a5f68faf271b4.
Revert "[SPARK-19783][SQL] Treat longer lengths of tokens as malformed records in CSV parser": This reverts commit aa290ee32ef09d6d018f261c3bccb85d08259ac5.
Test build #74924 has finished for PR 17136 at commit
This PR has gone stale because of the refactoring in #17315, so I'll close it for now. Thanks!
What changes were proposed in this pull request?
If the number of tokens in a record does not match the expected number of columns in the schema, we need to treat the record as malformed. This PR modifies the code to handle such records as malformed.
This is a TODO task: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239
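As a rough illustration of how the three CSV parse modes could treat a row with more tokens than the schema (a sketch with hypothetical names, not the code in this PR; Spark's actual FAILFAST error type and PERMISSIVE corrupt-record handling differ in detail):

```scala
// Hypothetical mode-dependent handling of a longer-than-schema row.
def handleLongerRow(tokens: Seq[String], schemaWidth: Int, mode: String): Option[Seq[String]] = {
  require(tokens.length > schemaWidth)
  mode match {
    case "FAILFAST" =>
      // fail immediately on the malformed record
      throw new RuntimeException(s"Malformed line: ${tokens.mkString(",")}")
    case "DROPMALFORMED" =>
      None                                 // silently drop the record
    case _ =>
      // PERMISSIVE: data columns become null (the raw text would go to
      // the corrupt-record column, omitted in this sketch)
      Some(Seq.fill(schemaWidth)(null))
  }
}
```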
How was this patch tested?
Modified some existing tests and added new ones in CSVSuite.