
Conversation

@maropu (Member) commented Mar 2, 2017

What changes were proposed in this pull request?

If the number of tokens in a record does not match the number of fields in the schema, we need to treat the record as malformed. This PR modifies the code to handle such records as malformed.
This addresses a TODO in the code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239
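
The core idea, as a minimal self-contained sketch (the object and method names are illustrative, not the actual patch; the real parser surfaces the failure through the configured parse mode):

object TokenCountCheck {
  // Reject records whose token count does not match the schema, instead of
  // silently truncating extra tokens or null-padding missing ones.
  def checkTokenCount(tokens: Array[String], schemaSize: Int): Array[String] = {
    if (tokens.length == schemaSize) {
      tokens
    } else {
      // UnivocityParser would route this through its PERMISSIVE /
      // DROPMALFORMED / FAILFAST handling rather than throwing directly.
      throw new RuntimeException(
        s"Malformed record: expected $schemaSize tokens but got ${tokens.length}")
    }
  }

  def main(args: Array[String]): Unit = {
    println(checkTokenCount(Array("a", "b", "c"), 3).mkString(","))  // ok: a,b,c
    checkTokenCount(Array("a", "b", "c", "d", "e"), 3)               // throws
  }
}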

How was this patch tested?

Modified some existing tests and added new ones in CSVSuite.

@HyukjinKwon (Member) commented Mar 2, 2017

Oh, @maropu, I had been looking into R's read.csv before adding some comments on the JIRAs.

with the data below:

a,b,c
a,b,c,d,e,d,d
> read.csv("test.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :

with the data below:

a,b,c,d,e,d,d
a,b,c
> read.csv("test.csv")
  a b c  d  e d.1 d.2
1 a b c NA NA  NA  NA

So, IMHO, we might be better off following R's read.csv for now, though of course we should also take a look at other libraries.

I am actually a bit worried about the behaviour change, because PERMISSIVE has been the default mode since this was a third-party library (Spark 1.3+).

Another concern is, if we are going to treat those records as malformed in PERMISSIVE mode, it seems we should produce columnNameOfCorruptRecord during schema inference, as we do in the JSON datasource.
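
For context, this is roughly how the option behaves with the JSON reader today (a hedged sketch; the schema and file path are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().master("local").getOrCreate()

// In PERMISSIVE mode, a record that fails to parse becomes a row whose other
// columns are null and whose raw text lands in the corrupt-record column.
val schema = new StructType()
  .add("a", StringType)
  .add("b", StringType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/tmp/test.json")  // hypothetical path

df.show()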

@maropu (Member, Author) commented Mar 2, 2017

Thanks for your comment! Aha, I see. Yes, the shorter-token case is somewhat arguable, but I think we need to treat records with more tokens than the schema as malformed, because dropping the extra tokens loses information. Thoughts?
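
To illustrate the point with made-up data (not from the PR):

// With a 3-field schema, silently keeping only the first 3 tokens of a
// 5-token record drops "d" and "e" without any signal to the user.
val schemaSize = 3
val tokens = Array("a", "b", "c", "d", "e")
val kept = tokens.take(schemaSize)
println(kept.mkString(","))  // a,b,c -- "d" and "e" are lost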

@SparkQA commented Mar 2, 2017

Test build #73763 has finished for PR 17136 at commit aa290ee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 2, 2017

Test build #73768 has finished for PR 17136 at commit 5a01a9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Mar 3, 2017

Jenkins, retest this please.

@SparkQA commented Mar 3, 2017

Test build #73816 has finished for PR 17136 at commit 5a01a9d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 3, 2017

Test build #73836 has finished for PR 17136 at commit d88a966.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Is there not a way to support a variable number of values (and commas) in a CSV row?

@maropu (Member, Author) commented Mar 5, 2017

Yea, I think it'd be better to handle the longer case in this PR, and then discuss the shorter case further in another PR.

Member

We might need a way to support that after we clean up and define the behaviour of the parse modes.

Member Author

yea, I agree.

Member

I think this is a regression in behavior that could affect users. If we are going to consider a parse-mode flag, that flag should default to backward-compatible behavior.

Member Author

I actually think that dropping the extra tokens in the longer case is incorrect behaviour, judging by the JSON behaviour. But I know this change could affect current users, so we might need to do something about that, e.g., add a new option to keep the current behaviour. WDYT? cc: @HyukjinKwon

@HyukjinKwon (Member) left a review comment

@maropu, I left some opinions on the code. I think we should produce columnNameOfCorruptRecord during schema inference, as we do in the JSON datasource, if we are going to treat those records as malformed in PERMISSIVE mode.

Another thought: we should at least resemble R's read.csv behaviour for malformed records (let's de-duplicate the effort of judging the right behaviour). So it seems only the longer records should be considered malformed? FWIW, I am okay if it follows R's behaviour and if these changes are mentioned in the release notes.

Member

Could we make this

else {
  if (...) {
    ...
  } else {
    ...
  }
}

to

else if (...) {
  ...
} else {
  ...
}

Member Author

Oh, sorry, my latest commit seems to have crossed with your review. This issue is fixed in the latest commit.

Member

Maybe SPARK-19783 :).

Member Author

fixed!

@maropu maropu changed the title [SPARK-19783][SQL] Treat shorter/longer lengths of tokens as malformed records in CSV parser [SPARK-19783][SQL] Treat longer lengths of tokens as malformed records in CSV parser Mar 5, 2017
@maropu (Member, Author) commented Mar 5, 2017

@HyukjinKwon Thanks for your comment! Yea, I agree; we'd be better off treating the longer records as malformed and making the behaviour for shorter ones match R's.

@SparkQA commented Mar 5, 2017

Test build #73929 has finished for PR 17136 at commit 4ece983.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 5, 2017

Test build #73931 has finished for PR 17136 at commit 073a5cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@alexz00 commented Mar 14, 2017

Hi,
I have some concerns about not treating shorter records as malformed: this could lead to corrupt/inconsistent data, since there is no reason why a record's missing tokens cannot be 'in the middle' rather than at the end of the record.
I think that at least it would be useful to add an option to define a policy for this.
If you think it is better, I can open an issue for this enhancement.
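
A small sketch of that ambiguity, with illustrative field names and data:

// A short record cannot tell us WHICH field is missing.
val schema = Seq("name", "age", "city")
val tokens = Array("alice", "tokyo")  // 2 tokens for 3 fields

// End-padding with nulls silently mis-assigns values whenever the gap is
// in the middle of the record:
val padded = tokens.padTo(schema.length, null)
schema.zip(padded).foreach { case (f, v) => println(s"$f -> $v") }
// name -> alice
// age -> tokyo   <- wrong if "age" was the field actually missing
// city -> null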

@maropu (Member, Author) commented Mar 14, 2017

@alexz00 Thanks for your comment. Since the longer case is not very arguable, I think we can fix it in this PR. However, the shorter case is much more debatable, so IMHO we need to collect others' opinions and discuss it further in a follow-up JIRA ticket. Anyway, this decision is up to more qualified folks.

@SparkQA commented Mar 14, 2017

Test build #74541 has finished for PR 17136 at commit 8d83985.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 21, 2017

Test build #74924 has finished for PR 17136 at commit 3ff3d3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Mar 21, 2017

This PR has gone stale because of the refactoring in #17315, so I'll close it for now. Thanks!

@maropu maropu closed this Mar 21, 2017