
Conversation

@maropu (Member) commented Mar 2, 2017

What changes were proposed in this pull request?

If the number of tokens in a record does not match the number of fields in the schema, we need to treat the record as malformed. This PR modifies the code to handle such records as malformed.
This addresses a TODO in the code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239
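
The core idea, as a minimal self-contained sketch (the object and method names are illustrative, not the actual patch; the real parser surfaces the failure through the configured parse mode):

object TokenCountCheck {
  // Reject records whose token count does not match the schema, instead of
  // silently truncating extra tokens or null-padding missing ones.
  def checkTokenCount(tokens: Array[String], schemaSize: Int): Array[String] = {
    if (tokens.length == schemaSize) {
      tokens
    } else {
      // UnivocityParser would route this through its PERMISSIVE /
      // DROPMALFORMED / FAILFAST handling rather than throwing directly.
      throw new RuntimeException(
        s"Malformed record: expected $schemaSize tokens but got ${tokens.length}")
    }
  }

  def main(args: Array[String]): Unit = {
    println(checkTokenCount(Array("a", "b", "c"), 3).mkString(","))  // ok: a,b,c
    checkTokenCount(Array("a", "b", "c", "d", "e"), 3)               // throws
  }
}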

How was this patch tested?

Modified some existing tests and added new ones in CSVSuite.

@HyukjinKwon (Member) commented Mar 2, 2017

Oh, @maropu, I had been looking into R's read.csv before adding some comments on the JIRAs.

with the data below:

a,b,c
a,b,c,d,e,d,d
> read.csv("test.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :

with the data below:

a,b,c,d,e,d,d
a,b,c
> read.csv("test.csv")
  a b c  d  e d.1 d.2
1 a b c NA NA  NA  NA

So, IMHO, we might be better off following R's read.csv for now, though of course we should also take a look at other libraries.

I am actually a bit worried about the behaviour change, because PERMISSIVE has been the default mode since this was a third-party library (Spark 1.3+).

Another concern is, if we are going to treat those records as malformed in PERMISSIVE mode, it seems we should produce columnNameOfCorruptRecord during schema inference, as we do in the JSON datasource.
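
For context, this is roughly how the option behaves with the JSON reader today (a hedged sketch; the schema and file path are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().master("local").getOrCreate()

// In PERMISSIVE mode, a record that fails to parse becomes a row whose other
// columns are null and whose raw text lands in the corrupt-record column.
val schema = new StructType()
  .add("a", StringType)
  .add("b", StringType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/tmp/test.json")  // hypothetical path

df.show()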

@maropu (Member, Author) commented Mar 2, 2017

Thanks for your comment! Aha, I see. Yes, the shorter-token case is somewhat arguable, but I think we need to treat records with more tokens than the schema as malformed, because dropping the extra tokens loses information. Thoughts?
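
To illustrate the point with made-up data (not from the PR):

// With a 3-field schema, silently keeping only the first 3 tokens of a
// 5-token record drops "d" and "e" without any signal to the user.
val schemaSize = 3
val tokens = Array("a", "b", "c", "d", "e")
val kept = tokens.take(schemaSize)
println(kept.mkString(","))  // a,b,c -- "d" and "e" are lost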

@SparkQA commented Mar 2, 2017

Test build #73763 has finished for PR 17136 at commit aa290ee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 2, 2017

Test build #73768 has finished for PR 17136 at commit 5a01a9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Mar 3, 2017

Jenkins, retest this please.

@SparkQA commented Mar 3, 2017

Test build #73816 has finished for PR 17136 at commit 5a01a9d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 3, 2017

Test build #73836 has finished for PR 17136 at commit d88a966.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Is there not a way to support a variable number of values (and commas) in a CSV row?

@maropu (Member, Author) commented Mar 5, 2017

Yea, I think it'd be better to handle the longer case in this PR, and then discuss the shorter case further in another PR.

Member

We might need a way to support that after we clean up and define the behaviour of the parse modes.

Member Author

yea, I agree.

Member

I think this is a regression in behavior that could affect users. If we are going to consider a parse-mode flag, that flag should default to backward-compatible behavior.

Member Author

I actually think that dropping the extra tokens in the longer case is incorrect behaviour, judging by the JSON behaviour. But I know this change could affect current users, so we might need to do something about that, e.g., add a new option to keep the current behaviour. WDYT? cc: @HyukjinKwon

@HyukjinKwon (Member) left a review comment

@maropu, I left some opinions on the code. I think we should produce columnNameOfCorruptRecord during schema inference, as we do in the JSON datasource, if we are going to treat those records as malformed in PERMISSIVE mode.

Another thought: we should at least resemble R's read.csv behaviour for malformed records (let's de-duplicate the effort of judging the right behaviour). So it seems only the longer records should be considered malformed? FWIW, I am okay if it follows R's behaviour and if these changes are mentioned in the release notes.

Member

Could we make this

else {
  if (...) {
    ...
  } else {
    ...
  }
}

to

else if (...) {
  ...
} else {
  ...
}

Member Author

Oh, sorry, my latest commit seems to have crossed with your review. This issue is fixed in the latest commit.

Member

Maybe SPARK-19783 :).

Member Author

fixed!

@maropu maropu changed the title [SPARK-19783][SQL] Treat shorter/longer lengths of tokens as malformed records in CSV parser [SPARK-19783][SQL] Treat longer lengths of tokens as malformed records in CSV parser Mar 5, 2017
@maropu (Member, Author) commented Mar 5, 2017

@HyukjinKwon Thanks for your comment! Yea, I agree; we'd be better off treating the longer records as malformed and making the behaviour for shorter ones match R's.

@SparkQA commented Mar 5, 2017

Test build #73929 has finished for PR 17136 at commit 4ece983.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 5, 2017

Test build #73931 has finished for PR 17136 at commit 073a5cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@alexz00 commented Mar 14, 2017

Hi,
I have some concerns about not treating shorter records as malformed: this could lead to corrupt/inconsistent data, since there is no reason why a record's missing tokens cannot be 'in the middle' rather than at the end of the record.
I think that at least it would be useful to add an option to define a policy for this.
If you think it is better, I can open an issue for this enhancement.
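
A small sketch of that ambiguity, with illustrative field names and data:

// A short record cannot tell us WHICH field is missing.
val schema = Seq("name", "age", "city")
val tokens = Array("alice", "tokyo")  // 2 tokens for 3 fields

// End-padding with nulls silently mis-assigns values whenever the gap is
// in the middle of the record:
val padded = tokens.padTo(schema.length, null)
schema.zip(padded).foreach { case (f, v) => println(s"$f -> $v") }
// name -> alice
// age -> tokyo   <- wrong if "age" was the field actually missing
// city -> null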

@maropu (Member, Author) commented Mar 14, 2017

@alexz00 Thanks for your comment. Since the longer case is not very arguable, I think we can fix it in this PR. However, the shorter case is much more debatable, so IMHO we need to collect others' opinions and discuss it further in a follow-up JIRA ticket. Anyway, this decision is up to more qualified folks.

@SparkQA commented Mar 14, 2017

Test build #74541 has finished for PR 17136 at commit 8d83985.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 21, 2017

Test build #74924 has finished for PR 17136 at commit 3ff3d3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Mar 21, 2017

This PR has gone stale because of the refactoring in #17315, so I'll close it for now. Thanks!

@maropu maropu closed this Mar 21, 2017