[SPARK-25387][SQL] Fix for NPE caused by bad CSV input #22374
MaxGekk wants to merge 7 commits into apache:master
Test build #95848 has finished for PR 22374 at commit
This line below possibly returns null?
@maropu It can return
ok, thanks for the check!
test("SPARK-25387: bad input should not cause NPE") {
  val schema = StructType(StructField("a", IntegerType) :: Nil)
  val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
btw, what does "bad CSV" mean in this title (bad unicode?)? In this case the CSV parser returns null and, in another case, it throws com.univocity.parsers.common.TextParsingException? I just want to know the behaviour of the parser.
The parseLine method can return null in many cases. See:
https://github.com/uniVocity/univocity-parsers/blob/f616d151b48150bc9cb98943f9b6f8353b704359/src/main/java/com/univocity/parsers/common/AbstractParser.java#L663
https://github.com/uniVocity/univocity-parsers/blob/f616d151b48150bc9cb98943f9b6f8353b704359/src/main/java/com/univocity/parsers/common/AbstractParser.java#L678
It is the normal way for the method to indicate an error.
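A minimal sketch of the guard this thread is discussing, written against the Scala standard library only. `parseLine` below is a hypothetical stand-in that mimics a parser returning `null` on unreadable input; it is not the uniVocity method, and the "contains a NUL character" rule is an assumption made purely for illustration. Wrapping the result in `Option(...)` turns the `null` into `None`, so callers cannot dereference it by accident.

```scala
object NullSafeParse {
  // Hypothetical stand-in for a parser that signals failure by returning null.
  // Assumption for this sketch: any line containing a NUL character is unparseable.
  def parseLine(line: String): Array[String] =
    if (line.exists(_ == '\u0000')) null else line.split(",", -1)

  // Option(x) is None when x is null and Some(x) otherwise.
  def safeParse(line: String): Option[Array[String]] =
    Option(parseLine(line))
}
```

Callers can then use `map`, `foreach`, or `getOrElse` on the result instead of testing for `null` at every call site.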
Test build #95904 has finished for PR 22374 at commit
CSVUtils.filterHeaderLine(filteredLines, firstLine, parsedOptions)
val parser = new CsvParser(parsedOptions.asParserSettings)
linesWithoutHeader.map(parser.parseLine)
if (firstRow != null) {
Can we simplify the code as follows?
maybeFirstLine.map(new CsvParser(parsedOptions.asParserSettings).parseLine(_)) match {
case Some(firstRow) if firstRow != null =>
case _ =>
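The suggested shape can be tried out in isolation. The sketch below assumes a hypothetical `parse` function that returns `null` for input it cannot read (it is not `CsvParser.parseLine`), and `columnCount` is an invented example of what the match arms might compute. It shows the `match` with a null guard next to an equivalent `flatMap(Option(_))` form.

```scala
object FirstRowSketch {
  // Hypothetical parser: returns null when the line starts with a NUL character.
  def parse(line: String): Array[String] =
    if (line.startsWith("\u0000")) null else line.split(",", -1)

  // Pattern match with a null guard, in the shape of the review suggestion.
  def columnCount(maybeFirstLine: Option[String]): Int =
    maybeFirstLine.map(parse) match {
      case Some(firstRow) if firstRow != null => firstRow.length
      case _ => 0 // no first line, or the parser could not read it
    }

  // Equivalent form: Option(...) converts null to None before further use.
  def columnCountAlt(maybeFirstLine: Option[String]): Int =
    maybeFirstLine.flatMap(line => Option(parse(line))).map(_.length).getOrElse(0)
}
```

Both variants treat "no first line" and "unparseable first line" the same way, which is the point of the `case _ =>` fallback.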
Test build #95939 has finished for PR 22374 at commit
private def convert(tokens: Array[String]): InternalRow = {
  if (tokens.length != parsedSchema.length) {
  if (tokens == null) {
I hit it on a CSV file with some marks (a couple of zero bytes) at the beginning, but the uniVocity parser returns null in many cases when it cannot read/parse the input, for example: https://github.com/uniVocity/univocity-parsers/blob/f616d151b48150bc9cb98943f9b6f8353b704359/src/main/java/com/univocity/parsers/common/AbstractParser.java#L663
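Why the null check has to come first can be shown with a small stand-alone sketch. `BadRecord`, `expectedWidth`, and this `convert` are hypothetical names used only for illustration; they are not Spark's classes or signature. Calling `tokens.length` on a `null` array is what raises the NPE, so the null test must precede any other use of `tokens`.

```scala
object ConvertSketch {
  // Hypothetical exception type standing in for a "malformed record" signal.
  final case class BadRecord(message: String) extends RuntimeException(message)

  // Converts parsed tokens into a row of the expected width.
  def convert(tokens: Array[String], expectedWidth: Int): Seq[String] =
    if (tokens == null) {
      // Checked before touching tokens.length, which would throw an NPE.
      throw BadRecord("Malformed CSV record")
    } else if (tokens.length != expectedWidth) {
      // Pad with nulls or truncate so the row matches the expected width.
      tokens.toSeq.padTo(expectedWidth, null).take(expectedWidth)
    } else {
      tokens.toSeq
    }
}
```

Raising a dedicated exception, rather than letting the NPE escape, gives the caller a chance to treat the line as a bad record instead of failing the whole read.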
thanks, merging to master/2.4!
## What changes were proposed in this pull request?

The PR fixes an NPE in `UnivocityParser` caused by malformed CSV input. In some cases, the `uniVocity` parser can return `null` for bad input. In the PR, I propose to check the result of parsing and not propagate the NPE to upper layers.

## How was this patch tested?

I added a test which reproduces the issue and tested it with `CSVSuite`.

Closes #22374 from MaxGekk/npe-on-bad-csv.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 083c944)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
val parser = new CsvParser(parsedOptions.asParserSettings)
linesWithoutHeader.map(parser.parseLine)
}
CSVInferSchema.infer(tokenRDD, header, parsedOptions)
@MaxGekk, BTW what happens if the second line is the malformed record and the parser returns null? From a cursory look, schema inference looks like it is going to throw an NPE.
@HyukjinKwon I have checked this on (with header too):
val input = spark.createDataset(Seq("1", "\u0000\u0000\u0001234"))
val df = spark.read.option("inferSchema", true).csv(input)
df.printSchema()
df.show()
root
 |-- _c0: integer (nullable = true)
+----+
| _c0|
+----+
|   1|
|null|
+----+
In the debugger, I didn't observe null in
What changes were proposed in this pull request?
The PR fixes an NPE in `UnivocityParser` caused by malformed CSV input. In some cases, the `uniVocity` parser can return `null` for bad input. In the PR, I propose to check the result of parsing and not propagate the NPE to upper layers.
How was this patch tested?
I added a test which reproduces the issue and tested it with `CSVSuite`.