
[SPARK-25387][SQL] Fix for NPE caused by bad CSV input#22374

Closed
MaxGekk wants to merge 7 commits into apache:master from MaxGekk:npe-on-bad-csv

Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Sep 9, 2018

What changes were proposed in this pull request?

The PR fixes an NPE in UnivocityParser caused by malformed CSV input. In some cases, the uniVocity parser can return null for bad input. In this PR, I propose to check the result of parsing and not propagate the NPE to upper layers.
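A minimal sketch of the defensive pattern described above (the helper names are hypothetical, not the actual Spark patch): wrapping the parser result in an `Option` turns a `null` token array from the underlying parser into an empty result instead of letting an NPE escape downstream.

```scala
object NullSafeParse {
  // Stand-in for a parser that, like uniVocity, may return null on bad input.
  def parseLine(line: String): Array[String] =
    if (line.startsWith("\u0000")) null else line.split(",")

  // Defensive wrapper: Option(null) is None, so callers never see a null array.
  def safeParseLine(line: String): Option[Array[String]] =
    Option(parseLine(line))
}
```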

How was this patch tested?

I added a test that reproduces the issue and checked it with CSVSuite.

@SparkQA

SparkQA commented Sep 9, 2018

Test build #95848 has finished for PR 22374 at commit c9ccbee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 10, 2018

Could this line below possibly return null?

val columnNames = parser.parseLine(firstLine)

@MaxGekk
Member Author

MaxGekk commented Sep 10, 2018

Could this line below possibly return null?

@maropu It can return null, but there is a null check inside CSVDataSource.checkHeaderColumnNames.

@maropu
Member

maropu commented Sep 10, 2018

ok, thanks for the check!


    test("SPARK-25387: bad input should not cause NPE") {
      val schema = StructType(StructField("a", IntegerType) :: Nil)
      val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
Member

btw, what does bad CSV in this title mean (bad unicode?)? In this case the CSV parser returns null, and in another case it throws com.univocity.parsers.common.TextParsingException? I just want to understand the behaviour of the parser.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95904 has finished for PR 22374 at commit bd4ebe4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    CSVUtils.filterHeaderLine(filteredLines, firstLine, parsedOptions)
    val parser = new CsvParser(parsedOptions.asParserSettings)
    linesWithoutHeader.map(parser.parseLine)
    if (firstRow != null) {
Member


Can we simplify the code as follows?

    maybeFirstLine.map(new CsvParser(parsedOptions.asParserSettings).parseLine(_)) match {
      case Some(firstRow) if firstRow != null =>
      case _ =>
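A fleshed-out sketch of the pattern suggested above (the case bodies and fallback are illustrative, not the actual Spark code): matching on the parsed first line lets both the `null` result and the no-first-line case fall through to the same fallback branch.

```scala
object FirstLineMatch {
  // Mimics a parser that returns null for unparsable input.
  def parseLine(line: String): Array[String] =
    if (line.isEmpty) null else line.split(",")

  // Pattern from the review comment: Some(null) and None are handled together.
  def headerOrDefault(maybeFirstLine: Option[String]): Seq[String] =
    maybeFirstLine.map(parseLine) match {
      case Some(firstRow) if firstRow != null => firstRow.toSeq
      case _ => Seq("_c0") // fallback when there is no usable header
    }
}
```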

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95939 has finished for PR 22374 at commit 2a0dac4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


    private def convert(tokens: Array[String]): InternalRow = {
    if (tokens.length != parsedSchema.length) {
    if (tokens == null) {
Contributor


when will we hit this?

Member Author

@MaxGekk MaxGekk Sep 12, 2018


I got it on a CSV file with some marks (a couple of zero bytes) at the beginning, but the uniVocity parser returns null in many cases when it cannot read/parse the input, for example: https://github.com/uniVocity/univocity-parsers/blob/f616d151b48150bc9cb98943f9b6f8353b704359/src/main/java/com/univocity/parsers/common/AbstractParser.java#L663

@cloud-fan
Contributor

thanks, merging to master/2.4!

asfgit pushed a commit that referenced this pull request Sep 13, 2018
## What changes were proposed in this pull request?

The PR fixes NPE in `UnivocityParser` caused by malformed CSV input. In some cases, `uniVocity` parser can return `null` for bad input. In the PR, I propose to check result of parsing and not propagate NPE to upper layers.

## How was this patch tested?

I added a test which reproduce the issue and tested by `CSVSuite`.

Closes #22374 from MaxGekk/npe-on-bad-csv.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 083c944)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 083c944 Sep 13, 2018
Member

@HyukjinKwon HyukjinKwon left a comment


LGTM too

    val parser = new CsvParser(parsedOptions.asParserSettings)
    linesWithoutHeader.map(parser.parseLine)
    }
    CSVInferSchema.infer(tokenRDD, header, parsedOptions)
Member


@MaxGekk, BTW what happens if the second line is a malformed record and parsing returns null? From a cursory look, schema inference is going to throw an NPE.

Member Author


@HyukjinKwon I have checked this (with a header too) on:

    val input = spark.createDataset(Seq("1", "\u0000\u0000\u0001234"))

    val df = spark.read.option("inferSchema", true).csv(input)
    df.printSchema()
    df.show()
    root
     |-- _c0: integer (nullable = true)

    +----+
    | _c0|
    +----+
    |   1|
    |null|
    +----+

In the debugger, I didn't observe `null` in:

    private def inferRowType(options: CSVOptions)
        (rowSoFar: Array[DataType], next: Array[String]): Array[DataType] = {
      var i = 0
      while (i < math.min(rowSoFar.length, next.length)) { // May have columns on right missing.
        rowSoFar(i) = inferField(rowSoFar(i), next(i), options)
        i += 1
      }
      rowSoFar
    }
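To illustrate the per-column merge step in `inferRowType` above, here is a standalone sketch. The toy `inferField` below (which only distinguishes integers from strings) is an assumption for illustration; the real Spark logic handles many more types and options.

```scala
object InferSketch {
  // Toy type lattice: once any value in a column is non-integer, it stays "string".
  def inferField(sofar: String, field: String): String =
    if (sofar == "string" || field.exists(!_.isDigit)) "string" else "integer"

  // Same shape as inferRowType: fold each token of a row into the running column types.
  def inferRowType(rowSoFar: Array[String], next: Array[String]): Array[String] = {
    var i = 0
    while (i < math.min(rowSoFar.length, next.length)) { // May have columns on right missing.
      rowSoFar(i) = inferField(rowSoFar(i), next(i))
      i += 1
    }
    rowSoFar
  }
}
```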

fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Sep 13, 2018
@MaxGekk MaxGekk deleted the npe-on-bad-csv branch August 17, 2019 13:33
