
[SPARK-25387][SQL] Fix for NPE caused by bad CSV input#22374

Closed
MaxGekk wants to merge 7 commits into apache:master from MaxGekk:npe-on-bad-csv

Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Sep 9, 2018

What changes were proposed in this pull request?

The PR fixes an NPE in UnivocityParser caused by malformed CSV input. In some cases, the uniVocity parser can return null for bad input. In this PR, I propose to check the result of parsing and not propagate the NPE to upper layers.
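A minimal sketch of the defensive pattern described above (the helper names are hypothetical, not the actual Spark patch): wrapping the parser result in an `Option` turns a `null` token array from the underlying parser into an empty result instead of letting an NPE escape downstream.

```scala
object NullSafeParse {
  // Stand-in for a parser that, like uniVocity, may return null on bad input.
  def parseLine(line: String): Array[String] =
    if (line.startsWith("\u0000")) null else line.split(",")

  // Defensive wrapper: Option(null) is None, so callers never see a null array.
  def safeParseLine(line: String): Option[Array[String]] =
    Option(parseLine(line))
}
```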

How was this patch tested?

I added a test that reproduces the issue and checked it with CSVSuite.

@SparkQA

SparkQA commented Sep 9, 2018

Test build #95848 has finished for PR 22374 at commit c9ccbee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 10, 2018

Could this line below possibly return null?

val columnNames = parser.parseLine(firstLine)

@MaxGekk
Member Author

MaxGekk commented Sep 10, 2018

Could this line below possibly return null?

@maropu It can return null, but there is a null check inside CSVDataSource.checkHeaderColumnNames.

@maropu
Member

maropu commented Sep 10, 2018

ok, thanks for the check!


    test("SPARK-25387: bad input should not cause NPE") {
      val schema = StructType(StructField("a", IntegerType) :: Nil)
      val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
Member

btw, what does bad CSV in this title mean (bad unicode?)? In this case the CSV parser returns null, and in another case it throws com.univocity.parsers.common.TextParsingException? I just want to understand the behaviour of the parser.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95904 has finished for PR 22374 at commit bd4ebe4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    CSVUtils.filterHeaderLine(filteredLines, firstLine, parsedOptions)
    val parser = new CsvParser(parsedOptions.asParserSettings)
    linesWithoutHeader.map(parser.parseLine)
    if (firstRow != null) {
Member


Can we simplify the code as follows?

    maybeFirstLine.map(new CsvParser(parsedOptions.asParserSettings).parseLine(_)) match {
      case Some(firstRow) if firstRow != null =>
      case _ =>
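A fleshed-out sketch of the pattern suggested above (the case bodies and fallback are illustrative, not the actual Spark code): matching on the parsed first line lets both the `null` result and the no-first-line case fall through to the same fallback branch.

```scala
object FirstLineMatch {
  // Mimics a parser that returns null for unparsable input.
  def parseLine(line: String): Array[String] =
    if (line.isEmpty) null else line.split(",")

  // Pattern from the review comment: Some(null) and None are handled together.
  def headerOrDefault(maybeFirstLine: Option[String]): Seq[String] =
    maybeFirstLine.map(parseLine) match {
      case Some(firstRow) if firstRow != null => firstRow.toSeq
      case _ => Seq("_c0") // fallback when there is no usable header
    }
}
```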

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95939 has finished for PR 22374 at commit 2a0dac4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


    private def convert(tokens: Array[String]): InternalRow = {
    if (tokens.length != parsedSchema.length) {
    if (tokens == null) {
Contributor


when will we hit this?

Member Author

@MaxGekk MaxGekk Sep 12, 2018


I got it on a CSV file with some marks (a couple of zero bytes) at the beginning, but the uniVocity parser returns null in many cases when it cannot read/parse the input, for example: https://github.com/uniVocity/univocity-parsers/blob/f616d151b48150bc9cb98943f9b6f8353b704359/src/main/java/com/univocity/parsers/common/AbstractParser.java#L663

@cloud-fan
Contributor

thanks, merging to master/2.4!

asfgit pushed a commit that referenced this pull request Sep 13, 2018
## What changes were proposed in this pull request?

The PR fixes NPE in `UnivocityParser` caused by malformed CSV input. In some cases, `uniVocity` parser can return `null` for bad input. In the PR, I propose to check result of parsing and not propagate NPE to upper layers.

## How was this patch tested?

I added a test which reproduce the issue and tested by `CSVSuite`.

Closes #22374 from MaxGekk/npe-on-bad-csv.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 083c944)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 083c944 Sep 13, 2018
Member

@HyukjinKwon HyukjinKwon left a comment


LGTM too

    val parser = new CsvParser(parsedOptions.asParserSettings)
    linesWithoutHeader.map(parser.parseLine)
    }
    CSVInferSchema.infer(tokenRDD, header, parsedOptions)
Member


@MaxGekk, BTW what happens if the second line is a malformed record and parsing returns null? From a cursory look, schema inference is going to throw an NPE.

Member Author


@HyukjinKwon I have checked this (with a header too) on:

    val input = spark.createDataset(Seq("1", "\u0000\u0000\u0001234"))

    val df = spark.read.option("inferSchema", true).csv(input)
    df.printSchema()
    df.show()
    root
     |-- _c0: integer (nullable = true)

    +----+
    | _c0|
    +----+
    |   1|
    |null|
    +----+

In the debugger, I didn't observe `null` in:

    private def inferRowType(options: CSVOptions)
        (rowSoFar: Array[DataType], next: Array[String]): Array[DataType] = {
      var i = 0
      while (i < math.min(rowSoFar.length, next.length)) { // May have columns on right missing.
        rowSoFar(i) = inferField(rowSoFar(i), next(i), options)
        i += 1
      }
      rowSoFar
    }
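To illustrate the per-column merge step in `inferRowType` above, here is a standalone sketch. The toy `inferField` below (which only distinguishes integers from strings) is an assumption for illustration; the real Spark logic handles many more types and options.

```scala
object InferSketch {
  // Toy type lattice: once any value in a column is non-integer, it stays "string".
  def inferField(sofar: String, field: String): String =
    if (sofar == "string" || field.exists(!_.isDigit)) "string" else "integer"

  // Same shape as inferRowType: fold each token of a row into the running column types.
  def inferRowType(rowSoFar: Array[String], next: Array[String]): Array[String] = {
    var i = 0
    while (i < math.min(rowSoFar.length, next.length)) { // May have columns on right missing.
      rowSoFar(i) = inferField(rowSoFar(i), next(i))
      i += 1
    }
    rowSoFar
  }
}
```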

fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Sep 13, 2018
@MaxGekk MaxGekk deleted the npe-on-bad-csv branch August 17, 2019 13:33
