[SPARK-25387][SQL] Fix for NPE caused by bad CSV input #22374
Changes from all commits: 6f9aba5, 9284527, 05fe5fa, b20c12d, c9ccbee, bd4ebe4, 2a0dac4
**`UnivocityParser.scala`**

```diff
@@ -216,7 +216,12 @@ class UnivocityParser(
   }

   private def convert(tokens: Array[String]): InternalRow = {
-    if (tokens.length != parsedSchema.length) {
+    if (tokens == null) {
+      throw BadRecordException(
+        () => getCurrentInput,
+        () => None,
+        new RuntimeException("Malformed CSV record"))
+    } else if (tokens.length != parsedSchema.length) {
       // If the number of tokens doesn't match the schema, we should treat it as a malformed record.
       // However, we still have chance to parse some of the tokens, by adding extra null tokens in
       // the tail if the number is smaller, or by dropping extra tokens if the number is larger.
```

**Contributor** (on `if (tokens == null) {`): When will we hit this?

**Member (Author)**: I got it on a CSV file with some marks (a couple of zero bytes) at the beginning, but …
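The guarded control flow above can be modeled outside Spark. Below is a hypothetical Python sketch (not Spark's actual code; `convert` and `BadRecordException` only loosely mirror the patch): when the tokenizer returns nothing for unparseable input, the fix raises a bad-record error instead of dereferencing the null array and hitting an NPE, and a token count that disagrees with the schema is padded with nulls or truncated.

```python
class BadRecordException(Exception):
    """Stand-in for Spark's BadRecordException: signals a malformed record."""
    pass

def convert(tokens, schema_len):
    """Mimics the fixed UnivocityParser.convert control flow (illustrative only)."""
    if tokens is None:
        # Before the fix, tokens.length was read here -> NullPointerException.
        raise BadRecordException("Malformed CSV record")
    if len(tokens) != schema_len:
        # Pad with nulls if too short, drop extra tokens if too long.
        tokens = (tokens + [None] * schema_len)[:schema_len]
    return tokens
```

For example, `convert(["1"], 2)` yields `["1", None]`, while `convert(None, 1)` raises `BadRecordException` rather than an NPE.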
**`CSVSuite.scala`**

```diff
@@ -34,7 +34,7 @@ import org.apache.log4j.{AppenderSkeleton, LogManager}
 import org.apache.log4j.spi.LoggingEvent

 import org.apache.spark.SparkException
-import org.apache.spark.sql.{AnalysisException, DataFrame, QueryTest, Row, UDT}
+import org.apache.spark.sql.{AnalysisException, DataFrame, QueryTest, Row}
 import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}

@@ -1700,4 +1700,13 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     checkCount(2)
     countForMalformedCSV(0, Seq(""))
   }
+
+  test("SPARK-25387: bad input should not cause NPE") {
+    val schema = StructType(StructField("a", IntegerType) :: Nil)
+    val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
+
+    checkAnswer(spark.read.schema(schema).csv(input), Row(null))
+    checkAnswer(spark.read.option("multiLine", true).schema(schema).csv(input), Row(null))
+    assert(spark.read.csv(input).collect().toSet == Set(Row()))
+  }
 }
```

**Member** (on `val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))`): BTW, in this title, what does "bad CSV" mean (bad unicode?)? In this case the CSV parser returns null, and in another case it throws …

**Member (Author)**: The … It is a normal way for the method to indicate an error.
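The test relies on Spark's default PERMISSIVE parse mode, in which a malformed record surfaces as a row of nulls (`Row(null)` in the `checkAnswer` calls) rather than failing the job. A hypothetical Python sketch of that behavior, with a toy tokenizer standing in for the uniVocity parser (`tokenize` and `parse_permissive` are illustrative names, not Spark APIs):

```python
def tokenize(line):
    """Toy CSV tokenizer: returns None when the line starts with zero bytes,
    like the bad input "\\u0000\\u0000\\u0001234" in the test above."""
    if line.startswith("\x00"):
        return None
    return line.split(",")

def parse_permissive(line, schema_len):
    """PERMISSIVE-style parsing: a malformed record becomes a row of all nulls
    instead of raising, so one bad line cannot fail the whole read."""
    tokens = tokenize(line)
    if tokens is None:
        return [None] * schema_len
    return (tokens + [None] * schema_len)[:schema_len]
```

Here `parse_permissive("\x00\x00\x01234", 1)` yields `[None]`, mirroring the `Row(null)` the test expects.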
**Member**: @MaxGekk, BTW, what happens if the second line is the malformed record and the parser returns null? From a cursory look, schema inference looks like it will throw an NPE.

**Member (Author)**: @HyukjinKwon I have checked this on … (with header too). In the debugger, I didn't observe `null` in `sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala`, lines 61 to 69 (at 5264164).
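The concern above is that schema inference folds over tokenized rows and could dereference a null token array produced by a malformed line. A minimal Python sketch (hypothetical, not `CSVInferSchema` itself) of a null-tolerant fold that skips rows for which tokenization failed:

```python
def infer_field_count(tokenized_rows):
    """Fold over tokenized rows to find the widest row, ignoring rows where
    tokenization failed (represented as None), so the fold never dereferences
    a null token array."""
    width = 0
    for tokens in tokenized_rows:
        if tokens is None:
            continue  # malformed row contributes nothing to the inferred schema
        width = max(width, len(tokens))
    return width
```

With this shape, a malformed second line (`None`) is simply skipped: `infer_field_count([["a", "b"], None, ["c"]])` yields 2.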