[SPARK-16101][SQL] Refactoring CSV data source to be consistent with JSON data source #13988
Conversation
I still need to correct some nits and check consistency with the JSON data source, but I opened this just to check whether it breaks anything. I will push more commits soon (and will also update the PR description with more details).

Test build #61523 has finished for PR 13988 at commit

Hi @rxin, the change in this PR might still be pretty big. Should I split it into two PRs, one for the reading part and one for the writing part?

Test build #61587 has finished for PR 13988 at commit

Test build #61588 has finished for PR 13988 at commit

(@rxin gentle ping..)

Test build #62011 has finished for PR 13988 at commit

Test build #62014 has finished for PR 13988 at commit

retest this please

I updated the PR description. I hope this is helpful for reviewing.

Test build #62015 has finished for PR 13988 at commit

Test build #62016 has finished for PR 13988 at commit
@HyukjinKwon we ran into this issue where CSV writes ints for DateType instead of a date string. (https://issues.apache.org/jira/browse/SPARK-16597)
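For context, Spark stores `DateType` values internally as an `Int` counting days since the Unix epoch, which is why a raw integer can leak into CSV output when the write path lacks a converter. A minimal sketch of the idea behind a fix (the helper name `dateToCsvString` is hypothetical, not Spark's API):

```scala
import java.time.LocalDate

// Spark's DateType is internally an Int: days since 1970-01-01.
// Without a converter, a CSV writer emits that raw Int (e.g. "17000")
// instead of a date string. A write-path converter should format it:
def dateToCsvString(daysSinceEpoch: Int): String =
  LocalDate.ofEpochDay(daysSinceEpoch.toLong).toString // ISO-8601, e.g. "2016-07-18"
```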
Actually, I opened another PR here, #13912. That PR may be the relevant one.
ah thanks, commented on that PR. Glad to see someone showing some love to Spark's CSV data source!
@rxin Could you take a look please?

Test build #63425 has finished for PR 13988 at commit

Test build #63482 has finished for PR 13988 at commit

Test build #64254 has finished for PR 13988 at commit

Test build #64635 has finished for PR 13988 at commit

cc @hvanhovell Do you mind reviewing this, please? I remember you reviewed the initial proposal. If this seems too big to review, I can split it into the reading path and the writing path.

This is also loosely related to https://issues.apache.org/jira/browse/SPARK-15463. After this one is merged, we could easily mirror the JSON implementation rather than introducing another refactoring.

Test build #65716 has finished for PR 13988 at commit
Force-pushed a6d85b6 to ac94e67
Test build #66037 has finished for PR 13988 at commit

Test build #66035 has finished for PR 13988 at commit

Test build #66038 has finished for PR 13988 at commit

@hvanhovell If this change looks too big, please let me know and I will split it into the reading path and the writing path.
Force-pushed 97eec87 to 697f276
Test build #66939 has finished for PR 13988 at commit

Test build #66936 has finished for PR 13988 at commit
Force-pushed 697f276 to 346b1d2
Test build #67391 has finished for PR 13988 at commit

I will try to split this into two PRs, for the read path and the write path. Would that sound okay to you both, @rxin and @hvanhovell?
Force-pushed 346b1d2 to 72e7dcc
schema.foreach(field => verifyType(field.dataType))
}
}
These below just came from CSVRelation.
 * 2. Merge row types to find common type
 * 3. Replace any null types with string type
 */
def infer(
This argument change is important for introducing functionality similar to the JSON data source's (e.g., creating a DataFrame from RDD[String] or Dataset[String]).
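The inference steps documented on `infer` (infer per-field types, merge row types to a common type, replace remaining null types with string) can be sketched in miniature as below. All names here (`SimpleType`, `inferField`, `mergeType`, `inferSchema`) are illustrative stand-ins, not Spark's internal API; taking an `Iterator[Array[String]]` rather than a file path is what makes the same logic reusable for RDD[String]/Dataset[String] inputs.

```scala
import scala.util.Try

// Simplified type lattice for inference; Spark uses DataType instead.
sealed trait SimpleType
case object NullT   extends SimpleType
case object IntT    extends SimpleType
case object DoubleT extends SimpleType
case object StringT extends SimpleType

// Step 1: infer a type for a single field value.
def inferField(datum: String): SimpleType =
  if (datum.isEmpty) NullT
  else if (Try(datum.toInt).isSuccess) IntT
  else if (Try(datum.toDouble).isSuccess) DoubleT
  else StringT

// Step 2: merge two candidate types to their common type.
def mergeType(a: SimpleType, b: SimpleType): SimpleType = (a, b) match {
  case (NullT, t)                        => t
  case (t, NullT)                        => t
  case (x, y) if x == y                  => x
  case (IntT, DoubleT) | (DoubleT, IntT) => DoubleT
  case _                                 => StringT
}

// Steps 1-3 over all tokenized rows; step 3 replaces null types with string.
def inferSchema(rows: Iterator[Array[String]], numCols: Int): Seq[SimpleType] = {
  val acc = Array.fill[SimpleType](numCols)(NullT)
  for (row <- rows; i <- 0 until math.min(row.length, numCols))
    acc(i) = mergeType(acc(i), inferField(row(i)))
  acc.toSeq.map { case NullT => StringT; case t => t }
}
```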
case datum =>
  Try(datum.toDouble)
    .getOrElse(NumberFormat.getInstance(Locale.US).parse(datum).doubleValue())
private def makeSafeHeader(
This just came from CSVFileFormat.
val isCommentSet = this.comment != '\u0000'

def asWriterSettings: CsvWriterSettings = {
These just came from CSVParser.
writerSettings.setHeaders(schema.fieldNames: _*)
private val gen = new CsvWriter(writer, writerSettings)

// A `ValueConverter` is responsible for converting a value of an `InternalRow` to `String`.
These below just mostly came from CSVRelation.
private type ValueConverter = String => Any

var numMalformedRecords = 0
val row = new GenericInternalRow(requiredSchema.length)
Now, we reuse a single row.
Also, it separates numMalformedRecords from the parse(...) call, which looked weird before.
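The row-reuse idea can be sketched as follows: allocate one mutable row buffer up front and overwrite it for every record, instead of allocating a fresh row per record. `ReusableRow` and `makeParser` are illustrative stand-ins for Spark's `GenericInternalRow` and the parser internals, not the actual classes.

```scala
// A fixed-size mutable row buffer, allocated once and overwritten per record.
final class ReusableRow(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v
  def get(i: Int): Any = values(i)
}

// Build a parser closure that converts each token with its per-column
// converter, writing results into the single shared row.
def makeParser(converters: Array[String => Any]): Array[String] => ReusableRow = {
  val row = new ReusableRow(converters.length) // allocated once, reused
  tokens => {
    var i = 0
    while (i < converters.length) {
      row(i) = converters(i)(tokens(i)) // overwrite in place
      i += 1
    }
    row
  }
}
```

Callers must copy the row if they need to retain it across records, which is the usual trade-off of this optimization.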
 * each element represents a column) and turns it into either one resulting row or no row (if
 * the record is malformed).
 */
def parse(input: String): Option[InternalRow] = {
Here, I separated the parse-mode logic (withParseMode) from the actual conversion logic (parse).
Also, the argument change (matching it up to JacksonParser) is important. We could avoid additional refactoring when introducing the same functionality as JacksonParser (e.g., loading from RDD[String] or Dataset[String], or from_json and to_json functions).
For example, PR #13300 introduces such a refactoring.
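The parse-mode separation discussed here can be sketched roughly as below: a pure token-to-row conversion function, wrapped by mode handling. The `ParseMode` names mirror Spark's CSV `mode` option values (PERMISSIVE, DROPMALFORMED, FAILFAST), but the code is a simplification for illustration, not the actual implementation.

```scala
sealed trait ParseMode
case object Permissive    extends ParseMode
case object DropMalformed extends ParseMode
case object FailFast      extends ParseMode

// The conversion step knows nothing about modes; the mode wrapper decides
// what to do with records whose token count does not match the schema.
def parseWithMode(tokens: Array[String], expectedLen: Int, mode: ParseMode)
                 (convert: Array[String] => Seq[Any]): Option[Seq[Any]] =
  if (tokens.length == expectedLen) Some(convert(tokens))
  else mode match {
    case DropMalformed => None // silently drop the malformed record
    case FailFast =>
      throw new RuntimeException(s"Malformed line: ${tokens.mkString(",")}")
    case Permissive => // pad missing fields with "", drop extras
      Some(convert(tokens.padTo(expectedLen, "").take(expectedLen)))
  }
```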
Test build #70917 has finished for PR 13988 at commit

Test build #70919 has finished for PR 13988 at commit

Test build #70920 has finished for PR 13988 at commit

Test build #70923 has finished for PR 13988 at commit

Hi @cloud-fan, could you check whether this makes sense?
options: CSVOptions) extends Logging {
def this(schema: StructType, options: CSVOptions) = this(schema, schema, options)

val valueConverters = makeConverters(schema, options)
Some of the conversion changes here came from CSVTypeCast.
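A CSVTypeCast-style conversion turns one raw CSV token into a typed value, honoring a `nullValue` option. A minimal sketch, with type names simplified to strings (Spark dispatches on `DataType` instead; `castTo` here is an illustrative helper, not the real signature):

```scala
// Cast a raw CSV token to a typed value. A token equal to the configured
// nullValue becomes null; otherwise we dispatch on the (simplified) type name.
def castTo(datum: String, typeName: String, nullValue: String = ""): Any =
  if (datum == nullValue) null
  else typeName match {
    case "int"     => datum.toInt
    case "double"  => datum.toDouble
    case "boolean" => datum.toBoolean
    case _         => datum // fall back to the raw string
  }
```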
can you split this into smaller PRs? it's really painful to review such a big refactor-only PR.

Sure! Let me split this into reading and writing ones. Thank you for your comments. Let me close this for now.
What changes were proposed in this pull request?
This PR refactors the CSV data source to be consistent with the JSON data source.

This PR removes the class `CSVParser` and introduces the new classes `UnivocityParser` and `UnivocityGenerator` to be consistent with the JSON data source (`JacksonParser`, `JacksonGenerator`). Also, `CSVRelation` is merged into `CSVFileFormat`, just like `JsonFileFormat`. This is a rough look of this change:

- `CSVOptions` - reading/writing settings that can be created from options (from `CsvParser`)
- `UnivocityGenerator` - writing logics from `CSVRelation` and `CsvParser`
- `UnivocityParser` - parsing logics in `CSVTypeCast`, `CsvParser` and `CSVRelation`
- `CSVFileFormat` - `CSVOutputWriterFactory` and `CsvOutputWriter` in `CSVRelation`

This PR makes the methods in these classes have arguments consistent with the JSON ones:

- `UnivocityGenerator` and `JacksonGenerator`
- `UnivocityParser` resembles `JacksonParser`
- `csv.CSVInferSchema` and `json.InferSchema`

This PR also puts the classes together in a manner consistent with JSON:

- `CsvFileFormat` and `JsonFileFormat`

Also, this rewrites the existing CSV parsing logic to reuse a single row, separate the parse-mode logic from the conversion logic, etc.
How was this patch tested?
Existing tests should cover this.