[SPARK-26108][SQL] Support custom lineSep in CSV datasource #23080
Closed
Changes from 15 commits
Commits (17, all by MaxGekk):
a790bb3 Added a test for default line separator
7a47990 Test for custom lineSep
be2870f Test on read
a058a6f Support lineSep in write
7e3c026 Check roundtrip
486b090 Test another char
a0fedbb Don't keep quotes
5f013f5 Support 2 chars as lineSep
65786df Revert unrelated changes
49b91ea Test restrictions for lineSep
12022ad Updating comments and docs
bb8a13b Merge branch 'master' into csv-line-sep
0869b81 Tests for lineSep in different encodings
1f5399f Support encoding for lineSep
918d163 Restrict lineSep by 1 character only
c06899f Merge remote-tracking branch 'origin/master' into csv-line-sep
a4c4b67 Fix comments

    @@ -192,6 +192,20 @@ class CSVOptions(
        */
       val emptyValueInWrite = emptyValue.getOrElse("\"\"")

    +  /**
    +   * A string between two consecutive JSON records.
    +   */
    +  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
    +    require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
    +    require(sep.length == 1, "'lineSep' can contain only 1 character.")
    +    sep
    +  }
    +
    +  val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map { lineSep =>
    +    lineSep.getBytes(charset)
    +  }
    +  val lineSeparatorInWrite: Option[String] = lineSeparator
    +
       def asWriterSettings: CsvWriterSettings = {
         val writerSettings = new CsvWriterSettings()
         val format = writerSettings.getFormat
    @@ -200,6 +214,8 @@ class CSVOptions(
         format.setQuoteEscape(escape)
         charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
         format.setComment(comment)
    +    lineSeparatorInWrite.foreach(format.setLineSeparator)
    +
         writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
         writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
         writerSettings.setNullValue(nullValue)
    @@ -216,8 +232,10 @@ class CSVOptions(
         format.setDelimiter(delimiter)
         format.setQuote(quote)
         format.setQuoteEscape(escape)
    +    lineSeparator.foreach(format.setLineSeparator)
         charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
         format.setComment(comment)
    +
         settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
         settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
         settings.setReadInputOnSeparateThread(false)
    @@ -227,7 +245,10 @@ class CSVOptions(
         settings.setEmptyValue(emptyValueInRead)
         settings.setMaxCharsPerColumn(maxCharsPerColumn)
         settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    -    settings.setLineSeparatorDetectionEnabled(multiLine == true)
    +    settings.setLineSeparatorDetectionEnabled(lineSeparatorInRead.isEmpty && multiLine)
    +    lineSeparatorInRead.foreach { _ =>
    +      settings.setNormalizeLineEndingsWithinQuotes(!multiLine)
    +    }

         settings
       }

Review comment (Member) on the line `lineSeparatorInRead.foreach { _ =>`: nice!
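
The option handling in the diff above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical stand-in class `LineSepSketch` (not Spark's `CSVOptions`) that copies only the `lineSep` validation, the byte encoding for the reader, and the detection condition. The `charset` and `multiLine` parameters mirror the names used in the diff; everything else is an assumption made for illustration.

```scala
import java.nio.charset.StandardCharsets
import scala.util.Try

// Hypothetical stand-in for the lineSep-related part of CSVOptions; not Spark's class.
final case class LineSepSketch(
    parameters: Map[String, String],
    charset: String = StandardCharsets.UTF_8.name,
    multiLine: Boolean = false) {

  // Same two checks as in the diff: non-empty and exactly one character.
  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
    require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
    require(sep.length == 1, "'lineSep' can contain only 1 character.")
    sep
  }

  // The reader side gets the separator as bytes in the input's encoding.
  val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map(_.getBytes(charset))

  // Detection is requested only when no separator is set and multiLine is on.
  val lineSeparatorDetectionEnabled: Boolean = lineSeparatorInRead.isEmpty && multiLine
}

object LineSepSketchDemo {
  def main(args: Array[String]): Unit = {
    // One character is one byte in UTF-8 but two bytes in UTF-16LE.
    println(LineSepSketch(Map("lineSep" -> "|")).lineSeparatorInRead.map(_.length)) // Some(1)
    println(LineSepSketch(Map("lineSep" -> "|"), charset = "UTF-16LE")
      .lineSeparatorInRead.map(_.length)) // Some(2)

    // A two-character separator such as "\r\n" fails the length check.
    println(Try(LineSepSketch(Map("lineSep" -> "\r\n"))).isFailure) // true

    // An explicit separator switches detection off even in multiLine mode.
    println(LineSepSketch(Map.empty, multiLine = true).lineSeparatorDetectionEnabled) // true
    println(LineSepSketch(Map("lineSep" -> "\n"), multiLine = true)
      .lineSeparatorDetectionEnabled) // false
  }
}
```

The sketch shows why the separator is stored twice: the writer can use the string as is, while the reader needs bytes, whose count depends on the encoding.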

    @@ -377,6 +377,8 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
        * <li>`multiLine` (default `false`): parse one record, which may span multiple lines.</li>
        * <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format.
        * For instance, this is used while parsing dates and timestamps.</li>
    +   * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
    +   * that should be used for parsing. Maximum length is 2.</li>
        * </ul>
        *
        * @since 2.0.0
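
The documented default treats `\r`, `\r\n` and `\n` alike as line terminators. As a rough illustration of what that means for input with mixed line endings, here is a minimal sketch that uses a plain regex split. It is not how Spark or the Hadoop line reader is implemented, and the `splitLines` helper is a hypothetical name used only here.

```scala
object DefaultLineSplitSketch {
  // Illustrative only: each of "\r\n", "\r" and "\n" ends a line.
  // "\r\n" comes first in the alternation so a CRLF pair counts as one terminator.
  def splitLines(input: String): List[String] =
    input.split("\r\n|\r|\n").toList

  def main(args: Array[String]): Unit = {
    val mixed = "a,1\r\nb,2\nc,3\rd,4"
    println(splitLines(mixed)) // List(a,1, b,2, c,3, d,4)
  }
}
```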
I currently have a project where we import CSV files with Windows (CRLF) newlines.
I backported these changes but ran into an issue with this check, because to properly parse Windows CSV files I must be able to set "\r\n" for lineSep in the settings.
It appears that the reason this require was added no longer applies, as the code in asReaderSettings/asWriterSettings never calls that function anymore.
After removing this assert, I can import the Windows-newline CSV files into DataFrames properly.
Another issue I had before this was that the very last column would always get a "\r" at the end of its name, so something like "TEXT" would become "TEXT\r", and we could no longer query the TEXT column. Setting lineSep to "\r\n" solved this issue as well.

You don't need to set `\r\n` to `lineSep` to split an input by lines, because Hadoop Line Reader can detect `\r\n` itself. In which mode do you parse the CSV files: per-line (`multiLine = false`) or multiline?

I am setting multiLine = "true".
The problem is that the name of the last column in the CSV header gets a \r appended to it.
So if I have
name,age,text\r\nfred,30,"likes\r\npie,cookies,milk"\njill,30,"likes\ncake,cookies,milk"\r\n
I was getting a schema with
StringType("NAME")
IntegerType("AGE")
StringType("TEXT\r")
Could it be that the mixed use of \r\n and \n makes it use only \n for newlines?
Another issue is that the lineSep configuration is controlled upstream, by a different configuration provided by users who have no knowledge of Spark but know how they formatted their CSV files. Without some re-architecture, it is not possible to detect that this setting is \r\n and then set it to None for CSVOptions.
lineSeparator.foreach(format.setLineSeparator) already handles 1 to 2 characters, so I figured this is a safe thing to support for the lineSep configuration, no?
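
The trailing `\r` symptom described above can be reproduced on the sample without any parser. The following is a minimal sketch, assuming a hypothetical helper `headerColumns` that cuts the header at a given separator and then splits on commas. It does not reproduce univocity's or Spark's behavior. It only shows that a header cut at `\n` keeps the `\r` from a CRLF ending, while a header cut at `\r\n` does not.

```scala
object TrailingCrSketch {
  // Header cells obtained by cutting the input at the first `lineSep`, then at commas.
  // Quoting is ignored because the header row in the sample has no quoted fields.
  def headerColumns(csv: String, lineSep: String): List[String] = {
    val end = csv.indexOf(lineSep)
    val header = if (end >= 0) csv.substring(0, end) else csv
    header.split(",", -1).toList
  }

  def main(args: Array[String]): Unit = {
    // The sample from the comment above, with mixed "\r\n" and "\n" line endings.
    val sample =
      "name,age,text\r\nfred,30,\"likes\r\npie,cookies,milk\"\njill,30,\"likes\ncake,cookies,milk\"\r\n"

    // Make the carriage return visible when printing.
    def show(cols: List[String]): List[String] = cols.map(_.replace("\r", "\\r"))

    println(show(headerColumns(sample, "\n")))   // List(name, age, text\r)
    println(show(headerColumns(sample, "\r\n"))) // List(name, age, text)
  }
}
```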

For multiLine = true, we have fixed the auto-multiline detection feature in CSV (see #22503). That will do the job.

That is taken care of by the following line that I backported, no?
settings.setLineSeparatorDetectionEnabled(lineSeparatorInRead.isEmpty && multiLine)
I am still having the issue that univocity keeps a \r in the column name with multiLine set to true and lineSeparatorInRead unset.
The only way I seem to be able to get Spark not to put a \r in the column name is to specify the lineSep option explicitly with two characters, `\r\n`. Then I get a normal set of column names and everything else parses correctly.
I'm wondering if this is just some really pedantic CSV file that I'm working with? It's a CSV that is exported upstream by the Python pandas to_csv function with no extra arguments set.

Would you be able to file a JIRA if the issue persists after testing against the master branch?