[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read #26027
Conversation
Three resolved review threads (marked outdated) on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
Force-pushed ec975da to 5e61c2b
Force-pushed 668f786 to 2f6dc3c
Resolved review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
ok to test

cc @MaxGekk as well

Test build #111798 has finished for PR 26027 at commit
    while (idx < str.length()) {
      // if the current character is a backslash, check it plus the next char
      // in order to use existing escape logic
      val readAhead = if (str(idx) == '\\') 2 else 1
toChar() can handle "\u0000", which is not 2 chars long, I think. Could you check this case and write a test for it?
I was going to say, it's not worth it, because toChar doesn't support general unicode syntax either, and it's Java/Scala syntax anyway, and \0 is the more natural way to say it. But toChar doesn't support \0. We could at least special-case that in toChar as well instead, to support NULL as a delimiter, rather than expand the logic here.
I just added a couple more tests, for both varieties of the null character. They were already being handled: \u0000 was handled because of the special case in toChar, and \0 was handled because of the normal single character case (i.e. case Seq(c) => c).
I don't think anything but Scala is handling the \u0000 case. The String is one character by the time any of this executes. I think you'd find this doesn't work if you write "\\u0000", which is what you would have to do to actually encounter the 6-character string \u0000 here. But then you'd interpret the delimiter as u0000. Same problem as the "\\" case.
To your test case below -- unicode unescaping happens before everything, so """\u0000""" still yields a 1-character delimiter.
I would suggest punting on this right here, but I am kind of concerned about the "\\" case in general.
I would remove the added comment above about backslash escaping because AFAICT people should just be using the language's string literal syntax for expressing the chars, and we actually shouldn't further unescape them, but we can leave that much unchanged here.
We may find this whole loop is unnecessary as a result.
Ah yes, I made an incorrect assumption about how the triple quote interpolator works. But yes, if the whole partial unescape stuff could be removed, then it would be far simpler here.
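For readers following this thread, here is a rough, self-contained sketch of the approach being discussed: a single-character toChar-style helper (with the special case for the six-character sequence backslash + "u0000" and the plain case Seq(c) => c path mentioned above), driven by the readAhead loop from the diff. It is an illustrative reconstruction under those assumptions, not the actual Spark source.

    // Illustrative reconstruction of the logic under discussion (not the exact Spark code).
    object DelimiterSketch {
      // Convert one (possibly backslash-escaped) delimiter token to a single character.
      def toChar(str: String): Char = (str: Seq[Char]) match {
        case Seq(c) => c                      // plain single character; a literal NUL char lands here
        case Seq('\\', 't') => '\t'
        case Seq('\\', 'r') => '\r'
        case Seq('\\', 'b') => '\b'
        case Seq('\\', 'f') => '\f'
        case Seq('\\', '\"') => '\"'
        case Seq('\\', '\'') => '\''
        case Seq('\\', 'u', '0', '0', '0', '0') => '\u0000'  // special case: six-char backslash-u-0000
        case _ => throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
      }

      // Build a multi-character delimiter by feeding the helper one token at a time.
      def toDelimiterStr(str: String): String = {
        val builder = new StringBuilder
        var idx = 0
        while (idx < str.length) {
          // if the current character is a backslash, take it plus the next char
          // so the existing single-character escape logic can be reused
          val readAhead = if (str(idx) == '\\') 2 else 1
          builder += toChar(str.substring(idx, math.min(idx + readAhead, str.length)))
          idx += readAhead
        }
        builder.toString
      }
    }

Under this sketch, DelimiterSketch.toDelimiterStr("""\t\t""") would yield a two-character tab-tab delimiter.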
Resolved review threads on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala and sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala (the latter marked outdated)
Test build #111839 has finished for PR 26027 at commit
Resolved review threads on sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala and sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (some marked outdated)
Test build #111853 has finished for PR 26027 at commit
srowen left a comment:
One more small change here, but I think this is looking good
- * <li>`sep` (default `,`): sets a single character as a separator for each
- * field and value.</li>
+ * <li>`sep` (default `,`): sets a separator for each field and value. This separator can be one
+ * or more characters. Special characters in the separator need to be escaped by a backslash.</li>
I'd remove the reference to escaping here, per earlier conversation.
Done
Test build #111937 has finished for PR 26027 at commit
Resolved review thread on sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
…CSV read

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters
Moving univocity-parsers version to spark-parent pom dependencyManagement section
Adding new utility method to build multi-char delimiter string, which delegates to existing one
Adding tests for multiple character delimited CSV, and new test suite for the CSVExprUtils method
Updating spark-deps* files with new univocity-parsers version
srowen left a comment:
Looks OK pending tests
Test build #112057 has finished for PR 26027 at commit
Merged to master
…CSV read

Closes apache#26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters
Moving univocity-parsers version to spark-parent pom dependencyManagement section
Adding new utility method to build multi-char delimiter string, which delegates to existing one
Adding tests for multiple character delimited CSV
What changes were proposed in this pull request?
Adds support for parsing CSV data using multiple-character delimiters. Existing logic for converting the input delimiter string to characters was kept and invoked in a loop. Project dependencies were updated to remove the redundant declaration of the `univocity-parsers` version, and also to change that version to the latest.

Why are the changes needed?
It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters. Currently, it is difficult to handle such data in Spark (typically needs pre-processing).
Does this PR introduce any user-facing change?
Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception. Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.
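As a rough illustration of that user-facing change (the delimiter, path, and data below are made up for the example, and spark is assumed to be an existing SparkSession):

    // Hypothetical example: fields separated by "||", which previously raised an exception.
    val df = spark.read
      .option("sep", "||")       // multi-character delimiter, passed through to Univocity
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/double-pipe-delimited.csv")
    df.show()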
How was this patch tested?
The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.
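For context, a minimal sketch of the kind of round-trip check such a suite might contain (file contents, names, and assertions are illustrative, not the actual CSVSuite code; spark is assumed to be an existing SparkSession):

    // Hypothetical round-trip check: write a small "||"-delimited file, read it back.
    import java.nio.charset.StandardCharsets
    import java.nio.file.Files

    val dir = Files.createTempDirectory("multi-char-sep")
    val file = dir.resolve("data.csv")
    Files.write(file, "a||b||c\n1||2||3\n".getBytes(StandardCharsets.UTF_8))

    val df = spark.read
      .option("sep", "||")
      .option("header", "true")
      .csv(file.toString)

    assert(df.columns.toSeq == Seq("a", "b", "c"))
    assert(df.count() == 1)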