Conversation

@jeff303 (Contributor) commented Oct 4, 2019

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

What changes were proposed in this pull request?

Adds support for parsing CSV data using multiple-character delimiters. Existing logic for converting the input delimiter string to characters was kept and invoked in a loop. Project dependencies were updated to remove redundant declaration of univocity-parsers version, and also to change that version to the latest.

Why are the changes needed?

It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters. Currently, it is difficult to handle such data in Spark (typically needs pre-processing).

Does this PR introduce any user-facing change?

Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception. Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.
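
For illustration, a multi-character separator can now be passed straight to the existing option. This is a minimal sketch only; the file path, sample data, and the `spark` SparkSession (as in spark-shell) are assumptions for the example, not part of the patch:

```scala
// Assumes a running SparkSession named `spark` and a file whose fields are
// separated by the two-character sequence "||".
val df = spark.read
  .option("header", "true")
  .option("sep", "||")   // previously, more than one character raised an exception
  .csv("/tmp/multi_delim_example.csv")

df.show()
```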

How was this patch tested?

The CSVSuite tests were confirmed passing (including new methods), and sbt tests for sql were executed.

@jeff303 force-pushed the SPARK-24540 branch 2 times, most recently from ec975da to 5e61c2b on October 4, 2019 21:25
@jeff303 changed the title from "[SPARK-24540][SQL] Support for multiple delimiter in Spark CSV read" to "[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read" on Oct 4, 2019
@jeff303 force-pushed the SPARK-24540 branch 2 times, most recently from 668f786 to 2f6dc3c on October 4, 2019 21:59
@HyukjinKwon (Member)

ok to test

@HyukjinKwon (Member)

cc @MaxGekk as well

SparkQA commented Oct 5, 2019

Test build #111798 has finished for PR 26027 at commit 2f6dc3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

while (idx < str.length()) {
// if the current character is a backslash, check it plus the next char
// in order to use existing escape logic
val readAhead = if (str(idx) == '\\') 2 else 1
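
For context on the quoted hunk, here is a minimal, self-contained sketch of a delimiter-building loop in this style. It is illustrative only, not the exact patch: the single-character converter is passed in as a parameter rather than referencing the real CSVExprUtils helper.

```scala
// Sketch: walk the option value in "chunks" (a backslash plus the following
// character, or a single character) and delegate each chunk to an existing
// single-character converter such as toChar.
def toDelimiterStr(str: String, toChar: String => Char): String = {
  require(str != null && str.nonEmpty, "Delimiter cannot be empty")
  val sb = new StringBuilder
  var idx = 0
  while (idx < str.length) {
    // if the current character is a backslash, read it plus the next char
    // so the existing escape logic can be reused
    val readAhead = if (str(idx) == '\\') 2 else 1
    val chunk = str.substring(idx, math.min(idx + readAhead, str.length))
    sb.append(toChar(chunk))
    idx += chunk.length
  }
  sb.toString
}
```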
Member

toChar() can handle "\u0000", which is not 2 chars long, I think. Could you check this case and write a test for it?

Member

I was going to say it's not worth it, because toChar doesn't support general unicode syntax either, it's Java/Scala syntax anyway, and \0 is the more natural way to say it. But toChar doesn't support \0 either. We could at least special-case that in toChar instead, to support NULL as a delimiter, rather than expand the logic here.

Contributor Author

I just added a couple more tests, for both varieties of the null character. They were already being handled: \u0000 was handled because of the special case in toChar, and \0 was handled because of the normal single character case (i.e. case Seq(c) => c).
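
To make the two cases mentioned above concrete, here is a simplified, illustrative sketch of a toChar-style helper covering just them; the real CSVExprUtils.toChar handles more escape sequences and error cases.

```scala
// Simplified illustration only. When "\0" has already been unescaped to a
// single NUL character by the caller's language, the ordinary one-character
// case covers it; the six-character sequence \ u 0 0 0 0 is special-cased.
def toCharSimplified(str: String): Char = (str: Seq[Char]) match {
  case Seq(c) => c                                    // single character, including NUL
  case Seq('\\', 'u', '0', '0', '0', '0') => '\u0000' // literal backslash followed by u0000
  case _ => throw new IllegalArgumentException(s"Unsupported delimiter: $str")
}
```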

Member

I don't think anything but Scala is handling the \u0000 case. The String is one character by the time any of this executes. I think you'd find this doesn't work if you write "\\u0000", which is what you would have to do to actually encounter the 6-character string \u0000 here. But then you'd interpret the delimiter as u0000. Same problem as the "\\" case.

To your test case below -- unicode unescaping happens before everything, so """\u0000""" still yields a 1-character delimiter.

I would suggest punting on this right here, but I am kind of concerned about the `"\"` case in general.
I would remove the added comment above about backslash escaping, because AFAICT people should just be using the language's string literal syntax for expressing the chars, and we actually shouldn't further unescape them; but we can leave that much unchanged here.

We may find this whole loop is unnecessary as a result.
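
A small, illustrative check of the point about source-level unescaping (behavior as in Scala 2.12, which Spark used at the time; both assertions restate what is described above):

```scala
// The Scala 2.12 compiler processes \ u X X X X escapes before string-literal
// parsing, even inside triple-quoted strings, so this value is already a
// single NUL character by the time any delimiter code sees it.
val alreadyOneChar = """\u0000"""
assert(alreadyOneChar.length == 1)

// To actually hand the code the six characters backslash-u-0-0-0-0, the
// backslash itself must be escaped, and then (as noted above) a naive
// read-ahead loop would consume the backslash plus 'u' and misread the rest.
val sixChars = "\\u0000"
assert(sixChars.length == 6)
```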

Contributor Author

Ah yes, I made an incorrect assumption about how the triple quote interpolator works. But yes, if the whole partial unescape stuff could be removed, then it would be far simpler here.

SparkQA commented Oct 7, 2019

Test build #111839 has finished for PR 26027 at commit 0f9c427.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 7, 2019

Test build #111853 has finished for PR 26027 at commit 3c9c48f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

One more small change here, but I think this is looking good

* <li>`sep` (default `,`): sets a single character as a separator for each
* field and value.</li>
* <li>`sep` (default `,`): sets a separator for each field and value. This separator can be one
* or more characters. Special characters in the separator need to be escaped by a backslash.</li>
Member

I'd remove the reference to escaping here, per earlier conversation.

Contributor Author

Done

SparkQA commented Oct 12, 2019

Test build #111937 has finished for PR 26027 at commit 463ea1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…CSV read

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV, and new test suite for the CSVExprUtils method

Updating spark-deps* files with new univocity-parsers version

@srowen (Member) left a comment

Looks OK pending tests

SparkQA commented Oct 14, 2019

Test build #112057 has finished for PR 26027 at commit 17ebe0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Oct 15, 2019

Merged to master

@srowen closed this in 95de93b Oct 15, 2019
rshkv pushed a commit to palantir/spark that referenced this pull request Aug 27, 2020
…CSV read

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

Adds support for parsing CSV data using multiple-character delimiters.  Existing logic for converting the input delimiter string to characters was kept and invoked in a loop.  Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.

It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters.  Currently, it is difficult to handle such data in Spark (typically needs pre-processing).

Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception.  Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.

The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.

Closes apache#26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
rshkv pushed a commit to palantir/spark that referenced this pull request Sep 1, 2020