Conversation

@jeff303 (Contributor) commented Oct 4, 2019

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

What changes were proposed in this pull request?

Adds support for parsing CSV data using multiple-character delimiters. Existing logic for converting the input delimiter string to characters was kept and invoked in a loop. Project dependencies were updated to remove redundant declaration of univocity-parsers version, and also to change that version to the latest.

Why are the changes needed?

It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters. Currently, it is difficult to handle such data in Spark (typically needs pre-processing).

Does this PR introduce any user-facing change?

Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception. Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.
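
For illustration, a multi-character separator can now be passed straight to the existing option. This is a minimal sketch only; the file path, sample data, and the `spark` SparkSession (as in spark-shell) are assumptions for the example, not part of the patch:

```scala
// Assumes a running SparkSession named `spark` and a file whose fields are
// separated by the two-character sequence "||".
val df = spark.read
  .option("header", "true")
  .option("sep", "||")   // previously, more than one character raised an exception
  .csv("/tmp/multi_delim_example.csv")

df.show()
```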

How was this patch tested?

The CSVSuite tests were confirmed passing (including new methods), and sbt tests for sql were executed.

@jeff303 force-pushed the SPARK-24540 branch 2 times, most recently from ec975da to 5e61c2b on October 4, 2019 21:25
@jeff303 changed the title from "[SPARK-24540][SQL] Support for multiple delimiter in Spark CSV read" to "[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read" on Oct 4, 2019
@jeff303 force-pushed the SPARK-24540 branch 2 times, most recently from 668f786 to 2f6dc3c on October 4, 2019 21:59
@HyukjinKwon (Member)

ok to test

@HyukjinKwon (Member)

cc @MaxGekk as well

SparkQA commented Oct 5, 2019

Test build #111798 has finished for PR 26027 at commit 2f6dc3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

while (idx < str.length()) {
// if the current character is a backslash, check it plus the next char
// in order to use existing escape logic
val readAhead = if (str(idx) == '\\') 2 else 1
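
For context on the quoted hunk, here is a minimal, self-contained sketch of a delimiter-building loop in this style. It is illustrative only, not the exact patch: the single-character converter is passed in as a parameter rather than referencing the real CSVExprUtils helper.

```scala
// Sketch: walk the option value in "chunks" (a backslash plus the following
// character, or a single character) and delegate each chunk to an existing
// single-character converter such as toChar.
def toDelimiterStr(str: String, toChar: String => Char): String = {
  require(str != null && str.nonEmpty, "Delimiter cannot be empty")
  val sb = new StringBuilder
  var idx = 0
  while (idx < str.length) {
    // if the current character is a backslash, read it plus the next char
    // so the existing escape logic can be reused
    val readAhead = if (str(idx) == '\\') 2 else 1
    val chunk = str.substring(idx, math.min(idx + readAhead, str.length))
    sb.append(toChar(chunk))
    idx += chunk.length
  }
  sb.toString
}
```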
Member

toChar() can handle "\u0000", which is not 2 chars long, I think. Could you check this case and write a test for it?

Member

I was going to say it's not worth it, because toChar doesn't support general unicode syntax either, it's Java/Scala syntax anyway, and \0 is the more natural way to say it. But toChar doesn't support \0 either. We could at least special-case that in toChar instead, to support NULL as a delimiter, rather than expand the logic here.

Contributor Author

I just added a couple more tests, for both varieties of the null character. They were already being handled: \u0000 was handled because of the special case in toChar, and \0 was handled because of the normal single character case (i.e. case Seq(c) => c).
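
To make the two cases mentioned above concrete, here is a simplified, illustrative sketch of a toChar-style helper covering just them; the real CSVExprUtils.toChar handles more escape sequences and error cases.

```scala
// Simplified illustration only. When "\0" has already been unescaped to a
// single NUL character by the caller's language, the ordinary one-character
// case covers it; the six-character sequence \ u 0 0 0 0 is special-cased.
def toCharSimplified(str: String): Char = (str: Seq[Char]) match {
  case Seq(c) => c                                    // single character, including NUL
  case Seq('\\', 'u', '0', '0', '0', '0') => '\u0000' // literal backslash followed by u0000
  case _ => throw new IllegalArgumentException(s"Unsupported delimiter: $str")
}
```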

Member

I don't think anything but Scala is handling the \u0000 case. The String is one character by the time any of this executes. I think you'd find this doesn't work if you write "\\u0000", which is what you would have to do to actually encounter the 6-character string \u0000 here. But then you'd interpret the delimiter as u0000. Same problem as the "\\" case.

To your test case below -- unicode unescaping happens before everything, so """\u0000""" still yields a 1-character delimiter.

I would suggest punting on this right here, but I am kind of concerned about the `"\"` case in general.
I would remove the added comment above about backslash escaping, because AFAICT people should just be using the language's string literal syntax for expressing the chars, and we actually shouldn't further unescape them; but we can leave that much unchanged here.

We may find this whole loop is unnecessary as a result.
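
A small, illustrative check of the point about source-level unescaping (behavior as in Scala 2.12, which Spark used at the time; both assertions restate what is described above):

```scala
// The Scala 2.12 compiler processes \ u X X X X escapes before string-literal
// parsing, even inside triple-quoted strings, so this value is already a
// single NUL character by the time any delimiter code sees it.
val alreadyOneChar = """\u0000"""
assert(alreadyOneChar.length == 1)

// To actually hand the code the six characters backslash-u-0-0-0-0, the
// backslash itself must be escaped, and then (as noted above) a naive
// read-ahead loop would consume the backslash plus 'u' and misread the rest.
val sixChars = "\\u0000"
assert(sixChars.length == 6)
```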

Contributor Author

Ah yes, I made an incorrect assumption about how the triple quote interpolator works. But yes, if the whole partial unescape stuff could be removed, then it would be far simpler here.

SparkQA commented Oct 7, 2019

Test build #111839 has finished for PR 26027 at commit 0f9c427.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 7, 2019

Test build #111853 has finished for PR 26027 at commit 3c9c48f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

One more small change here, but I think this is looking good

* <li>`sep` (default `,`): sets a single character as a separator for each
* field and value.</li>
* <li>`sep` (default `,`): sets a separator for each field and value. This separator can be one
* or more characters. Special characters in the separator need to be escaped by a backslash.</li>
Member

I'd remove the reference to escaping here, per earlier conversation.

Contributor Author

Done

SparkQA commented Oct 12, 2019

Test build #111937 has finished for PR 26027 at commit 463ea1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…CSV read

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV, and new test suite for the CSVExprUtils method

Updating spark-deps* files with new univocity-parsers version

@srowen (Member) left a comment

Looks OK pending tests

SparkQA commented Oct 14, 2019

Test build #112057 has finished for PR 26027 at commit 17ebe0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Oct 15, 2019

Merged to master

@srowen closed this in 95de93b Oct 15, 2019
rshkv pushed a commit to palantir/spark that referenced this pull request Aug 27, 2020
…CSV read

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

Adds support for parsing CSV data using multiple-character delimiters.  Existing logic for converting the input delimiter string to characters was kept and invoked in a loop.  Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.

It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters.  Currently, it is difficult to handle such data in Spark (typically needs pre-processing).

Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception.  Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.

The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.

Closes apache#26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
rshkv pushed a commit to palantir/spark that referenced this pull request Sep 1, 2020