[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read #26027
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,6 +17,8 @@ | |
|
|
||
| package org.apache.spark.sql.catalyst.csv | ||
|
|
||
| import org.apache.commons.lang3.StringUtils | ||
|
|
||
| object CSVExprUtils { | ||
| /** | ||
| * Filter ignorable rows for CSV iterator (lines empty and starting with `comment`). | ||
|
|
@@ -79,4 +81,48 @@ object CSVExprUtils { | |
| throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str") | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Helper method that converts string representation of a character sequence to actual | ||
| * delimiter characters. The input is processed in "chunks", and each chunk is converted | ||
| * by calling [[CSVExprUtils.toChar()]]. A chunk is either: | ||
| * <ul> | ||
| * <li>a backslash followed by another character</li> | ||
| * <li>a non-backslash character by itself</li> | ||
| * </ul> | ||
| * , in that order of precedence. The result of converting all chunks is returned as | ||
| * a [[String]]. | ||
| * | ||
| * <br/><br/>Examples: | ||
| * <ul><li>`\t` will result in a single tab character as the separator (same as before) | ||
|
||
| * </li><li>`|||` will result in a sequence of three pipe characters as the separator | ||
| * </li><li>`\\` will result in a single backslash as the separator (same as before) | ||
| * </li><li>`\.` will result in an error (since a dot is not a character that needs to be escaped) | ||
| * </li><li>`\\.` will result in a backslash, then dot, as the separator character sequence | ||
| * </li><li>`.\t.` will result in a dot, then tab, then dot as the separator character sequence | ||
| * </li> | ||
| * </ul> | ||
| * | ||
| * @param str the string representing the sequence of separator characters | ||
| * @return a [[String]] representing the multi-character delimiter | ||
| * @throws IllegalArgumentException if any of the individual input chunks are illegal | ||
| */ | ||
| def toDelimiterStr(str: String): String = { | ||
|
||
| var idx = 0 | ||
|
|
||
| var delimiter = "" | ||
|
|
||
| while (idx < str.length()) { | ||
| // if the current character is a backslash, check it plus the next char | ||
| // in order to use existing escape logic | ||
| val readAhead = if (str(idx) == '\\') 2 else 1 | ||
|
Member
I was going to say, it's not worth it, because
Contributor
Author
I just added a couple more tests, for both varieties of the null character. They were already being handled.
Member
I don't think anything but Scala is handling the To your test case below -- unicode unescaping happens before everything, so I would suggest punting on this right here, but I am kind of concerned about the '"\"` case in general. We may find this whole loop is unnecessary as a result.
Contributor
Author
Ah yes, I made an incorrect assumption about how the triple quote interpolator works. But yes, if the whole partial unescape stuff could be removed, then it would be far simpler here. |
||
| // get the chunk of 1 or 2 input characters to convert to a single delimiter char | ||
| val chunk = StringUtils.substring(str, idx, idx + readAhead) | ||
| delimiter += toChar(chunk) | ||
|
||
| // advance the counter by the length of input chunk processed | ||
| idx += chunk.length() | ||
| } | ||
|
|
||
| delimiter.mkString("") | ||
| } | ||
| } | ||
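
The chunking loop in `toDelimiterStr` can be tried outside Spark with a small standalone sketch. The names `DelimiterSketch`, `toCharSketch`, and `toDelimiterStrSketch` are hypothetical and not part of this PR. `toCharSketch` is a simplified stand-in for `CSVExprUtils.toChar`, whose full body is not shown in this diff; it is assumed here to handle only the tab and backslash escapes. The sketch also accumulates into a `StringBuilder` rather than concatenating strings, which is a choice made for this illustration.

```scala
object DelimiterSketch {
  // Hypothetical, simplified stand-in for CSVExprUtils.toChar.
  // Assumption: only `\t` and `\\` are recognised as escapes here.
  def toCharSketch(chunk: String): Char = chunk.toList match {
    case c :: Nil if c != '\\' => c      // a single non-backslash character
    case '\\' :: 't' :: Nil    => '\t'   // backslash + t becomes a tab
    case '\\' :: '\\' :: Nil   => '\\'   // two backslashes become one
    case '\\' :: _ =>
      throw new IllegalArgumentException(
        s"Unsupported special character for delimiter: $chunk")
    case _ =>
      throw new IllegalArgumentException(s"Cannot convert chunk: $chunk")
  }

  // Walks the input in chunks of one or two characters, as described in
  // the Scaladoc above, and converts each chunk to one delimiter character.
  def toDelimiterStrSketch(str: String): String = {
    val out = new StringBuilder
    var idx = 0
    while (idx < str.length) {
      // A backslash is read together with the character that follows it.
      val readAhead = if (str.charAt(idx) == '\\') 2 else 1
      val chunk = str.substring(idx, math.min(idx + readAhead, str.length))
      out.append(toCharSketch(chunk))
      // Advance by the number of input characters actually consumed.
      idx += chunk.length
    }
    out.toString
  }
}
```

Under these assumptions, the Scaladoc examples behave as described: the input `.\t.` (dot, backslash, `t`, dot) yields dot, tab, dot, while `\.` raises an `IllegalArgumentException`.
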
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| year_/-\_make_/-\_model_/-\_comment_/-\_blank | ||
| '2012'_/-\_'Tesla'_/-\_'S'_/-\_'No comment'_/-\_ | ||
| 1997_/-\_Ford_/-\_E350_/-\_'Go get one now they are going fast'_/-\_ | ||
| 2015_/-\_Chevy_/-\_Volt |
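
To see how the rows of the test file above break into fields, here is a minimal sketch that splits a line on a literal multi-character separator. It assumes the separator is the five-character sequence `_/-\_`, inferred from the file contents. `MultiCharSplitSketch` and `splitLine` are hypothetical names. This is plain string splitting, not the way Spark parses CSV, and it ignores quote handling.

```scala
import java.util.regex.Pattern

object MultiCharSplitSketch {
  // Assumed separator: underscore, slash, hyphen, backslash, underscore.
  val sep: String = "_/-\\_"

  // Pattern.quote makes the regex engine treat the separator literally.
  // The limit of -1 keeps trailing empty fields, such as an empty last column.
  def splitLine(line: String, delimiter: String): Array[String] =
    line.split(Pattern.quote(delimiter), -1)
}
```

With this sketch, the header line yields five fields, the `'2012'` row yields five fields with an empty last one, and the `2015` row yields only three, since it has no trailing separators.
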
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| year, make, model, comment, blank | ||
| '2012', 'Tesla', 'S', No comment, | ||
| 1997, Ford, E350, 'Go get one now they are going fast', | ||
| 2015, Chevy, Volt |