@rshkv rshkv commented Aug 27, 2020

Cherry-picking apache#26027 for SPARK-24540.

Adds support for multi-character delimiters. Functionality comes mostly from upgrading the Univocity parser.

Existing CSVSuite tests still pass. New tests were added with multi-character delimiters.

…CSV read

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

Adds support for parsing CSV data using multiple-character delimiters. The existing logic for converting the input delimiter string to characters was kept and is now invoked in a loop. Project dependencies were updated to remove a redundant declaration of the `univocity-parsers` version, and to change that version to the latest (2.8.3).
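To illustrate the "invoked in a loop" idea, here is a minimal Python sketch (Spark's actual code is Scala and may differ in detail). The names `to_char` and `to_delimiter_str` and the escape table are assumptions made for this example; they are not taken from the patch.

```python
# Hypothetical sketch: names and the escape table are assumptions, not Spark's code.
_ESCAPES = {"t": "\t", "r": "\r", "b": "\b", "f": "\f", '"': '"', "'": "'", "\\": "\\"}


def to_char(token: str) -> str:
    """Convert one delimiter token (plain character or backslash escape) to one character."""
    if len(token) == 1 and token != "\\":
        return token
    if len(token) == 2 and token[0] == "\\" and token[1] in _ESCAPES:
        return _ESCAPES[token[1]]
    raise ValueError(f"Unsupported delimiter token: {token!r}")


def to_delimiter_str(raw: str) -> str:
    """Build a multi-character delimiter by applying the single-token conversion in a loop."""
    if not raw:
        raise ValueError("Delimiter cannot be empty")
    out = []
    i = 0
    while i < len(raw):
        # A backslash starts a two-character escape; anything else is a one-character token.
        width = 2 if raw[i] == "\\" else 1
        out.append(to_char(raw[i:i + width]))
        i += width
    return "".join(out)
```

Under these assumptions, `to_delimiter_str("||")` returns the delimiter unchanged, while `to_delimiter_str(r"\t|")` resolves the escape and yields a tab followed by a pipe.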

Delimited data often uses a sequence of characters, rather than a single character, as the delimiter. Handling such data in Spark is currently difficult and typically requires pre-processing.

Yes, this introduces a user-facing change. Specifying the "delimiter" option for the DataFrame read with more than one character will no longer result in an exception. Instead, the value will be converted as before and passed to the underlying library (Univocity), which has accepted multiple-character delimiters since 2.8.0.
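As a rough illustration of what reading with a multi-character delimiter means for a single record, the sketch below splits a line on a delimiter of any length while ignoring delimiters inside quotes. It is a simplified stand-in written for this example, not Univocity's algorithm; `split_record` is a hypothetical helper, and escaped quotes are not handled.

```python
from typing import List


def split_record(line: str, delimiter: str, quote: str = '"') -> List[str]:
    """Split one record on a (possibly multi-character) delimiter, respecting quotes."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    fields: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == quote:
            # Toggle quoted state; the quote character itself is dropped.
            in_quotes = not in_quotes
            i += 1
        elif not in_quotes and line.startswith(delimiter, i):
            # Full delimiter matched outside quotes: close the current field.
            fields.append("".join(current))
            current = []
            i += len(delimiter)
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields
```

For example, `split_record('a|b||c', '||')` keeps the lone `|` inside the first field. In Spark itself, the corresponding read would look something like `spark.read.option("delimiter", "||").csv(path)`, assuming a build that includes this change.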

The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.

Closes apache#26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
@rshkv rshkv requested a review from robert3005 August 27, 2020 17:50
@rshkv rshkv merged commit 7294159 into master Sep 1, 2020
@rshkv rshkv deleted the wr/spark-24540 branch September 1, 2020 10:03