ARROW-11013: [Rust][DataFusion] Add trim to CsvReader #9001

seddonm1 · 2020-12-23T19:44:49Z

The current CSV reader cannot parse strings to types with leading/trailing white spaces as the parsers are very strict. This means being able to read and parse the tpch-dbgen included answers files is not possible.

The underlying csv crate supports a four different behaviors for trimming strings:

None (default): does no trimming.
Headers: trim only header fields.
Fields: trim only field values.
All: trim both headers and field values.

Rather than exposing all these options and forcing users to understand the underlying csv crate this PR simplifies this decision to boolean: None (false) or All (true) while retaining the default false behavior.

github-actions · 2020-12-23T19:46:51Z

https://issues.apache.org/jira/browse/ARROW-11013

Dandandan · 2020-12-23T19:57:02Z

Thanks for the PR @seddonm1!
Wouldn't it be better to keep the CSV reader simple and do the trim after loading the CSV (in Arrow/DataFusion)? There it could be a simple trim(col) too. I the use case is relatively limited where you actually want to apply trimming on all columns.

I'm hesitant too to depend on the "more advanced" csv crate features, I think at some point it makes sense to utilize csv_core instead (for better performance).

EDIT: Ah I didn't read your message carefully... Will have a look

seddonm1 · 2020-12-23T20:01:59Z

Thanks for the PR @seddonm1!
Wouldn't it be better to keep the CSV reader simple and do the trim after loading the CSV (in Arrow/DataFusion)? There it could be a simple trim(col) too. I the use case is relatively limited where you actually want to apply trimming on all columns.

I'm hesitant too to depend on the "more advanced" csv crate features, I think at some point it makes sense to utilize csv_core instead (for better performance).

Yes it may be so I think this is up for discussion.

I have added the trim function in #8966 but the actual flow is read csv -> trim -> cast.

Dandandan · 2020-12-23T20:11:22Z

Are you aware of any other parser that does similar trimming like the csv crate?
Looking at pandas for example it seems that it also doesn't have the option to trim whitespaces from header / values directly.

seddonm1 · 2020-12-23T20:15:27Z

I was referencing the underlying library that Spark uses: https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/csv/CsvParser.java#L116

Here are the Spark CSV reader options: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame

Of course you could go down a rabbit hole trying to support all use cases.

I am more than happy to kill this PR if we make the decision that it doesn't belong here.

seddonm1 · 2020-12-23T20:37:48Z

It actually looks like we are missing the ability to read a csv and return the values all strings (similar to how infer schema works) without also trying to parse the values or having to provide an all DataType::Utf8 schema.

seddonm1 · 2020-12-26T04:49:25Z

@Dandandan I have taken the read -> trim -> parse approach here: #9015

I think I will close this an open a new ticket that allows the CSVReader to infer number of columns (with named from headers if provided) but return all DataType::Utf8. Thoughts?

Dandandan · 2020-12-26T05:08:02Z

I think that is a great idea @seddonm1 . We can always revisit later if it turns out we really need it in the parser! Thank you!

seddonm1 · 2020-12-26T05:13:18Z

Closed in favor of https://issues.apache.org/jira/browse/ARROW-11036

add csv trim capability

009627a

github-actions bot added Component: Rust - DataFusion Component: Rust labels Dec 23, 2020

seddonm1 closed this Dec 26, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ARROW-11013: [Rust][DataFusion] Add trim to CsvReader #9001

ARROW-11013: [Rust][DataFusion] Add trim to CsvReader #9001

Uh oh!

seddonm1 commented Dec 23, 2020

Uh oh!

github-actions bot commented Dec 23, 2020

Uh oh!

Dandandan commented Dec 23, 2020 •

edited

Loading

Uh oh!

seddonm1 commented Dec 23, 2020

Uh oh!

Dandandan commented Dec 23, 2020

Uh oh!

seddonm1 commented Dec 23, 2020 •

edited

Loading

Uh oh!

seddonm1 commented Dec 23, 2020

Uh oh!

seddonm1 commented Dec 26, 2020

Uh oh!

Dandandan commented Dec 26, 2020

Uh oh!

seddonm1 commented Dec 26, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ARROW-11013: [Rust][DataFusion] Add trim to CsvReader #9001

ARROW-11013: [Rust][DataFusion] Add trim to CsvReader #9001

Uh oh!

Conversation

seddonm1 commented Dec 23, 2020

Uh oh!

github-actions bot commented Dec 23, 2020

Uh oh!

Dandandan commented Dec 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seddonm1 commented Dec 23, 2020

Uh oh!

Dandandan commented Dec 23, 2020

Uh oh!

seddonm1 commented Dec 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seddonm1 commented Dec 23, 2020

Uh oh!

seddonm1 commented Dec 26, 2020

Uh oh!

Dandandan commented Dec 26, 2020

Uh oh!

seddonm1 commented Dec 26, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Dandandan commented Dec 23, 2020 •

edited

Loading

seddonm1 commented Dec 23, 2020 •

edited

Loading