ARROW-11013: [Rust][DataFusion] Add trim to CsvReader #9001
|
Thanks for the PR @seddonm1! I'm hesitant too to depend on the "more advanced" csv crate features, I think at some point it makes sense to utilize EDIT: Ah I didn't read your message carefully... Will have a look |
Yes it may be so I think this is up for discussion. I have added the |
|
Are you aware of any other parser that does similar trimming like the csv crate? |
|
I was referencing the underlying library that Spark uses: https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/csv/CsvParser.java#L116 Here are the Spark CSV reader options: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame Of course you could go down a rabbit hole trying to support all use cases. I am more than happy to kill this PR if we make the decision that it doesn't belong here. |
|
It actually looks like we are missing the ability to read a csv and return all values as strings (similar to how infer schema works) without also trying to parse the values or having to provide an all DataType::Utf8 schema. |
|
@Dandandan I have taken the read -> trim -> parse approach here: #9015 I think I will close this and open a new ticket that allows the CSVReader to infer the number of columns (with names from headers if provided) but return all DataType::Utf8. Thoughts? |
|
I think that is a great idea @seddonm1 . We can always revisit later if it turns out we really need it in the parser! Thank you! |
|
Closed in favor of https://issues.apache.org/jira/browse/ARROW-11036 |
The current CSV reader cannot parse string values with leading or trailing whitespace into typed columns, because the type parsers are very strict. This means it is not possible to read and parse the answers files included with tpch-dbgen.
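To illustrate the strictness described above, here is a minimal sketch using only the Rust standard library. It is not the Arrow CSV reader itself; `parse_i64_field` is a hypothetical helper introduced for illustration, and it assumes the reader's type parsing behaves like `str::parse`, which rejects padded input.

```rust
/// Hypothetical helper (not part of Arrow): parse a CSV field as i64,
/// optionally trimming surrounding whitespace first.
fn parse_i64_field(field: &str, trim: bool) -> Option<i64> {
    let value = if trim { field.trim() } else { field };
    // `str::parse::<i64>` does not skip whitespace, so padded input fails.
    value.parse::<i64>().ok()
}

fn main() {
    // Without trimming, the padded field is rejected.
    println!("{:?}", parse_i64_field(" 42 ", false));
    // With trimming, the same field parses.
    println!("{:?}", parse_i64_field(" 42 ", true));
}
```

The same pattern applies to other numeric types: trimming has to happen before the typed parse, either in the CSV layer or in a separate step.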
The underlying csv crate supports four different behaviors for trimming strings:
- None (default): does no trimming.
- Headers: trim only header fields.
- Fields: trim only field values.
- All: trim both headers and field values.
Rather than exposing all these options and forcing users to understand the underlying csv crate, this PR simplifies the decision to a boolean: None (false) or All (true), while retaining the default false behavior.
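The boolean simplification can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: the `Trim` enum here is a local stand-in that mirrors the four modes listed above, and `trim_from_flag` and `apply_trim` are hypothetical names.

```rust
/// Local stand-in for the four trimming modes (assumption for illustration;
/// the real type lives in the csv crate).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Trim {
    None,
    Headers,
    Fields,
    All,
}

/// Hypothetical mapping from the boolean option to a trimming mode:
/// false keeps the default (no trimming), true trims headers and fields.
fn trim_from_flag(trim: bool) -> Trim {
    if trim { Trim::All } else { Trim::None }
}

/// Hypothetical helper showing what each mode means for a single cell.
fn apply_trim(mode: Trim, is_header: bool, field: &str) -> &str {
    let should_trim = match mode {
        Trim::None => false,
        Trim::Headers => is_header,
        Trim::Fields => !is_header,
        Trim::All => true,
    };
    if should_trim { field.trim() } else { field }
}

fn main() {
    let mode = trim_from_flag(true);
    println!("{:?}", mode);
    println!("{:?}", apply_trim(mode, false, "  1.5 "));
}
```

Exposing only the two extremes keeps the reader option simple, at the cost of not being able to trim headers and field values independently.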