Skip to content

Conversation

@seddonm1
Copy link
Contributor

The current CSV reader cannot parse strings to types with leading/trailing white spaces as the parsers are very strict. This means being able to read and parse the tpch-dbgen included answers files is not possible.

The underlying csv crate supports a four different behaviors for trimming strings:

  • None (default): does no trimming.
  • Headers: trim only header fields.
  • Fields: trim only field values.
  • All: trim both headers and field values.

Rather than exposing all these options and forcing users to understand the underlying csv crate this PR simplifies this decision to boolean: None (false) or All (true) while retaining the default false behavior.

@github-actions
Copy link

@Dandandan
Copy link
Contributor

Dandandan commented Dec 23, 2020

Thanks for the PR @seddonm1!
Wouldn't it be better to keep the CSV reader simple and do the trim after loading the CSV (in Arrow/DataFusion)? There it could be a simple trim(col) too. I the use case is relatively limited where you actually want to apply trimming on all columns.

I'm hesitant too to depend on the "more advanced" csv crate features, I think at some point it makes sense to utilize csv_core instead (for better performance).

EDIT: Ah I didn't read your message carefully... Will have a look

@seddonm1
Copy link
Contributor Author

Thanks for the PR @seddonm1!
Wouldn't it be better to keep the CSV reader simple and do the trim after loading the CSV (in Arrow/DataFusion)? There it could be a simple trim(col) too. I the use case is relatively limited where you actually want to apply trimming on all columns.

I'm hesitant too to depend on the "more advanced" csv crate features, I think at some point it makes sense to utilize csv_core instead (for better performance).

Yes it may be so I think this is up for discussion.

I have added the trim function in #8966 but the actual flow is read csv -> trim -> cast.

@Dandandan
Copy link
Contributor

Are you aware of any other parser that does similar trimming like the csv crate?
Looking at pandas for example it seems that it also doesn't have the option to trim whitespaces from header / values directly.

@seddonm1
Copy link
Contributor Author

seddonm1 commented Dec 23, 2020

I was referencing the underlying library that Spark uses: https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/csv/CsvParser.java#L116

Here are the Spark CSV reader options: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame

Of course you could go down a rabbit hole trying to support all use cases.

I am more than happy to kill this PR if we make the decision that it doesn't belong here.

@seddonm1
Copy link
Contributor Author

It actually looks like we are missing the ability to read a csv and return the values all strings (similar to how infer schema works) without also trying to parse the values or having to provide an all DataType::Utf8 schema.

@seddonm1
Copy link
Contributor Author

@Dandandan I have taken the read -> trim -> parse approach here: #9015

I think I will close this an open a new ticket that allows the CSVReader to infer number of columns (with named from headers if provided) but return all DataType::Utf8. Thoughts?

@Dandandan
Copy link
Contributor

I think that is a great idea @seddonm1 . We can always revisit later if it turns out we really need it in the parser! Thank you!

@seddonm1
Copy link
Contributor Author

Closed in favor of https://issues.apache.org/jira/browse/ARROW-11036

@seddonm1 seddonm1 closed this Dec 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants