ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency #8710
alamb
left a comment
Looks pretty good to me -- what do you think @andygrove or @jorgecarleitao ?
Thank you @Dandandan
CI failure seems unrelated: https://github.com/apache/arrow/pull/8710/checks?check_run_id=1421269108 (restarting).
nevi-me
left a comment
I'm also happy with this approach. I don't have time to benchmark it (I normally run a binary through a profiler), but maybe I'll do it in a few days even if this is merged by then.
Before we remove
Related: #8714
I think it can reuse the structure here and also use the all-digits function.
I think that makes sense for parsing the dates themselves, as opposed to only recognizing them. Parsing is usually slower than just matching, so that might be something to consider here.
@alamb @jorgecarleitao should we complete #8611 first, so that we don't remove Your thoughts @Jibbow @Dandandan ?
Sounds good!
Sure!
My feeling is that there are wildly different specs out there for how we should convert a CSV column into an Array. IMO we should not try to solve all those use cases ourselves and should instead offer users the freedom to choose, as well as common utilities.
As such, one idea is to offer a plugin mechanism that allows users to parse a CSV column into
Since these are stateless, one simple idea is to have the CSV reader accept a trait with two functions:
    infer: Fn(rows: &[StringRecord]) -> Vec<Option<DataType>>;
    convert: Fn(data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
    # or something like this
This signature indicates that:
This allows the user to, e.g., mark unparsable rows as nulls, adopt specific notations for CSV files that are (for them) interoperable with Arrow, etc.
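To make the plugin idea above more concrete, here is a minimal sketch of what such a trait could look like. It is not the Arrow API: `CsvParser`, `LenientParser`, `Column`, and `Row` are hypothetical names, and the small `DataType` and `Column` enums are stand-ins for Arrow's `DataType` and `ArrayRef` (and `Row` for the csv crate's `StringRecord`) so that the example compiles on its own. The sample implementation turns unparsable cells into nulls, which is one of the policies mentioned above.

```rust
/// Stand-in for Arrow's `DataType` (assumption: only two variants needed here).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

/// Stand-in for `ArrayRef`: a nullable column of one type.
#[derive(Debug, PartialEq)]
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

/// Stand-in for the csv crate's `StringRecord`.
type Row = Vec<String>;

/// Hypothetical plugin trait mirroring the two proposed functions.
trait CsvParser {
    /// One entry per column; `None` means "no type could be inferred".
    fn infer(&self, rows: &[Row]) -> Vec<Option<DataType>>;
    /// Build the column at `col_idx` with the requested type.
    fn convert(&self, data_type: &DataType, rows: &[Row], col_idx: usize)
        -> Result<Column, String>;
}

/// Example policy: cells that fail to parse become nulls instead of errors.
struct LenientParser;

impl CsvParser for LenientParser {
    fn infer(&self, rows: &[Row]) -> Vec<Option<DataType>> {
        let n_cols = rows.first().map_or(0, |r| r.len());
        (0..n_cols)
            .map(|c| {
                // Only non-empty cells count as evidence for a type.
                let mut cells = rows
                    .iter()
                    .filter_map(|r| r.get(c))
                    .filter(|s| !s.is_empty())
                    .peekable();
                if cells.peek().is_none() {
                    None
                } else if cells.all(|s| s.parse::<i64>().is_ok()) {
                    Some(DataType::Int64)
                } else {
                    Some(DataType::Utf8)
                }
            })
            .collect()
    }

    fn convert(&self, data_type: &DataType, rows: &[Row], col_idx: usize)
        -> Result<Column, String> {
        // A missing cell (short row) is treated as null.
        let cells = rows.iter().map(|r| r.get(col_idx).map(String::as_str));
        Ok(match data_type {
            DataType::Int64 => Column::Int64(
                cells.map(|c| c.and_then(|s| s.parse::<i64>().ok())).collect(),
            ),
            DataType::Utf8 => Column::Utf8(cells.map(|c| c.map(str::to_owned)).collect()),
        })
    }
}
```

With a design along these lines, the reader would only call `infer` and `convert`; a stricter user could supply an implementation whose `convert` returns `Err` on the first bad cell instead.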
Let's park this one for now.
Changes infer_field_schema to parse manually.
lazy_static is no longer needed as a (direct) dependency.
This is probably a bit faster as well.
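As a rough illustration of what "parse manually" can mean, the sketch below classifies a single CSV field by inspecting its bytes rather than matching precompiled regular expressions. It is not the code from this PR: `InferredType`, `all_digits`, and `infer_field_type` are hypothetical names, the enum is a stand-in for Arrow's `DataType`, and the accepted formats (case-insensitive booleans, optionally negative integers, plain `digits.digits` decimals) are assumptions.

```rust
/// Stand-in for Arrow's `DataType`; only the variants this sketch needs.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InferredType {
    Boolean,
    Int64,
    Float64,
    Utf8,
}

/// True when `s` is non-empty and consists only of ASCII digits.
fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Classify one CSV field without regular expressions.
fn infer_field_type(field: &str) -> InferredType {
    if field.eq_ignore_ascii_case("true") || field.eq_ignore_ascii_case("false") {
        return InferredType::Boolean;
    }
    // Allow a single optional leading minus sign.
    let unsigned = field.strip_prefix('-').unwrap_or(field);
    if all_digits(unsigned) {
        return InferredType::Int64;
    }
    // A decimal is digits, one dot, then digits.
    if let Some((int_part, frac_part)) = unsigned.split_once('.') {
        if all_digits(int_part) && all_digits(frac_part) {
            return InferredType::Float64;
        }
    }
    // Anything else falls back to a string column.
    InferredType::Utf8
}
```

Because nothing has to be compiled or cached up front, a function like this needs no lazily initialized statics, which is why a dependency such as lazy_static can be dropped.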