ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency #8710

Dandandan · 2020-11-18T23:38:58Z

Changes infer_field_schema to parse manually.

lazy_static is no longer needed as a (direct) dependency.

Probably could be a bit faster as well.
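
For readers without the diff at hand, here is a minimal sketch of what manual inference could look like. It is not the code from this PR: `DataType` is a simplified stand-in for arrow's type, `all_digit` is a hypothetical helper, and the accepted grammar (optional minus sign, digits, optional fraction) is an assumption.

```rust
/// Simplified stand-in for arrow's `DataType` (assumption: only these four matter here).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Boolean,
    Int64,
    Float64,
    Utf8,
}

/// Hypothetical helper: true if `s` is non-empty and made only of ASCII digits.
fn all_digit(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Infer a field's type by inspecting the string by hand, without a regex.
fn infer_field_schema(value: &str) -> DataType {
    if value.eq_ignore_ascii_case("true") || value.eq_ignore_ascii_case("false") {
        return DataType::Boolean;
    }
    // Accept an optional leading minus sign.
    let unsigned = value.strip_prefix('-').unwrap_or(value);
    // Split on the first '.', giving the integer part and an optional fraction.
    let mut parts = unsigned.splitn(2, '.');
    match (parts.next(), parts.next()) {
        (Some(int), None) if all_digit(int) => DataType::Int64,
        (Some(int), Some(frac)) if all_digit(int) && all_digit(frac) => DataType::Float64,
        // Anything else (empty string, "1.", "1.2.3", text) stays a string.
        _ => DataType::Utf8,
    }
}
```

Matching on the split parts keeps the integer and float cases side by side, and no statically compiled regex is needed.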

github-actions · 2020-11-18T23:46:29Z

https://issues.apache.org/jira/browse/ARROW-10649

alamb

Looks pretty good to me -- what do you think @andygrove or @jorgecarleitao ?

rust/arrow/src/csv/reader.rs

alamb · 2020-11-19T00:03:08Z

Thank you @Dandandan

alamb · 2020-11-19T00:03:47Z

CI failure seems unrelated: https://github.com/apache/arrow/pull/8710/checks?check_run_id=1421269108

Post job cleanup.
/bin/tar -cz -f /home/runner/work/_temp/0ba3abcf-1c45-4bd4-b0d9-f3334c9cc174/cache.tgz -C /home/runner/work/arrow/arrow/.docker .
Warning: Cache service responded with 400 during chunk upload.
events.js:187
      throw er; // Unhandled 'error' event
      ^

restarting

nevi-me

I'm also happy with this approach. I don't have time to benchmark it (I normally run a binary through a profiler), but maybe I'll do it in a few days even if this is merged by then.

nevi-me · 2020-11-19T01:44:13Z

Before we remove lazy_static, how would we also remove it in #8611? CC @Jibbow

Dandandan · 2020-11-19T12:09:14Z

Related: #8714

Dandandan · 2020-11-19T12:10:24Z

> Before we remove lazy_static, how would we also remove it in #8611? CC @Jibbow

I think it can reuse the structure here and also use the all digit function.

andizimmerer · 2020-11-19T12:48:35Z

regex + lazy_static is somewhat a nice combination, but I agree that we could also recognize dates without those two libraries. But instead of using all_digit() and manually dissecting the string, we could also apply chrono's parse() function which returns a ParseResult and check whether parsing was successful or not. Opinions on that?
Also, I like #8714

Dandandan · 2020-11-19T13:53:16Z

> regex + lazy_static is somewhat a nice combination, but I agree that we could also recognize dates without those two libraries. But instead of using all_digit() and manually dissecting the string, we could also apply chrono's parse() function which returns a ParseResult and check whether parsing was successful or not. Opinions on that?

I think that makes sense for parsing the dates themselves (as opposed to recognizing them). Parsing is usually slower than only matching, though, so that might be something to consider here?
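
To make the "matching, not parsing" idea concrete, here is a rough sketch of recognizing a date by shape only. The `YYYY-MM-DD` layout, the function name `looks_like_date`, and the `all_digit` helper are assumptions for illustration, not code from this PR or from #8611.

```rust
/// Hypothetical helper: true if `s` is non-empty and made only of ASCII digits.
fn all_digit(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Recognize the shape `YYYY-MM-DD` without building a date value.
/// This checks layout only: "2020-13-40" passes, so the later conversion
/// step still has to validate the actual calendar date.
fn looks_like_date(s: &str) -> bool {
    let mut parts = s.split('-');
    match (parts.next(), parts.next(), parts.next(), parts.next()) {
        // Exactly three dash-separated parts with the expected widths.
        (Some(y), Some(m), Some(d), None) => {
            y.len() == 4
                && m.len() == 2
                && d.len() == 2
                && all_digit(y)
                && all_digit(m)
                && all_digit(d)
        }
        _ => false,
    }
}
```

The trade-off is that a shape check can accept strings that a real parser would reject, so it only suits the inference step.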

nevi-me · 2020-11-21T08:49:38Z

@alamb @jorgecarleitao should we complete #8611 first, so that we don't remove lazy_static and regex if we can't find an alternative for dates?
We could also in future take a similar approach to C++ and Pandas, where the user is expected to provide a list of columns that should be parsed as temporal types, and the formats that should be used. I think this is more flexible, and it would allow us to support Time by parsing HH:mm:{ss}.

Your thoughts @Jibbow @Dandandan ?
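
A rough sketch of what user-declared temporal columns might look like. Everything here is hypothetical: `TemporalFormat`, `column_format`, and `parse_time_seconds` are illustration names, not an existing arrow API, and the time parser is deliberately lenient (it accepts `8:5` as well as `08:05`).

```rust
use std::collections::HashMap;

/// Hypothetical reader option: the format a user declares for a column.
#[derive(Debug, Clone, PartialEq)]
enum TemporalFormat {
    /// `YYYY-MM-DD`
    Date,
    /// `HH:mm` with optional `:ss`
    Time,
}

/// Look up the user's declared format for a column, if any.
/// `None` means the reader would fall back to its normal inference.
fn column_format<'a>(
    opts: &'a HashMap<String, TemporalFormat>,
    name: &str,
) -> Option<&'a TemporalFormat> {
    opts.get(name)
}

/// Parse `HH:mm` or `HH:mm:ss` into seconds since midnight.
fn parse_time_seconds(s: &str) -> Option<u32> {
    let mut parts = s.split(':');
    let h: u32 = parts.next()?.parse().ok()?;
    let m: u32 = parts.next()?.parse().ok()?;
    // Seconds are optional and default to zero.
    let sec: u32 = match parts.next() {
        Some(p) => p.parse().ok()?,
        None => 0,
    };
    // Reject trailing parts and out-of-range components.
    if parts.next().is_some() || h > 23 || m > 59 || sec > 59 {
        return None;
    }
    Some(h * 3600 + m * 60 + sec)
}
```

With explicit declarations like this, inference never has to guess whether a column is temporal.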

andizimmerer · 2020-11-21T08:57:36Z

Sounds good!

Dandandan · 2020-11-21T09:00:28Z

Sure!

jorgecarleitao · 2020-11-21T09:33:10Z

Converting CSV -> StringArray -> [Type]Array is not recommended, as it forces us to load everything in memory, even if there are shorter representations. Therefore, we really need a way to build arrays out of CSV columns.
CSV is parsed as rows, but arrow is column-based. Therefore, there will need to be a pivot of the data at some point.

My feeling is that there are wildly different specs out there on how we should convert a CSV column into an Array. IMO we should not try to solve all those use-cases ourselves and instead offer users the freedom to choose, as well as common utilities.

As such, one idea is to offer a plug-in point that allows users to parse a CSV column into a [Type]Array, and to ship a default implementation.

Since these are stateless, one simple idea is to have the CSV reader accept a trait with two functions:

infer: Fn(rows: &[StringRecord]) -> Vec<Option<DataType>>;
convert: Fn(data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
# or something like this

This signature indicates that:

the function traverses rows
the function is fallible
the resulting array is dynamic

This allows the user to, e.g., treat unparsable values as nulls, adopt specific notations for CSV files that are (for them) interoperable with Arrow, etc.
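
A minimal sketch of how such a trait could look, under stated assumptions: `Row`, `DataType`, and `Column` are simplified stand-ins for `csv::StringRecord`, arrow's `DataType`, and `ArrayRef`; the trait name `CsvParser` and the `NullOnError` policy are hypothetical and only illustrate the two-function idea.

```rust
/// Stand-in for `csv::StringRecord` (assumption: one row as owned strings).
type Row = Vec<String>;

/// Simplified stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

/// Simplified stand-in for a dynamically typed `ArrayRef`.
#[derive(Debug, PartialEq)]
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

/// Hypothetical plug-in point: the reader would call these two stateless functions.
trait CsvParser {
    /// Propose a type per column; `None` means "could not decide".
    fn infer(&self, rows: &[Row]) -> Vec<Option<DataType>>;
    /// Pivot one column out of the rows and build a typed column from it.
    fn convert(&self, data_type: &DataType, rows: &[Row], col_idx: usize) -> Result<Column, String>;
}

/// Example policy: values that fail to parse become nulls instead of errors.
struct NullOnError;

impl CsvParser for NullOnError {
    fn infer(&self, rows: &[Row]) -> Vec<Option<DataType>> {
        // Use the first row to decide how many columns there are.
        let width = rows.first().map_or(0, |r| r.len());
        (0..width)
            .map(|i| {
                let all_int = rows
                    .iter()
                    .all(|r| r.get(i).map_or(false, |v| v.parse::<i64>().is_ok()));
                Some(if all_int { DataType::Int64 } else { DataType::Utf8 })
            })
            .collect()
    }

    fn convert(&self, data_type: &DataType, rows: &[Row], col_idx: usize) -> Result<Column, String> {
        // Missing cells (short rows) are treated as nulls as well.
        let cells = rows.iter().map(|r| r.get(col_idx));
        Ok(match data_type {
            DataType::Int64 => {
                Column::Int64(cells.map(|c| c.and_then(|v| v.parse::<i64>().ok())).collect())
            }
            DataType::Utf8 => Column::Utf8(cells.map(|c| c.cloned()).collect()),
        })
    }
}
```

A stricter policy could return `Err` from `convert` on the first bad value; the row-to-column pivot stays inside `convert` either way.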

Dandandan · 2020-12-07T07:50:17Z

Let's park this one for now.

Parse manually, remove lazy static dependency

bf38455

github-actions bot added the Component: Rust label Nov 18, 2020

Dandandan changed the title ~~ARROW-10649: [Rust] Parse manually, remove lazy static dependency~~ ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency Nov 18, 2020

alamb approved these changes Nov 19, 2020

Dandandan added 5 commits November 19, 2020 01:27

Test against remaining part in float

2a8dc81

Use matching on parts for clarity and fixing Int64 match

ad9de4c

Use split iterator in match

6a9f60f

Use eq_ignore_ascii_case

8305c1e

Clippy

c7901ea

nevi-me approved these changes Nov 19, 2020

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 28, 2020

Dandandan closed this Dec 7, 2020
