Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Nov 18, 2020

Changes infer_field_schema to parse manually.

lazy_static is no longer needed as (direct) dependency.

Probably could be a bit faster as well.

@Dandandan Dandandan changed the title ARROW-10649: [Rust] Parse manually, remove lazy static dependency ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency Nov 18, 2020
@github-actions
Copy link

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me -- what do you think @andygrove or @jorgecarleitao ?

@alamb
Copy link
Contributor

alamb commented Nov 19, 2020

Thank you @Dandandan

@alamb
Copy link
Contributor

alamb commented Nov 19, 2020

CI failure seems unrelated: https://github.com/apache/arrow/pull/8710/checks?check_run_id=1421269108

Post job cleanup.
/bin/tar -cz -f /home/runner/work/_temp/0ba3abcf-1c45-4bd4-b0d9-f3334c9cc174/cache.tgz -C /home/runner/work/arrow/arrow/.docker .
Warning: Cache service responded with 400 during chunk upload.
events.js:187
      throw er; // Unhandled 'error' event
      ^

restarting

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also happy with this approach. I don't have time to benchmark it (I normally run a binary through a profiler), but maybe I'll do it in a few days even if this is merged by then.

@nevi-me
Copy link
Contributor

nevi-me commented Nov 19, 2020

Before we remove lazy_static, how would we also remove it in #8611? CC @Jibbow

@Dandandan
Copy link
Contributor Author

Related:#8714

@Dandandan
Copy link
Contributor Author

Before we remove lazy_static, how would we also remove it in #8611? CC @Jibbow

I think it can reuse the structure here and also use the all digit function.

@andizimmerer
Copy link
Contributor

regex + lazy_static is somewhat a nice combination, but I agree that we could also recognize dates without those two libraries. But instead of using all_digit() and manually dissecting the string, we could also apply chrono's parse() function which returns a ParseResult and check whether parsing was successful or not. Opinions on that?
Also, I like #8714

@Dandandan
Copy link
Contributor Author

regex + lazy_static is somewhat a nice combination, but I agree that we could also recognize dates without those two libraries. But instead of using all_digit() and manually dissecting the string, we could also apply chrono's parse() function which returns a ParseResult and check whether parsing was successful or not. Opinions on that?

I think for the parsing (not recognizing) the dates itself that makes sense. I think parsing usually is slower than only matching it so that might be something to consider here?

@nevi-me
Copy link
Contributor

nevi-me commented Nov 21, 2020

@alamb @jorgecarleitao should we complete #8611 first, so that we don't remove lazy_static and regex if we can't find an alternative for dates?
We could also in future take a similar approach to C++ and Pandas, where the user is expected to provide a list of columns that should be parsed as temporal types, and the formats that should be used. I think this is more flexible, and it would allow us to support Time by parsing HH:mm:{ss}.

Your thoughts @Jibbow @Dandandan ?

@andizimmerer
Copy link
Contributor

Sounds good!

@Dandandan
Copy link
Contributor Author

Sure!

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Nov 21, 2020

  • Converting CSV -> StringArray -> [Type]Array is not recommended, as it forces us to load everything in memory, even if there are shorter representations. Therefore, we really need a way to build arrays out of CSV columns.

  • CSV is parsed as rows, but arrow is column-based. Therefore, there will need to be a pivot of the data at some point.

My feeling is that there are wildly different specs out there into how we should convert a CSV column into an Array. IMO we should not try to solve all those use-cases ourselves and instead offer users the freedom to choose, as well as common utilities.

As such, one idea is to offer a way to plugin that allow users to parse CSV column into [Type]Array, and offer a default offering.

Since these are stateless, one simple idea is have the CSV reader accept a trait with two functions:

infer: Fn(rows: &[StringRecord]) -> Vec<Option<DataType>>;
convert: Fn(data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
# or something like this

This signature indicates that:

  1. The function transverses rows
  2. the function is falible
  3. the resulting array is dynamic

This allows the user to e.g. make unparsable rows as nulls, adopt specific notations for CSV files that are (for them) interoperable with Arrow, etc.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 28, 2020
@Dandandan
Copy link
Contributor Author

Let's park this one for now.

@Dandandan Dandandan closed this Dec 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Rust needs-rebase A PR that needs to be rebased by the author

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants