Support more additional formats (i.e. fst, feather, etc?) #3

Open · 1 task done
cboettig opened this issue Jun 27, 2018 · 13 comments

cboettig (Member) commented Jun 27, 2018

  • arkdb currently hardwires write_tsv/read_tsv for I/O. This should be pluggable.

Done: now supports readr and base utils table I/O.

It would be nice to take advantage of things like the faster speed of fst::read_fst / fst::write_fst, but it looks like fst does not support appending, which makes it impossible to stream in chunks.

richfitz (Member) commented Jul 18, 2018

Relevant issue for fst: fstpackage/fst#9

cboettig changed the title from "Support additional formats" to "Support more additional formats (i.e. fst, feather, etc?)" on Aug 2, 2018
cboettig (Member, Author) commented

Can also consider supporting non-DBI-based backends! e.g. LMDB, see the example in https://github.com/cboettig/taxalight/

cboettig reopened this Sep 17, 2020
1beb (Contributor) commented Sep 22, 2021

+arrow!

cboettig (Member, Author) commented

@1beb can you clarify your use case here?

The arrow R package can already stream between many text-based formats (csv, tsv, json, etc) and 'native' binary formats (feather, parquet, etc) with minimal memory use. Are you trying to go from an arrow data format to a DBI-compliant database format or vice-versa?
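For example, something along these lines (a sketch with hypothetical paths, untested) streams a directory of csvs into parquet without loading it all into memory:

```r
library(arrow)

# open_dataset() scans the csv files lazily, and write_dataset() streams
# the result out as parquet, so the full data never has to fit in memory.
ds <- open_dataset("csv_dir/", format = "csv")
write_dataset(ds, "parquet_dir/", format = "parquet")
```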

1beb (Contributor) commented Sep 22, 2021 via email

cboettig (Member, Author) commented

Just a note: if you already have parquet, the most recent version of duckdb can give you a native, full SQL / DBI interface directly to the parquet files, without even requiring an 'import' step: https://duckdb.org/docs/data/parquet. (If you have data in a native duckdb database, it can also write it out to parquet for you.)
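A quick sketch of both directions (hypothetical file, table, and column names):

```r
library(DBI)

# duckdb can query a parquet file in place, and can also export a
# table or view back out to parquet via COPY.
con <- dbConnect(duckdb::duckdb())

dbGetQuery(con, "SELECT count(*) FROM parquet_scan('flights.parquet')")

# Expose the file as a view and treat it like any other DBI table:
dbExecute(con, "CREATE VIEW flights AS SELECT * FROM parquet_scan('flights.parquet')")
dbGetQuery(con, "SELECT origin, count(*) AS n FROM flights GROUP BY origin")

# And the reverse direction, duckdb -> parquet:
dbExecute(con, "COPY (SELECT * FROM flights) TO 'flights_copy.parquet' (FORMAT PARQUET)")

dbDisconnect(con, shutdown = TRUE)
```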

If you have data stuck in some other database (postgres, etc.) and want to export it to parquet, I hear you. Unfortunately, parquet is a format that doesn't take well to processing in chunks (I believe a bunch of metadata is always placed at the end of the parquet file, so as far as I understand you can't stream to or append to it).

1beb (Contributor) commented Sep 23, 2021 via email

1beb (Contributor) commented Sep 23, 2021

Looking at the code, I'm thinking that here (https://github.com/ropensci/arkdb/blob/master/R/ark.R#L224) you could probably add an optional write to chunked parquet. You'd need to do some accounting for chunk naming if that mattered to you. I have a need, so I'll take a swing at it; if I get it working I'll post it as a PR.

cboettig (Member, Author) commented Sep 23, 2021

@1beb Sweet. Yeah, I definitely agree that a type-safe export format would be super nice, and parquet seems the logical choice.

In principle, you can ark to any output format for which you can define a custom streamable_table() function (https://github.com/ropensci/arkdb/blob/master/R/streamable_table.R); for .csv & friends, that's essentially just an append=TRUE. I suppose with parquet you could write each chunk out as a separate parquet file? I'm not particular about the naming. Definitely let me know how it goes!
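Something like this rough, untested sketch might be a starting point; I'm assuming the write callback receives the chunk, a path, and an omit_header flag as the readr methods in streamable_table.R do, so the exact argument names may differ:

```r
library(arrow)

# A sketch of a parquet streamable_table() for arkdb.
streamable_parquet <- function() {
  chunk <- 0L
  arkdb::streamable_table(
    read = function(file, ...) arrow::read_parquet(file, ...),
    write = function(x, path, omit_header) {
      # parquet can't be appended to, so write each chunk as its own
      # numbered file in a directory named after the table; the whole
      # directory can later be read back with arrow::open_dataset().
      chunk <<- chunk + 1L
      dir <- tools::file_path_sans_ext(path)
      dir.create(dir, showWarnings = FALSE, recursive = TRUE)
      arrow::write_parquet(x, file.path(dir, sprintf("part-%05d.parquet", chunk)))
    },
    extension = "parquet"
  )
}
```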

1beb (Contributor) commented Sep 23, 2021

Ah! I see it now. Great. This looks much more straightforward than I was expecting. Are you aware of a package or script that does a good job of maintaining reasonable defaults between a typical SQL schema and R's available types? Read a table, look at the schema, convert to colClasses.

Also, a type-safe format isn't just nice, it's required. Parquet files in a folder can be read with arrow::open_dataset(path), and it spits back an error if you have schema mismatches in your component files. A common example would be a logical vs. an integer column, where part of your db pull had sparse, missing data and the next part had at least one value.
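For example (toy data, hypothetical column names), this is the kind of mismatch that trips it up:

```r
library(arrow)

# `flag` is an all-NA logical in one chunk file and an integer in the next.
dir.create("chunks", showWarnings = FALSE)
write_parquet(data.frame(id = 1:3, flag = NA), "chunks/chunk-1.parquet")
write_parquet(data.frame(id = 4:6, flag = c(0L, 1L, 1L)), "chunks/chunk-2.parquet")

ds <- open_dataset("chunks")
# Collecting errors (or behaves inconsistently, depending on the arrow
# version) because the component files disagree on the type of `flag`.
try(dplyr::collect(ds))
```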

cboettig (Member, Author) commented

👍

> Are you aware of a package or script that does a good job of maintaining reasonable defaults between a typical SQL schema and R's available types? Read a table, look at the schema, convert to colClasses.

Not entirely sure I follow this. It should be built into the database interface, right? e.g. DBI::dbReadTable() / DBI::dbWriteTable() etc. handle the appropriate types.

> Also, a type-safe format isn't just nice, it's required.

I totally hear you on this. Still, the UNIX philosophy is "plain text is the universal interface", and the permanent-archive folks are still all in on it, since binary formats that encode types come and go. Date/time encodings in particular have had a rough time of it. The classicist would say there's no conflict between logical vs. integer if you parse it all as text. ✋ But I hear you, I hear you, I live in the real world too and we need types. I think if we stick with DBI <-> parquet, though, this is all handled pretty well already, right?

1beb (Contributor) commented Sep 24, 2021

cboettig pushed a commit that referenced this issue on Nov 29, 2021: "Numbering output files instead of random string naming"