Support more additional formats (i.e. fst, feather, etc?) #3

cboettig · 2018-06-27T22:54:06Z

arkdb currently hardwires write_tsv/read_tsv for I/O. This should be pluggable.

done, now supports readr and base utils table I/O

Would be nice to take advantage of things like faster speed of fst::read_fst /fst::write_fst, but looks like fst does not support the ability to append, making it impossible to stream in chunks.

The text was updated successfully, but these errors were encountered:

richfitz · 2018-07-18T08:09:29Z

Relevant issue for fst: fstpackage/fst#9

richfitz · 2018-07-18T08:09:52Z

...and fstpackage/fst#24
fstpackage/fst#153
fstpackage/fst#91

cboettig · 2020-09-17T05:39:50Z

can also consider supporting non-DBI-based backends! e.g. LMDB, see example in https://github.com/cboettig/taxalight/

1beb · 2021-09-22T20:36:27Z

+arrow!

cboettig · 2021-09-22T23:03:11Z

@1beb can you clarify your use case here?

The arrow R package can already stream between many text-based formats (csv, tsv, json, etc) and 'native' binary formats (feather, parquet, etc) with minimal memory use. Are you trying to go from an arrow data format to a DBI-compliant database format or vice-versa?

1beb · 2021-09-22T23:12:33Z

Database <> Parquet would be nice.

…

On Wed, Sep 22, 2021, 19:03 Carl Boettiger ***@***.***> wrote: @1beb <https://github.com/1beb> can you clarify your use case here? The arrow R package can already stream between many text-based formats (csv, tsv, json, etc) and 'native' binary formats (feather, parquet, etc) with minimal memory use. Are you trying to go from an arrow data format to a DBI-compliant database format or vice-versa? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADG7Y36MMYNUAWF3MRRVODUDJODVANCNFSM4FHKAOUA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

cboettig · 2021-09-23T00:10:27Z

Just a note: if you have parquet already, the most recent version of duckdb can give you a native full SQL / DBI interface directly to the parquet files, without even requiring an 'import' step: https://duckdb.org/docs/data/parquet. (If you have data in native duckdb database, it can also write it to parquet for you).

If you have data stuck in some other database (postgres etc) and want to export it to parquet, I hear you, unfortunately, parquet is a format that doesn't take well to processing in chunks (I believe a bunch of metadata is always placed at the end of the parquet file, so you can't stream / append to it, as far as I understand).

1beb · 2021-09-23T00:28:14Z

The latter is one of my use cases (db to parquet) my current strategy is breaking into variable based chunks then writing. But when you chunk, often field types get dropped so you have to be explicit. It would be nice to be able to stream to chunk_X.parquet, and be able to specify a row count. I'm trying to avoid writing code for offset/fetch SQL with my bare hands. Was thinking the `ark` function might work if the write method could be parameterized as a function argument. But that doesn't look like how it's designed. Another problem is that this targets an entire table, whereas at my scale I'd need to target subsets of a table partition to fit in RAM. I didn't notice a way to a filter. What's the target size of DB that this package is generally used for? PS: Great pointer to duckdb, I'm using it for some production cases and it's really great with parquet!

1beb · 2021-09-23T00:46:00Z

Looking at the code, I'm thinking that here: https://github.com/ropensci/arkdb/blob/master/R/ark.R#L224 you could probably add an optional write to a chunked parquet. You'd need to do some accounting for chunk naming if it mattered to you. I have a need - I'll take a swing at it, if I get it working I'll post it as a PR.

cboettig · 2021-09-23T03:01:52Z

@1beb sweet. yeah, definitely agree that a type-safe export format would be super nice, and parquet seems the logical choice.

in principle, you can ark to any output format for which you can define a custom streamable_table() function: https://github.com/ropensci/arkdb/blob/master/R/streamable_table.R, for .csv & friends, that's essentially just an append=TRUE. I suppose with parquet you can just write each chunk out as a separate parquet file perhaps? I'm not particular about the naming. Definitely let me know how it goes!

1beb · 2021-09-23T05:45:03Z

Ah! I see it now. Great. This looks much more straightforward than I was expecting. Are you aware of a package or script that does a good job of maintaining reasonable defaults between typical SQL schema and R's available types? Read a table, look at the schema, convert to colClasses.

Also a type-safe format isn't just nice it's required. Parquet files in a folder can be read with arrow::open_dataset(path) and it spits back if you have schema mismatches in your component files. A common example would be a logical vs. an integer where part of your db pull had sparse, missing data and the next part had at least one value.

cboettig · 2021-09-23T16:38:08Z

👍

Are you aware of a package or script that does a good job of maintaining reasonable defaults between typical SQL schema and R's available types? Read a table, look at the schema, convert to colClasses.

Not entirely sure I follow this. It should be built in to the database interface, right? e.g. DBI::dbReadTable() / DBI::dbWriteTable() etc handle the appropriate types.

Also a type-safe format isn't just nice it's required.

I totally here you on this. Still, UNIX philosophy is "plain text is the universal interface" and the permanent archive folks are still all in on it as binary formats which encode types come and go. Date/time encodings in particular have had a rough time of it. The classicist would say: no conflict between logical vs integer if you parse it all as text. ✋ but I hear you, I hear you, I live in the real world too and we need types. I think if we stick with DBI <-> parquet though this is all handled already pretty well though, right?

1beb · 2021-09-24T16:49:45Z

Note to self: https://cran.r-project.org/web/packages/DBI/vignettes/spec.html

Numbering output files instead of random string naming

cboettig mentioned this issue Jul 16, 2018

submission: arkdb ropensci/software-review#224

Closed

19 tasks

richfitz mentioned this issue Jul 18, 2018

Proof-of-concept pluggable i/o #5

Closed

6 tasks

cboettig mentioned this issue Jul 21, 2018

Rich FitzJohn Onboarding review #6

Closed

12 tasks

cboettig changed the title ~~Support additional formats~~ Support more additional formats (i.e. fst, feather, etc?) Aug 2, 2018

cboettig closed this as completed Sep 17, 2020

cboettig reopened this Sep 17, 2020

1beb mentioned this issue Sep 23, 2021

Adding parquet functionality and filter injection for ark. #40

Merged

11 tasks

cboettig pushed a commit that referenced this issue Nov 29, 2021

Merge pull request #3 from yougov-datascience/main

2a4b83d

Numbering output files instead of random string naming

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support more additional formats (i.e. fst, feather, etc?) #3

Support more additional formats (i.e. fst, feather, etc?) #3

cboettig commented Jun 27, 2018 •

edited

Loading

richfitz commented Jul 18, 2018

richfitz commented Jul 18, 2018 •

edited

Loading

cboettig commented Sep 17, 2020

1beb commented Sep 22, 2021

cboettig commented Sep 22, 2021

1beb commented Sep 22, 2021 via email

cboettig commented Sep 23, 2021

1beb commented Sep 23, 2021 via email •

edited

Loading

1beb commented Sep 23, 2021

cboettig commented Sep 23, 2021 •

edited

Loading

1beb commented Sep 23, 2021 •

edited

Loading

cboettig commented Sep 23, 2021

1beb commented Sep 24, 2021

Support more additional formats (i.e. fst, feather, etc?) #3

Support more additional formats (i.e. fst, feather, etc?) #3

Comments

cboettig commented Jun 27, 2018 • edited Loading

richfitz commented Jul 18, 2018

richfitz commented Jul 18, 2018 • edited Loading

cboettig commented Sep 17, 2020

1beb commented Sep 22, 2021

cboettig commented Sep 22, 2021

1beb commented Sep 22, 2021 via email

cboettig commented Sep 23, 2021

1beb commented Sep 23, 2021 via email • edited Loading

1beb commented Sep 23, 2021

cboettig commented Sep 23, 2021 • edited Loading

1beb commented Sep 23, 2021 • edited Loading

cboettig commented Sep 23, 2021

1beb commented Sep 24, 2021

cboettig commented Jun 27, 2018 •

edited

Loading

richfitz commented Jul 18, 2018 •

edited

Loading

1beb commented Sep 23, 2021 via email •

edited

Loading

cboettig commented Sep 23, 2021 •

edited

Loading

1beb commented Sep 23, 2021 •

edited

Loading