diff --git a/codemeta.json b/codemeta.json
index 6572449..79b1fc6 100644
--- a/codemeta.json
+++ b/codemeta.json
@@ -237,7 +237,7 @@
   ],
   "releaseNotes": "https://github.com/ropensci/arkdb/blob/master/NEWS.md",
   "readme": "https://github.com/ropensci/arkdb/blob/master/README.md",
-  "fileSize": "19.156KB",
+  "fileSize": "19.162KB",
   "contIntegration": [
     "https://travis-ci.org/cboettig/arkdb",
     "https://codecov.io/github/cboettig/arkdb?branch=master",
diff --git a/docs/CODE_OF_CONDUCT.html b/docs/CODE_OF_CONDUCT.html
index e108acf..ec084ba 100644
--- a/docs/CODE_OF_CONDUCT.html
+++ b/docs/CODE_OF_CONDUCT.html
@@ -86,9 +86,6 @@
vignettes/arkdb.Rmd
arkdb.Rmd
You can install arkdb from GitHub with:
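A minimal sketch of the installation step, assuming the usual remotes-based GitHub install and the ropensci/arkdb repository referenced in codemeta.json above:
# install.packages("remotes")  # if not already installed
remotes::install_github("ropensci/arkdb")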
First, we’ll need an example database to work with. Conveniently, there is a nice example using the NYC flights data built into the dbplyr package.
tmp <- tempdir() # Or can be your working directory, "."
db <- dbplyr::nycflights13_sqlite(tmp)
-#> Caching nycflights db at /var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpTO4dP6/nycflights13.sqlite
+#> Caching nycflights db at /var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//RtmpHSUUIU/nycflights13.sqlite
#> Creating table: airlines
#> Creating table: airports
#> Creating table: flights
@@ -127,15 +140,15 @@
dir <- fs::dir_create(fs::path(tmp, "nycflights"))
ark(db, dir, lines = 50000)
#> Exporting airlines in 50000 line chunks:
-#> ...Done! (in 0.02021909 secs)
+#> ...Done! (in 0.006599903 secs)
#> Exporting airports in 50000 line chunks:
-#> ...Done! (in 0.0284009 secs)
+#> ...Done! (in 0.02018189 secs)
#> Exporting flights in 50000 line chunks:
-#> ...Done! (in 17.91478 secs)
+#> ...Done! (in 11.12319 secs)
#> Exporting planes in 50000 line chunks:
-#> ...Done! (in 0.06093788 secs)
+#> ...Done! (in 0.03582788 secs)
#> Exporting weather in 50000 line chunks:
-#> ...Done! (in 1.068066 secs)
+#> ...Done! (in 0.7464259 secs)
We can take a look and confirm the files have been written. Note that we can use fs::dir_info to get a nice snapshot of the file sizes. Compare the compressed sizes to the original database:
fs::dir_info(dir) %>%
select(path, size) %>%
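A sketch of how that size comparison might be completed, using dplyr verbs and fs helpers as assumptions rather than the vignette’s exact code (the database path is taken from the caching message above):
fs::dir_info(dir) %>%
  dplyr::select(path, size) %>%
  dplyr::mutate(path = fs::path_file(path))   # show just the file names
# size of the original SQLite database, for comparison
fs::file_size(fs::path(tmp, "nycflights13.sqlite"))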
@@ -162,20 +175,20 @@
As with ark, we can set the chunk size to control the memory footprint required:
unark(files, new_db, lines = 50000)
#> Importing airlines.tsv.bz2 in 50000 line chunks:
-#> ...Done! (in 0.01442003 secs)
+#> ...Done! (in 0.01069999 secs)
#> Importing airports.tsv.bz2 in 50000 line chunks:
-#> ...Done! (in 0.05904579 secs)
+#> ...Done! (in 0.0232439 secs)
#> Importing flights.tsv.bz2 in 50000 line chunks:
-#> ...Done! (in 11.08024 secs)
+#> ...Done! (in 7.854657 secs)
#> Importing planes.tsv.bz2 in 50000 line chunks:
-#> ...Done! (in 0.05352783 secs)
+#> ...Done! (in 0.03870487 secs)
#> Importing weather.tsv.bz2 in 50000 line chunks:
-#> ...Done! (in 0.456851 secs)
+#> ...Done! (in 0.376317 secs)
unark returns a dplyr database connection that we can use in the usual way:
tbl(new_db, "flights")
#> # Source: table<flights> [?? x 19]
#> # Database: sqlite 3.22.0
-#> # [/var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T/RtmpTO4dP6/local.sqlite]
+#> # [/var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T/RtmpHSUUIU/local.sqlite]
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2 830
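Since this is an ordinary dplyr connection, lazy verbs compose as usual; an illustrative query (not part of the vignette output above):
library(dplyr)
tbl(new_db, "flights") %>%
  filter(month == 1) %>%
  count(origin)   # executed inside the database, not in R's memory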
@@ -207,30 +220,30 @@
ark(db, dir,
streamable_table = streamable_base_csv())
#> Exporting airlines in 50000 line chunks:
-#> ...Done! (in 0.004613161 secs)
+#> ...Done! (in 0.004092932 secs)
#> Exporting airports in 50000 line chunks:
-#> ...Done! (in 0.02630806 secs)
+#> ...Done! (in 0.02115822 secs)
#> Exporting flights in 50000 line chunks:
-#> ...Done! (in 17.27024 secs)
+#> ...Done! (in 12.96726 secs)
#> Exporting planes in 50000 line chunks:
-#> ...Done! (in 0.05888391 secs)
+#> ...Done! (in 0.03623223 secs)
#> Exporting weather in 50000 line chunks:
-#> ...Done! (in 1.362757 secs)
files <- fs::dir_ls(dir, glob = "*.csv.bz2")
new_db <- src_sqlite(fs::path(tmp,"local.sqlite"), create=TRUE)
unark(files, new_db,
streamable_table = streamable_base_csv())
#> Importing airlines.csv.bz2 in 50000 line chunks:
-#> ...Done! (in 0.01182294 secs)
+#> ...Done! (in 0.0100081 secs)
#> Importing airports.csv.bz2 in 50000 line chunks:
-#> ...Done! (in 0.03524399 secs)
+#> ...Done! (in 0.04043198 secs)
#> Importing flights.csv.bz2 in 50000 line chunks:
-#> ...Done! (in 10.80483 secs)
+#> ...Done! (in 8.368721 secs)
#> Importing planes.csv.bz2 in 50000 line chunks:
-#> ...Done! (in 0.05175495 secs)
+#> ...Done! (in 0.04479718 secs)
#> Importing weather.csv.bz2 in 50000 line chunks:
-#> ...Done! (in 0.427536 secs)
arkdb also provides the function streamable_table() so that users can create their own streaming table interfaces. For instance, if you would prefer to use readr methods to read and write tsv files, we could construct the table as follows (streamable_readr_tsv() and streamable_readr_csv() are also shipped inside arkdb for convenience):
stream <-
streamable_table(
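The constructor call is cut off by the diff context above; a sketch of what a readr-based streamable_table() could look like, assuming it takes a read function, a write function, and a file extension (argument handling here is illustrative, not necessarily the vignette’s exact code):
stream <-
  streamable_table(
    function(file, ...) readr::read_tsv(file, ...) %>% as.data.frame(),
    function(x, path, omit_header)
      # readr's append = TRUE suppresses the header row
      readr::write_tsv(x = x, path = path, append = omit_header),
    "tsv"
  )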
@@ -242,15 +255,15 @@
ark(db, dir,
streamable_table = stream)
#> Exporting airlines in 50000 line chunks:
-#> ...Done! (in 0.04044104 secs)
+#> ...Done! (in 0.02484703 secs)
#> Exporting airports in 50000 line chunks:
-#> ...Done! (in 0.032516 secs)
+#> ...Done! (in 0.01741886 secs)
#> Exporting flights in 50000 line chunks:
-#> ...Done! (in 10.92277 secs)
+#> ...Done! (in 8.449388 secs)
#> Exporting planes in 50000 line chunks:
-#> ...Done! (in 0.07039022 secs)
+#> ...Done! (in 0.05046201 secs)
#> Exporting weather in 50000 line chunks:
-#> ...Done! (in 0.6085448 secs)
+#> ...Done! (in 0.5301778 secs)
Note several constraints on this design. The write method must be able to take a generic R connection object (which will allow it to handle the compression methods used, if any), and the read method must be able to take a textConnection object. readr functions handle these cases out of the box, so the above method is easy to write. Also note that the write method must be able to append, i.e. it should write a header when append = FALSE but omit the header when append = TRUE. See the built-in methods for more examples.
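As an illustration of that append convention, a sketch of a write method built on base write.table() (not one of the package’s built-ins):
write_tsv_base <- function(x, path, append = FALSE) {
  # write column names only when starting a new file; omit them when appending
  write.table(x, file = path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = !append, append = append)
}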
vignettes/articles/blog-draft.Rmd
+ blog-draft.Rmd
Over the past summer, I have written two small-ish R packages to address challenges I frequently run up against during the course of my research. Both are challenges with what I will refer to as medium-sized data – not the kind of petabyte scale “big data” which precludes analysis on standard hardware or existing methodology, but large enough that the size alone starts creating problems for certain bits of a typical workflow. More precisely, I will take medium-sized to refer to data that is too large to comfortably fit in memory on most laptops (e.g. on the order of several GB), or data that is merely too large to commit to GitHub. By typical workflow, I mean easily being able to share all parts of analysis publicly or privately with collaborators (or merely different machines, such as my laptop and cloud server) who should be able to reproduce the results with minimal fuss and configuration.
+For data too large to fit into memory, there is already a well-established solution: use an external database to store the data. Thanks to dplyr’s database backends, many R users can adapt their workflow relatively seamlessly, moving from dplyr commands that call in-memory data frames to identical or nearly identical commands that call a database. This all works pretty well when your data is already in a database, but getting it into a database, and then moving the data around so that other people/machines can access it, is not nearly so straightforward. So far, this part of the problem has received relatively little attention.
The reason is that the usual response to this problem is “you’re doing it wrong.” The standard practice in this context is simply not to move the data at all. A central database server, usually with access controlled by password or other credential, can allow multiple users to query the same database directly. Thanks to the magical abstractions of SQL queries and packages such as DBI, the user (aka client) doesn’t need to care about where the database is located, or even which particular backend is used. Moving all that data around can be slow and expensive. Arbitrarily large data can be housed in a central/cloud location and provisioned with enough resources to store everything and process complex queries. Consequently, just about every database backend not only provides a mechanism for doing your SQL / dplyr querying, filtering, joining, etc. on data that cannot fit into memory all at once, but also provides server capabilities for doing so over a network connection, handling secure logins and so forth. Why would you want to do anything else?
The problem with the usual response is that it is often at odds with our original objectives and typical scientific workflows. Setting up a database server can be non-trivial; by which I mean: difficult to automate in a portable/cross-platform manner when working entirely from R. More importantly, it reflects a use case more typical of an industry context than of scientific practice. Individual researchers need to make data available to a global community of scientists who can reproduce results years or decades later, not just to a handful of employees who can be granted authenticated access to a central database. Archiving data as static text files is far more scalable, more cost-effective (storing static files is much cheaper than keeping a database server running), and more future-proof (rapid evolution in database technology is not always backwards compatible), and it simplifies or avoids most of the security issues involved in maintaining a public server. In the scientific context, it almost always makes more sense to move the data after all.
+Scientific data repositories are already built on precisely this model: providing long term storage of files that can be downloaded and analyzed locally. For smaller .csv files, this works pretty well.
We just wanted to access data that was a bit larger than our active memory. There is in fact a widely used solution for this case: the Lite flavors of databases such as SQLite, or my new favorite, MonetDBLite, which provide disk-based storage without the network-connection server model. Using the corresponding R packages, these databases can easily be deployed to store & query data on our local disk.
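For example, a file-backed local database needs nothing more than a DBI connection; a minimal sketch assuming the RSQLite package and the nycflights13 data used earlier:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "local-data.sqlite")  # the database lives in a single file on disk
dbWriteTable(con, "flights", nycflights13::flights)
dplyr::tbl(con, "flights")   # query lazily via dplyr without loading the table into memory
dbDisconnect(con)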
vignettes/articles/noapi.Rmd
+ noapi.Rmd
Over the past summer, I have written two small-ish R packages to address challenges I frequently run up against during the course of my research. Both are challenges with what I will refer to as medium-sized data – not the kind of petabyte scale “big data” which precludes analysis on standard hardware or existing methodology, but large enough that the size alone starts creating problems for certain bits of a typical workflow. More precisely, I will take medium-sized to refer to data that is too large to comfortably fit in memory on most laptops (e.g. on the order of several GB), or data that is merely too large to commit to GitHub. By typical workflow, I mean easily being able to share all parts of analysis publicly or privately with collaborators (or merely different machines, such as my laptop and cloud server) who should be able to reproduce the results with minimal fuss and configuration.
+For data too large to fit into memory, there is already a well-established solution: use an external database to store the data. Thanks to dplyr’s database backends, many R users can adapt their workflow relatively seamlessly, moving from dplyr commands that call in-memory data frames to identical or nearly identical commands that call a database. This all works pretty well when your data is already in a database, but getting it into a database, and then moving the data around so that other people/machines can access it, is not nearly so straightforward. So far, this part of the problem has received relatively little attention.
The reason is that the usual response to this problem is “you’re doing it wrong.” The standard practice in this context is simply not to move the data at all. A central database server, usually with access controlled by password or other credential, can allow multiple users to query the same database directly. Thanks to the magical abstractions of SQL queries and packages such as DBI, the user (aka client) doesn’t need to care about where the database is located, or even which particular backend is used. Moving all that data around can be slow and expensive. Arbitrarily large data can be housed in a central/cloud location and provisioned with enough resources to store everything and process complex queries. Consequently, just about every database backend not only provides a mechanism for doing your SQL / dplyr querying, filtering, joining, etc. on data that cannot fit into memory all at once, but also provides server capabilities for doing so over a network connection, handling secure logins and so forth. Why would you want to do anything else?
The problem with the usual response is that it is often at odds with our original objectives and typical scientific workflows. Setting up a database server can be non-trivial; by which I mean: difficult to automate in a portable/cross-platform manner when working entirely from R. More importantly, it reflects a use case more typical of an industry context than of scientific practice. Individual researchers need to make data available to a global community of scientists who can reproduce results years or decades later, not just to a handful of employees who can be granted authenticated access to a central database. Archiving data as static text files is far more scalable, more cost-effective (storing static files is much cheaper than keeping a database server running), and more future-proof (rapid evolution in database technology is not always backwards compatible), and it simplifies or avoids most of the security issues involved in maintaining a public server. In the scientific context, it almost always makes more sense to move the data after all.
+Scientific data repositories are already built on precisely this model: providing long term storage of files that can be downloaded and analyzed locally. For smaller .csv files, this works pretty well.
We just wanted to access data that was a bit larger than our active memory. There is in fact a widely used solution for this case: the Lite flavors of databases such as SQLite, or my new favorite, MonetDBLite, which provide disk-based storage without the network-connection server model. Using the corresponding R packages, these databases can easily be deployed to store & query data on our local disk.