From df2be02ab449411599b7569722063f5791ed0175 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Thu, 6 Oct 2022 12:25:02 -0700 Subject: [PATCH 1/7] Update news --- r/NEWS.md | 43 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 42 insertions(+), 1 deletion(-) diff --git a/r/NEWS.md b/r/NEWS.md index c0bad9458d1..3f0bae56bed 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -17,7 +17,48 @@ under the License. --> -# arrow 9.0.0.9000 +# arrow 10.0.0 +## Arrays and tables + +`as_arrow_array()` can now take `blob::blob` and `?vctrs::list_of`, which +convert to binary and list arrays, respectively. Also fixed issue where +`as_arrow_array()` ignored type argument when passed a `StructArray`. + +The `unique()` function works on `?Table`, `?RecordBatch`, `?Dataset`, and +`?RecordBatchReader`. + +## Arrow dplyr queries + +Several new functions can be used in queries: + +* `dplyr::across()` can be used to apply the same computation across multiple + columns; +* `add_filename()` can be used to get the filename a row came from (only + available when querying `?Dataset`); +* Five functions in the `slice_*` family: `dplyr::slice_min()`, + `dplyr::slice_max()`, `dplyr::slice_head()`, `dplyr::slice_tail()`, and + `dplyr::slice_sample()`. + +A full list of functions available in queries is available at `?acero`. + +A few new features and bugfixes were implemented for joins. +Extension arrays are now supported in joins, allowing, for example, joining +datasets that contain [geoarrow](https://paleolimbot.github.io/geoarrow/) data. +The `keep` argument is now supported, allowing separate columns for the left +and right hand side join keys in join output. Full joins now coalesce the +join keys (when `keep = FALSE`), avoiding the issue where the join keys would +be all `NA` for rows in the right hand side without any matches on the left. + +A few breaking changes: Calling `dplyr::pull()` will return a `?ChunkedArray` +instead of an R vector. 
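A minimal sketch of the new `dplyr::pull()` behavior (hypothetical data; `as.vector()` shown as one way to recover a plain R vector — not the only option):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(x = c(1.5, 2.5, 3.5))

# pull() now returns a ChunkedArray rather than an R vector
chunks <- tbl %>% pull(x)

# Convert explicitly when a plain R vector is needed
vec <- as.vector(chunks)
```
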
Calling `dplyr::compute()` on a query that is grouped +returns a `?Table`, instead of an query object. + +Finally, long-running queries can now be cancelled and will abort their +computation immediately. + +## Reading and writing + +`write_feather()` can take `FALSE` to choose writing uncompressed files. # arrow 9.0.0 From e6a83bc441eb1310474c03847824a98542ec344b Mon Sep 17 00:00:00 2001 From: Will Jones Date: Mon, 17 Oct 2022 12:07:35 -0700 Subject: [PATCH 2/7] chore: add installation instructions --- r/NEWS.md | 32 +++++++++++++++++++++++--------- 1 file changed, 23 insertions(+), 9 deletions(-) diff --git a/r/NEWS.md b/r/NEWS.md index 3f0bae56bed..3301761f03d 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -17,15 +17,7 @@ under the License. --> -# arrow 10.0.0 -## Arrays and tables - -`as_arrow_array()` can now take `blob::blob` and `?vctrs::list_of`, which -convert to binary and list arrays, respectively. Also fixed issue where -`as_arrow_array()` ignored type argument when passed a `StructArray`. - -The `unique()` function works on `?Table`, `?RecordBatch`, `?Dataset`, and -`?RecordBatchReader`. +# arrow 9.0.0.9000 ## Arrow dplyr queries @@ -56,10 +48,32 @@ returns a `?Table`, instead of an query object. Finally, long-running queries can now be cancelled and will abort their computation immediately. +## Arrays and tables + +`as_arrow_array()` can now take `blob::blob` and `?vctrs::list_of`, which +convert to binary and list arrays, respectively. Also fixed issue where +`as_arrow_array()` ignored type argument when passed a `StructArray`. + +The `unique()` function works on `?Table`, `?RecordBatch`, `?Dataset`, and +`?RecordBatchReader`. + ## Reading and writing `write_feather()` can take `FALSE` to choose writing uncompressed files. +## Installation + +As of version 10.0.0, `arrow` requires C++17 to build. This means that: + +* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support + R 3.6. 
+* On CentOS 7, you can build the latest version of `arrow`, + but you first need to install a newer compiler than the default system compiler, + gcc 4.8. See `vignette("install", package = "arrow")` for guidance. + Note that you only need the newer compiler to build `arrow`: + installing a binary package, as from RStudio Package Manager, + or loading a package you've already installed works fine with the system defaults. + # arrow 9.0.0 ## Arrow dplyr queries From 7cf55f26241ed7905edb0dee4162a8290c7ba8e1 Mon Sep 17 00:00:00 2001 From: Will Jones Date: Mon, 17 Oct 2022 12:46:08 -0700 Subject: [PATCH 3/7] Apply suggestions from code review Co-authored-by: Dewey Dunnington --- r/NEWS.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/r/NEWS.md b/r/NEWS.md index 3301761f03d..0e04b827b48 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -27,7 +27,7 @@ Several new functions can be used in queries: columns; * `add_filename()` can be used to get the filename a row came from (only available when querying `?Dataset`); -* Five functions in the `slice_*` family: `dplyr::slice_min()`, +* Added five functions in the `slice_*` family: `dplyr::slice_min()`, `dplyr::slice_max()`, `dplyr::slice_head()`, `dplyr::slice_tail()`, and `dplyr::slice_sample()`. @@ -51,7 +51,7 @@ computation immediately. ## Arrays and tables `as_arrow_array()` can now take `blob::blob` and `?vctrs::list_of`, which -convert to binary and list arrays, respectively. Also fixed issue where +convert to binary and list arrays, respectively. Also fixed an issue where `as_arrow_array()` ignored type argument when passed a `StructArray`. 
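The conversions above can be sketched as follows (assuming the `blob` and `vctrs` packages are installed; the inputs are made-up illustrations):

```r
library(arrow)

# A vctrs list_of vector converts to a list array
as_arrow_array(vctrs::list_of(1:3, 4:5))

# A blob vector converts to a binary array
as_arrow_array(blob::blob(as.raw(1:3)))
```
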
The `unique()` function works on `?Table`, `?RecordBatch`, `?Dataset`, and From 0a7f7504cb0231bed90c789ddab4ec6b9045460d Mon Sep 17 00:00:00 2001 From: Will Jones Date: Mon, 17 Oct 2022 14:08:17 -0700 Subject: [PATCH 4/7] docs: note the breaking change in IPC datasets --- r/NEWS.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/r/NEWS.md b/r/NEWS.md index 0e04b827b48..4172173d057 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -61,6 +61,10 @@ The `unique()` function works on `?Table`, `?RecordBatch`, `?Dataset`, and `write_feather()` can take `FALSE` to choose writing uncompressed files. +Also, a breaking change for IPC files in `write_dataset()`: passing +`"ipc"` or `"feather"` to `format` will now write files with `.arrow` +extension instead of `.feather`. + ## Installation As of version 10.0.0, `arrow` requires C++17 to build. This means that: From 38bd2227d4d1795b0694c81f0b191f761a0e23bb Mon Sep 17 00:00:00 2001 From: Will Jones Date: Tue, 18 Oct 2022 09:25:50 -0700 Subject: [PATCH 5/7] Update r/NEWS.md --- r/NEWS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/NEWS.md b/r/NEWS.md index 4172173d057..f5a272c81ad 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -63,7 +63,7 @@ The `unique()` function works on `?Table`, `?RecordBatch`, `?Dataset`, and Also, a breaking change for IPC files in `write_dataset()`: passing `"ipc"` or `"feather"` to `format` will now write files with `.arrow` -extension instead of `.feather`. +extension instead of `.ipc` or `.feather`. 
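A small illustration of the extension change (a sketch using a temporary directory, not a definitive recipe):

```r
library(arrow)

path <- tempfile()
write_dataset(mtcars, path, format = "feather")

# Data files in the dataset directory now carry the .arrow extension
list.files(path, recursive = TRUE)
```
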
## Installation From 6131563910eef0926a652cdb730ac4053da3a96f Mon Sep 17 00:00:00 2001 From: Will Jones Date: Tue, 18 Oct 2022 10:33:19 -0700 Subject: [PATCH 6/7] Apply suggestions from code review --- r/NEWS.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/r/NEWS.md b/r/NEWS.md index f5a272c81ad..4dcefc14195 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -31,7 +31,7 @@ Several new functions can be used in queries: `dplyr::slice_max()`, `dplyr::slice_head()`, `dplyr::slice_tail()`, and `dplyr::slice_sample()`. -A full list of functions available in queries is available at `?acero`. +For a full list of functions available in queries see `?acero`. A few new features and bugfixes were implemented for joins. Extension arrays are now supported in joins, allowing, for example, joining @@ -43,7 +43,7 @@ be all `NA` for rows in the right hand side without any matches on the left. A few breaking changes: Calling `dplyr::pull()` will return a `?ChunkedArray` instead of an R vector. Calling `dplyr::compute()` on a query that is grouped -returns a `?Table`, instead of an query object. +returns a `?Table`, instead of a query object. Finally, long-running queries can now be cancelled and will abort their computation immediately. 
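The query features above can be sketched as follows (hypothetical example built on `mtcars`; note `with_ties = FALSE`, since the dplyr default of `with_ties = TRUE` is assumed here not to be supported by the engine):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(mtcars)

# slice_max() is evaluated by the query engine; collect() brings results into R
tbl %>%
  slice_max(mpg, n = 3, with_ties = FALSE) %>%
  collect()

# compute() on a grouped query now materializes a Table,
# rather than returning a query object
grouped <- tbl %>%
  group_by(cyl) %>%
  compute()
```
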
From 7cfabc6862c5e46e4ff293a54b8f222a73b89369 Mon Sep 17 00:00:00 2001 From: Neal Richardson Date: Thu, 20 Oct 2022 09:39:22 -0400 Subject: [PATCH 7/7] NEWS revisions and update acero docs --- r/NEWS.md | 44 ++++++++++++++++++++++++------------------- r/R/dplyr-funcs-doc.R | 2 +- r/man/acero.Rd | 2 +- 3 files changed, 27 insertions(+), 21 deletions(-) diff --git a/r/NEWS.md b/r/NEWS.md index 4dcefc14195..e7dcee6b9d2 100644 --- a/r/NEWS.md +++ b/r/NEWS.md @@ -21,37 +21,43 @@ ## Arrow dplyr queries -Several new functions can be used in queries: +Several new functions can be used in queries: -* `dplyr::across()` can be used to apply the same computation across multiple - columns; -* `add_filename()` can be used to get the filename a row came from (only +* `dplyr::across()` can be used to apply the same computation across multiple + columns, and the `where()` selection helper is supported in `across()`; +* `add_filename()` can be used to get the filename a row came from (only available when querying `?Dataset`); -* Added five functions in the `slice_*` family: `dplyr::slice_min()`, +* Added five functions in the `slice_*` family: `dplyr::slice_min()`, `dplyr::slice_max()`, `dplyr::slice_head()`, `dplyr::slice_tail()`, and `dplyr::slice_sample()`. -For a full list of functions available in queries see `?acero`. +The package now has documentation that lists all `dplyr` methods and R function +mappings that are supported on Arrow data, along with notes about any +differences in functionality between queries evaluated in R versus in Acero, the +Arrow query engine. See `?acero`. -A few new features and bugfixes were implemented for joins. -Extension arrays are now supported in joins, allowing, for example, joining -datasets that contain [geoarrow](https://paleolimbot.github.io/geoarrow/) data. -The `keep` argument is now supported, allowing separate columns for the left -and right hand side join keys in join output. 
Full joins now coalesce the -join keys (when `keep = FALSE`), avoiding the issue where the join keys would -be all `NA` for rows in the right hand side without any matches on the left. +A few new features and bugfixes were implemented for joins: -A few breaking changes: Calling `dplyr::pull()` will return a `?ChunkedArray` -instead of an R vector. Calling `dplyr::compute()` on a query that is grouped -returns a `?Table`, instead of a query object. +* Extension arrays are now supported in joins, allowing, for example, joining + datasets that contain [geoarrow](https://paleolimbot.github.io/geoarrow/) data. +* The `keep` argument is now supported, allowing separate columns for the left + and right hand side join keys in join output. Full joins now coalesce the + join keys (when `keep = FALSE`), avoiding the issue where the join keys would + be all `NA` for rows in the right hand side without any matches on the left. -Finally, long-running queries can now be cancelled and will abort their +A few breaking changes that improve the consistency of the API: + +* Calling `dplyr::pull()` will return a `?ChunkedArray` instead of an R vector. +* Calling `dplyr::compute()` on a query that is grouped + returns a `?Table`, instead of a query object. + +Finally, long-running queries can now be cancelled and will abort their computation immediately. ## Arrays and tables `as_arrow_array()` can now take `blob::blob` and `?vctrs::list_of`, which -convert to binary and list arrays, respectively. Also fixed an issue where +convert to binary and list arrays, respectively. Also fixed an issue where `as_arrow_array()` ignored type argument when passed a `StructArray`. The `unique()` function works on `?Table`, `?RecordBatch`, `?Dataset`, and @@ -59,7 +65,7 @@ The `unique()` function works on `?Table`, `?RecordBatch`, `?Dataset`, and ## Reading and writing -`write_feather()` can take `FALSE` to choose writing uncompressed files. 
+`write_feather()` can take `compression = FALSE` to choose writing uncompressed files. Also, a breaking change for IPC files in `write_dataset()`: passing `"ipc"` or `"feather"` to `format` will now write files with `.arrow` diff --git a/r/R/dplyr-funcs-doc.R b/r/R/dplyr-funcs-doc.R index e1aaa2e12fd..eb0f5822017 100644 --- a/r/R/dplyr-funcs-doc.R +++ b/r/R/dplyr-funcs-doc.R @@ -83,7 +83,7 @@ #' Functions can be called either as `pkg::fun()` or just `fun()`, i.e. both #' `str_sub()` and `stringr::str_sub()` work. #' -#' In addition to these functions, you can call any of Arrow's 244 compute +#' In addition to these functions, you can call any of Arrow's 243 compute #' functions directly. Arrow has many functions that don't map to an existing R #' function. In other cases where there is an R function mapping, you can still #' call the Arrow function directly if you don't want the adaptations that the R diff --git a/r/man/acero.Rd b/r/man/acero.Rd index 45afebd336b..d340c2cbd8e 100644 --- a/r/man/acero.Rd +++ b/r/man/acero.Rd @@ -68,7 +68,7 @@ can assume that the function works in Acero just as it does in R. Functions can be called either as \code{pkg::fun()} or just \code{fun()}, i.e. both \code{str_sub()} and \code{stringr::str_sub()} work. -In addition to these functions, you can call any of Arrow's 244 compute +In addition to these functions, you can call any of Arrow's 243 compute functions directly. Arrow has many functions that don't map to an existing R function. In other cases where there is an R function mapping, you can still call the Arrow function directly if you don't want the adaptations that the R