Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3814: [R] RecordBatch$from_arrays() #3565

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
9a0f996
schema() supports tidy dots splicing, using rlang::list2
romainfrancois Feb 5, 2019
cd03e19
+ list_to_shared_ptr_vector
romainfrancois Feb 5, 2019
d49906a
Change record_batch() api so that it takes ... and schema.
romainfrancois Feb 5, 2019
c71d872
update docs
romainfrancois Feb 5, 2019
20c5ce6
move the logic of `RecordBatch__from_arrays` internally.
romainfrancois Feb 6, 2019
c5ad626
retire RecordBatch__from_dataframe() function, no longer needed and r…
romainfrancois Feb 6, 2019
c00774a
table() factory also handles ... and !!! a schema, similar to record_…
romainfrancois Feb 6, 2019
41b496f
table(...) cab now either handle ... being:
romainfrancois Feb 7, 2019
7e8a4b7
test for table(...<batches>)
romainfrancois Feb 7, 2019
68744f5
tests for table(...<vectors, arrays, chunked arrays>)
romainfrancois Feb 7, 2019
d8e627f
use the schema= argument in table()
romainfrancois Feb 8, 2019
d958108
record_batch(..., schema = )
romainfrancois Mar 6, 2019
fc885fd
directly return from builder_->Finish(), as suggested here: https://g…
romainfrancois Mar 6, 2019
c04f904
tests about record_batch(schema=) argument
romainfrancois Mar 6, 2019
4574942
STOP_IF migth be useful too
romainfrancois Mar 6, 2019
efd84c5
record_batch(schema=) compares names
romainfrancois Mar 6, 2019
eed535f
add comments about !!!
romainfrancois Mar 7, 2019
2362ea0
typo
romainfrancois Mar 7, 2019
f27dcb9
Also run cpplint and clang-format on .cpp files
wesm Feb 28, 2019
6ea0778
rebase
romainfrancois Mar 8, 2019
ab0cd16
only pass $1 to run_clang_format so that we can do:
romainfrancois Mar 8, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 8 additions & 8 deletions r/R/RcppExports.R

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 6 additions & 3 deletions r/R/RecordBatch.R
Original file line number Diff line number Diff line change
Expand Up @@ -91,10 +91,13 @@

#' Create an [arrow::RecordBatch][arrow__RecordBatch] from a data frame
#'
#' @param .data a data frame
#' @param ... A variable number of arrow::Array
#' @param schema a arrow::Schema
#'
#' @return a [arrow::RecordBatch][arrow__RecordBatch]
#' @export
record_batch <- function(.data){
shared_ptr(`arrow::RecordBatch`, RecordBatch__from_dataframe(.data))
record_batch <- function(..., schema = NULL){
arrays <- tibble::lst(...)
stopifnot(length(arrays) > 0)
shared_ptr(`arrow::RecordBatch`, RecordBatch__from_arrays(schema, arrays))
}
2 changes: 1 addition & 1 deletion r/R/Schema.R
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@
#'
#' @export
schema <- function(...){
shared_ptr(`arrow::Schema`, schema_(.fields(list(...))))
shared_ptr(`arrow::Schema`, schema_(.fields(list2(...))))
}

#' read a Schema from a stream
Expand Down
11 changes: 8 additions & 3 deletions r/R/Table.R
Original file line number Diff line number Diff line change
Expand Up @@ -53,11 +53,16 @@

#' Create an arrow::Table from a data frame
#'
#' @param .data a data frame
#' @param ... arrays, chunked arrays, or R vectors
#' @param schema a schema. The default (`NULL`) infers the schema from the `...`
#'
#' @return an arrow::Table
#'
#' @export
table <- function(.data){
shared_ptr(`arrow::Table`, Table__from_dataframe(.data))
table <- function(..., schema = NULL){
dots <- tibble::lst(...)
stopifnot(length(dots) > 0)
shared_ptr(`arrow::Table`, Table__from_dots(dots, schema))
}

#' @export
Expand Down
7 changes: 6 additions & 1 deletion r/R/feather.R
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,12 @@ write_feather.default <- function(data, stream) {

#' @export
write_feather.data.frame <- function(data, stream) {
write_feather(record_batch(data), stream)
# splice the columns in the record_batch() call
# e.g. if we had data <- data.frame(x = <...>, y = <...>)
# then record_batch(!!!data) is the same as
# record_batch(x = data$x, y = data$y)
# see ?rlang::list2()
write_feather(record_batch(!!!data), stream)
romainfrancois marked this conversation as resolved.
Show resolved Hide resolved
}

#' @method write_feather arrow::RecordBatch
Expand Down
5 changes: 4 additions & 1 deletion r/R/write_arrow.R
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,10 @@ to_arrow <- function(x) {

`to_arrow.arrow::RecordBatch` <- function(x) x
`to_arrow.arrow::Table` <- function(x) x
`to_arrow.data.frame` <- function(x) table(x)

# splice the data frame as arguments of table()
# see ?rlang::list2()
`to_arrow.data.frame` <- function(x) table(!!!x)
romainfrancois marked this conversation as resolved.
Show resolved Hide resolved

#' serialize an [arrow::Table][arrow__Table], an [arrow::RecordBatch][arrow__RecordBatch], or a
#' data frame to either the streaming format or the binary file format
Expand Down
2 changes: 1 addition & 1 deletion r/lint.sh
Original file line number Diff line number Diff line change
Expand Up @@ -33,4 +33,4 @@ CPPLINT=$CPP_BUILD_SUPPORT/cpplint.py
$CPP_BUILD_SUPPORT/run_cpplint.py \
--cpplint_binary=$CPPLINT \
--exclude_glob=$CPP_BUILD_SUPPORT/lint_exclusions.txt \
--source_dir=$SOURCE_DIR/src --quiet $1
--source_dir=$SOURCE_DIR/src --quiet
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we need to pass $1 to run_cpp_lint.py too ?

Removing it here allows to use

./r/lint.sh --fix

and let the tool fix the format

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay

6 changes: 4 additions & 2 deletions r/man/record_batch.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 4 additions & 2 deletions r/man/table.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

50 changes: 26 additions & 24 deletions r/src/RcppExports.cpp

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

7 changes: 3 additions & 4 deletions r/src/array_from_vector.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -270,8 +270,7 @@ class VectorConverter {
virtual Status Ingest(SEXP obj) = 0;

virtual Status GetResult(std::shared_ptr<arrow::Array>* result) {
RETURN_NOT_OK(builder_->Finish(result));
return Status::OK();
return builder_->Finish(result);
}

ArrayBuilder* builder() const { return builder_; }
Expand Down Expand Up @@ -299,7 +298,7 @@ struct Unbox<Type, enable_if_integer<Type>> {
return IngestRange<int64_t>(builder, reinterpret_cast<int64_t*>(REAL(obj)),
XLENGTH(obj), NA_INT64);
}
// TODO: handle aw and logical
// TODO: handle raw and logical
default:
break;
}
Expand Down Expand Up @@ -881,7 +880,7 @@ std::shared_ptr<Array> MakeSimpleArray(SEXP x) {
}

auto data = ArrayData::Make(std::make_shared<Type>(), LENGTH(x), std::move(buffers),
null_count, 0);
null_count, 0 /*offset*/);

// return the right Array class
return std::make_shared<typename TypeTraits<Type>::ArrayType>(data);
Expand Down
24 changes: 19 additions & 5 deletions r/src/arrow_types.h
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@

#include <limits>
#include <memory>
#include <vector>

#include <RcppCommon.h>

Expand All @@ -35,11 +36,12 @@
#include <arrow/type.h>
#include <arrow/util/compression.h>

#define STOP_IF_NOT(TEST, MSG) \
do { \
if (!(TEST)) Rcpp::stop(MSG); \
#define STOP_IF(TEST, MSG) \
do { \
if (TEST) Rcpp::stop(MSG); \
} while (0)

#define STOP_IF_NOT(TEST, MSG) STOP_IF(!(TEST), MSG)
#define STOP_IF_NOT_OK(s) STOP_IF_NOT(s.ok(), s.ToString())

template <typename T>
Expand Down Expand Up @@ -178,8 +180,7 @@ inline constexpr Rbyte default_value<RAWSXP>() {
SEXP ChunkedArray__as_vector(const std::shared_ptr<arrow::ChunkedArray>& chunked_array);
SEXP Array__as_vector(const std::shared_ptr<arrow::Array>& array);
std::shared_ptr<arrow::Array> Array__from_vector(SEXP x, SEXP type);
std::shared_ptr<arrow::RecordBatch> RecordBatch__from_dataframe(Rcpp::DataFrame tbl);
std::shared_ptr<arrow::DataType> Array__infer_type(SEXP x);
std::shared_ptr<arrow::RecordBatch> RecordBatch__from_arrays(SEXP, SEXP);

namespace arrow {
namespace r {
Expand Down Expand Up @@ -207,5 +208,18 @@ inline std::shared_ptr<T> extract(SEXP x) {
return Rcpp::ConstReferenceSmartPtrInputParameter<std::shared_ptr<T>>(x);
}

template <typename T>
std::vector<std::shared_ptr<T>> list_to_shared_ptr_vector(SEXP lst) {
R_xlen_t n = XLENGTH(lst);
std::vector<std::shared_ptr<T>> res(n);
for (R_xlen_t i = 0; i < n; i++) {
res[i] = extract<T>(VECTOR_ELT(lst, i));
}
return res;
}

std::shared_ptr<arrow::Array> Array__from_vector(
SEXP x, const std::shared_ptr<arrow::DataType>& type, bool type_infered);

} // namespace r
} // namespace arrow
Loading