-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3814: [R] RecordBatch$from_arrays() #3565
ARROW-3814: [R] RecordBatch$from_arrays() #3565
Conversation
I suppose the situation is similar for the |
Codecov Report
@@ Coverage Diff @@
## master #3565 +/- ##
===========================================
- Coverage 87.94% 76.92% -11.03%
===========================================
Files 737 51 -686
Lines 81709 1976 -79733
Branches 1253 0 -1253
===========================================
- Hits 71863 1520 -70343
+ Misses 9599 456 -9143
+ Partials 247 0 -247
Continue to review full report at Codecov.
|
proposal for the
library(arrow, warn.conflicts = FALSE)
library(purrr)
batch <- record_batch(x = 1:2, y = letters[1:2])
# variable number of batches
tab <- table(batch, batch, batch)
tab
#> arrow::Table
as_tibble(tab)
#> # A tibble: 6 x 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 1 a
#> 4 2 b
#> 5 1 a
#> 6 2 b
# splicing support
batches <- map(1:10, ~record_batch(x = ., y = letters[.]))
tab <- table(!!!batches)
tab
#> arrow::Table
as_tibble(tab)
#> # A tibble: 10 x 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
#> 7 7 g
#> 8 8 h
#> 9 9 i
#> 10 10 j
library(arrow, warn.conflicts = FALSE)
a <- array(rnorm(10))
tab <- table(x = 1:10, y = letters[1:10], z = a)
tab$schema
#> arrow::Schema
#> x: int32
#> y: string
#> z: double
as_tibble(tab)
#> # A tibble: 10 x 3
#> x y z
#> <int> <chr> <dbl>
#> 1 1 a 1.68
#> 2 2 b 1.61
#> 3 3 c 0.879
#> 4 4 d 0.315
#> 5 5 e 0.877
#> 6 6 f -1.28
#> 7 7 g 0.827
#> 8 8 h 0.494
#> 9 9 i 1.60
#> 10 10 j -1.66
# supports splicing too, e.g.
tab <- table(
row.number = 1:150, # R integer vector -> converted to an Array
!!!iris, # columns of iris are spliced, each is converted to an array
arr = array(rnorm(150)) # an Array already
)
tab$schema
#> arrow::Schema
#> row.number: int32
#> Sepal.Length: double
#> Sepal.Width: double
#> Petal.Length: double
#> Petal.Width: double
#> Species: dictionary<values=string, indices=int8, ordered=0>
#> arr: double
as_tibble(tab)
#> # A tibble: 150 x 7
#> row.number Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 5.1 3.5 1.4 0.2 setosa
#> 2 2 4.9 3 1.4 0.2 setosa
#> 3 3 4.7 3.2 1.3 0.2 setosa
#> 4 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 5 3.6 1.4 0.2 setosa
#> 6 6 5.4 3.9 1.7 0.4 setosa
#> 7 7 4.6 3.4 1.4 0.3 setosa
#> 8 8 5 3.4 1.5 0.2 setosa
#> 9 9 4.4 2.9 1.4 0.2 setosa
#> 10 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows, and 1 more variable: arr <dbl> |
7ab2a73
to
046ce90
Compare
Rebased. |
@xhochy was there a special need to rebase ? Just curious. |
I rebased master on the release commit of the JavaScript 0.4.0 release. We sadly only add the tagging commits to master once the release vote has passed. Then we need to rebase all PRs as we keep merging patches while the release vote runs. |
046ce90
to
cc30604
Compare
ef8b3ef
to
0898dde
Compare
0898dde
to
2326d85
Compare
I think this is ready now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments but this looks like a nice improvement. I think this can be merged after some small fixes /c comments
Thanks. I’ll deal with those in the morning. |
@romainfrancois this needs to be rebased now after the lint fixes, sorry about that |
4891ac6
to
8283a94
Compare
@@ -33,4 +33,4 @@ CPPLINT=$CPP_BUILD_SUPPORT/cpplint.py | |||
$CPP_BUILD_SUPPORT/run_cpplint.py \ | |||
--cpplint_binary=$CPPLINT \ | |||
--exclude_glob=$CPP_BUILD_SUPPORT/lint_exclusions.txt \ | |||
--source_dir=$SOURCE_DIR/src --quiet $1 | |||
--source_dir=$SOURCE_DIR/src --quiet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we need to pass $1 to run_cpp_lint.py
too ?
Removing it here allows to use
./r/lint.sh --fix
and let the tool fix the format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay
d42da88
to
f51801d
Compare
f51801d
to
7f656ea
Compare
7f656ea
to
74a1e6b
Compare
Sounds good |
Needs discussion I guess
This will make it easier to do in parallel later.
…eplaced by RecordBatch__from_arrays()
- list of record batches - list of - arrays - chunked arrays - columns - r vectors
./r/lint.sh --fix
74a1e6b
to
ab0cd16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -33,4 +33,4 @@ CPPLINT=$CPP_BUILD_SUPPORT/cpplint.py | |||
$CPP_BUILD_SUPPORT/run_cpplint.py \ | |||
--cpplint_binary=$CPPLINT \ | |||
--exclude_glob=$CPP_BUILD_SUPPORT/lint_exclusions.txt \ | |||
--source_dir=$SOURCE_DIR/src --quiet $1 | |||
--source_dir=$SOURCE_DIR/src --quiet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay
expect_equal(s, batch$schema) | ||
|
||
s <- schema(x = int32(), y = utf8()) | ||
expect_error(record_batch(x = 1:10, y = 1:10, schema = s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If schema
were a column name I guess you would have to pass the arguments differently
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3565 +/- ##
===========================================
- Coverage 87.94% 76.92% -11.03%
===========================================
Files 737 51 -686
Lines 81709 1976 -79733
Branches 1253 0 -1253
===========================================
- Hits 71863 1520 -70343
+ Misses 9599 456 -9143
+ Partials 247 0 -247 ☔ View full report in Codecov by Sentry. |
This started out as an implementation of
RecordBatch$from_arrays()
(i.e. https://issues.apache.org/jira/browse/ARROW-3814?filter=12344983) but now looks more like this issue: https://issues.apache.org/jira/browse/ARROW-3815?filter=12344983The idea being that the record batch factory
record_batch()
would work with...
andschema
, where each thing in the...
could be:arrow::Array
array()
So where we had this before:
we would now have:
We would still be able to start from a data frame, via splicing, e.g.:
So there would be no need for a
RecordBatch$fromArray()
method.