ARROW-3814: [R] RecordBatch$from_arrays() #3565

romainfrancois · 2019-02-05T15:56:52Z

This started out as an implementation of RecordBatch$from_arrays() (i.e. https://issues.apache.org/jira/browse/ARROW-3814?filter=12344983) but now looks more like this issue: https://issues.apache.org/jira/browse/ARROW-3815?filter=12344983

The idea is that the record batch factory record_batch() would take ... and schema, where each element of ... can be:

an arrow::Array
an R vector that can be converted to an array using array()

So where we had this before:

record_batch(tibble::tibble(x = 1:10, y = 1:10))

we would now have:

record_batch(x = 1:10, y = 1:10)

We would still be able to start from a data frame, via splicing, e.g.:

tbl <- tibble::tibble(x = 1:10, y = 1:10)
record_batch(!!!tbl)

So there would be no need for a RecordBatch$from_arrays() method.

romainfrancois · 2019-02-05T16:11:17Z

I suppose the situation is similar for the table() factory

codecov-io · 2019-02-05T16:50:13Z

Codecov Report

Merging #3565 into master will decrease coverage by 11.02%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master    #3565       +/-   ##
===========================================
- Coverage   87.94%   76.92%   -11.03%     
===========================================
  Files         737       51      -686     
  Lines       81709     1976    -79733     
  Branches     1253        0     -1253     
===========================================
- Hits        71863     1520    -70343     
+ Misses       9599      456     -9143     
+ Partials      247        0      -247

Impacted Files	Coverage Δ
src/table.cpp	`64.17% <0%> (-4.25%)`	⬇️
src/array_from_vector.cpp	`78.29% <0%> (-0.05%)`	⬇️
R/write_arrow.R	`96.29% <0%> (ø)`	⬆️
R/feather.R	`58.33% <0%> (ø)`	⬆️
cpp/src/arrow/csv/chunker-test.cc
cpp/src/parquet/column_page.h
cpp/src/parquet/bloom_filter-test.cc
cpp/src/arrow/array/builder_decimal.cc
cpp/src/plasma/client.cc
cpp/src/arrow/io/test-common.h
... and 685 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5025126...ab0cd16.

romainfrancois · 2019-02-07T13:12:02Z

Proposal for the table() function, which might be renamed Table(). The idea is that the function can handle two cases:

a variable number of record batches, which results in a call to arrow::Table::FromRecordBatches:

library(arrow, warn.conflicts = FALSE)
library(purrr)

batch <- record_batch(x = 1:2, y = letters[1:2])

# variable number of batches
tab <- table(batch, batch, batch)
tab
#> arrow::Table
as_tibble(tab)
#> # A tibble: 6 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     1 a    
#> 4     2 b    
#> 5     1 a    
#> 6     2 b

# splicing support
batches <- map(1:10, ~record_batch(x = ., y = letters[.]))
tab <- table(!!!batches)
tab
#> arrow::Table
as_tibble(tab)
#> # A tibble: 10 x 2
#>        x y    
#>    <int> <chr>
#>  1     1 a    
#>  2     2 b    
#>  3     3 c    
#>  4     4 d    
#>  5     5 e    
#>  6     6 f    
#>  7     7 g    
#>  8     8 h    
#>  9     9 i    
#> 10    10 j

a named list of R vectors, Arrow arrays or chunked arrays, e.g.:

library(arrow, warn.conflicts = FALSE)
a <- array(rnorm(10))
tab <- table(x = 1:10, y = letters[1:10], z = a)
tab$schema
#> arrow::Schema 
#> x: int32
#> y: string
#> z: double
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x y          z
#>    <int> <chr>  <dbl>
#>  1     1 a      1.68 
#>  2     2 b      1.61 
#>  3     3 c      0.879
#>  4     4 d      0.315
#>  5     5 e      0.877
#>  6     6 f     -1.28 
#>  7     7 g      0.827
#>  8     8 h      0.494
#>  9     9 i      1.60 
#> 10    10 j     -1.66

# supports splicing too, e.g. 
tab <- table(
  row.number = 1:150,       # R integer vector -> converted to an Array
  !!!iris,                  # columns of iris are spliced, each is converted to an array
  arr = array(rnorm(150))   # an Array already
)
tab$schema
#> arrow::Schema 
#> row.number: int32
#> Sepal.Length: double
#> Sepal.Width: double
#> Petal.Length: double
#> Petal.Width: double
#> Species: dictionary<values=string, indices=int8, ordered=0>
#> arr: double
as_tibble(tab)
#> # A tibble: 150 x 7
#>    row.number Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          1          5.1         3.5          1.4         0.2 setosa 
#>  2          2          4.9         3            1.4         0.2 setosa 
#>  3          3          4.7         3.2          1.3         0.2 setosa 
#>  4          4          4.6         3.1          1.5         0.2 setosa 
#>  5          5          5           3.6          1.4         0.2 setosa 
#>  6          6          5.4         3.9          1.7         0.4 setosa 
#>  7          7          4.6         3.4          1.4         0.3 setosa 
#>  8          8          5           3.4          1.5         0.2 setosa 
#>  9          9          4.4         2.9          1.4         0.2 setosa 
#> 10         10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows, and 1 more variable: arr <dbl>

xhochy · 2019-02-08T21:25:13Z

Rebased.

romainfrancois · 2019-02-08T22:25:07Z

@xhochy was there a special need to rebase? Just curious.
I usually only need to rebase after an R-specific PR is squashed.

xhochy · 2019-02-08T22:26:58Z

I rebased master on the release commit of the JavaScript 0.4.0 release.

We sadly only add the tagging commits to master once the release vote has passed. Then we need to rebase all PRs as we keep merging patches while the release vote runs.

romainfrancois · 2019-03-06T16:20:05Z

I think this is ready now.

wesm

Minor comments but this looks like a nice improvement. I think this can be merged after some small fixes /c comments

r/R/feather.R

r/R/write_arrow.R

r/src/recordbatch.cpp

romainfrancois · 2019-03-06T18:10:17Z

Thanks. I’ll deal with those in the morning.

wesm · 2019-03-07T23:41:52Z

@romainfrancois this needs to be rebased now after the lint fixes, sorry about that

romainfrancois · 2019-03-08T08:09:04Z

r/lint.sh

@@ -33,4 +33,4 @@ CPPLINT=$CPP_BUILD_SUPPORT/cpplint.py
 $CPP_BUILD_SUPPORT/run_cpplint.py \
    --cpplint_binary=$CPPLINT \
    --exclude_glob=$CPP_BUILD_SUPPORT/lint_exclusions.txt \
-    --source_dir=$SOURCE_DIR/src --quiet $1
+    --source_dir=$SOURCE_DIR/src --quiet


Did we need to pass $1 to run_cpplint.py too?

Removing it here makes it possible to run

./r/lint.sh --fix

and let the tool fix the formatting.
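
The forwarding discussed here can be sketched as follows. This is a minimal illustration, not the actual r/lint.sh: run_clang_format and run_cpplint are hypothetical stand-in functions for the real scripts, and the idea that the lint step rejects an unexpected --fix argument is an assumption made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: run_clang_format and run_cpplint are hypothetical stand-ins
# for the scripts that r/lint.sh calls; they just print what they would do.

run_clang_format() {
  # Accepts an optional --fix flag, like the formatting step.
  if [ "${1:-}" = "--fix" ]; then
    echo "clang-format: fixing"
  else
    echo "clang-format: checking"
  fi
}

run_cpplint() {
  # Assumption: the lint step has no --fix mode and rejects extra arguments.
  if [ "$#" -gt 0 ]; then
    echo "cpplint: unexpected argument: $1" >&2
    return 1
  fi
  echo "cpplint: checking"
}

lint() {
  # Forward the first argument (if any) to the formatter only.
  run_clang_format ${1:+"$1"} && run_cpplint
}

lint --fix
```

Under these assumptions, forwarding the flag to both steps would make the stand-in lint step fail, which is why only the formatter receives it.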

romainfrancois · 2019-05-31T09:32:57Z

I think this is good to go, but I'd like to merge #4413 first, because this PR would currently fail against the new Dictionary changes from #4316.

wesm · 2019-05-31T14:26:58Z

Sounds good

wesm

+1

wesm · 2019-06-03T16:24:31Z

r/lint.sh

@@ -33,4 +33,4 @@ CPPLINT=$CPP_BUILD_SUPPORT/cpplint.py
 $CPP_BUILD_SUPPORT/run_cpplint.py \
    --cpplint_binary=$CPPLINT \
    --exclude_glob=$CPP_BUILD_SUPPORT/lint_exclusions.txt \
-    --source_dir=$SOURCE_DIR/src --quiet $1
+    --source_dir=$SOURCE_DIR/src --quiet


I think this is okay

wesm · 2019-06-03T16:28:29Z

r/tests/testthat/test-RecordBatch.R

+  expect_equal(s, batch$schema)
+
+  s <- schema(x = int32(), y = utf8())
+  expect_error(record_batch(x = 1:10, y = 1:10, schema = s))


If schema were a column name, I guess you would have to pass the arguments differently.

codecov-commenter · 2024-08-29T01:59:44Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.92%. Comparing base (5025126) to head (ab0cd16).

❗ There is a different number of reports uploaded between BASE (5025126) and HEAD (ab0cd16).

HEAD has 4 uploads fewer than BASE: BASE (5025126) has 5 uploads, HEAD (ab0cd16) has 1.

romainfrancois added the WIP and Component: R labels Feb 5, 2019

romainfrancois mentioned this pull request Feb 7, 2019

ARROW-3818: [R] Table$from_batches #3562

Closed

xhochy force-pushed the master branch from 29f76df to 2b9155a on February 8, 2019 21:00

xhochy force-pushed the ARROW-3814/record_batch_from_arrays branch from 7ab2a73 to 046ce90 on February 8, 2019 21:25

romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 046ce90 to cc30604 on February 13, 2019 08:59

wesm force-pushed the master branch from 3088183 to 0c6b2d2 on February 18, 2019 19:34

romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch 3 times, most recently from ef8b3ef to 0898dde on February 26, 2019 08:39

romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 0898dde to 2326d85 on March 6, 2019 13:39

romainfrancois requested a review from wesm March 6, 2019 16:19

wesm removed the WIP label Mar 6, 2019

wesm reviewed Mar 6, 2019

romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 4891ac6 to 8283a94 on March 8, 2019 08:06

romainfrancois commented Mar 8, 2019

romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch 3 times, most recently from d42da88 to f51801d on March 29, 2019 13:33

kou force-pushed the master branch from 114985c to 57de5c3 on March 31, 2019 20:22

wesm force-pushed the ARROW-3814/record_batch_from_arrays branch from f51801d to 7f656ea on May 30, 2019 18:07

romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 7f656ea to 74a1e6b on May 31, 2019 09:13

romainfrancois and others added 21 commits June 3, 2019 09:09

schema() supports tidy dots splicing, using rlang::list2

9a0f996

+ list_to_shared_ptr_vector

cd03e19

Change record_batch() api so that it takes ... and schema.

d49906a

Needs discussion I guess

update docs

c71d872

move the logic of RecordBatch__from_arrays internally.

20c5ce6

This will make it easier to do in parallel later.

retire RecordBatch__from_dataframe() function, no longer needed and r…

c5ad626

…eplaced by RecordBatch__from_arrays()

table() factory also handles ... and !!! a schema, similar to record_…

c00774a

…batch()

table(...) can now either handle ... being:

41b496f

- list of record batches - list of - arrays - chunked arrays - columns - r vectors

test for table(...<batches>)

7e8a4b7

tests for table(...<vectors, arrays, chunked arrays>)

68744f5

use the schema= argument in table()

d8e627f

record_batch(..., schema = )

d958108

directly return from builder_->Finish(), as suggested here: https://g…

fc885fd

…ithub.com/apache/arrow/pull/3635/files/08b295370271f122b410b991282b4919510b5cea#r261012517

tests about record_batch(schema=) argument

c04f904

STOP_IF might be useful too

4574942

record_batch(schema=) compares names

efd84c5

add comments about !!!

eed535f

typo

2362ea0

Also run cpplint and clang-format on .cpp files

f27dcb9

rebase

6ea0778

only pass $1 to run_clang_format so that we can do:

ab0cd16

./r/lint.sh --fix

romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 74a1e6b to ab0cd16 on June 3, 2019 07:26

wesm approved these changes Jun 3, 2019

wesm closed this in 894b6e7 Jun 3, 2019

nealrichardson mentioned this pull request Jun 7, 2019

Use xenial to fix deprecated boost version in arrow devel build sparklyr/sparklyr#2032

Merged

asfimport mentioned this pull request Jun 3, 2019

[R] RecordBatch$from_arrays() #20252

Closed
