Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3814: [R] RecordBatch$from_arrays() #3565

Conversation

romainfrancois
Copy link
Contributor

This started out as an implementation of RecordBatch$from_arrays() (i.e. https://issues.apache.org/jira/browse/ARROW-3814?filter=12344983) but now looks more like this issue: https://issues.apache.org/jira/browse/ARROW-3815?filter=12344983

The idea being that the record batch factory record_batch() would work with ... and schema, where each thing in the ... could be:

  • an arrow::Array
  • an R vector that can be converted to an array using array()

So where we had this before:

record_batch(tibble::tibble(x = 1:10, y = 1:10))

we would now have:

record_batch(x = 1:10, y = 1:10)

We would still be able to start from a data frame, via splicing, e.g.:

tbl <- tibble::tibble(x = 1:10, y = 1:10)
record_batch(!!!tbl)

So there would be no need for a RecordBatch$fromArray() method.

@romainfrancois
Copy link
Contributor Author

I suppose the situation is similar for the table() factory

@romainfrancois romainfrancois added WIP PR is work in progress Component: R labels Feb 5, 2019
@codecov-io
Copy link

codecov-io commented Feb 5, 2019

Codecov Report

Merging #3565 into master will decrease coverage by 11.02%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #3565       +/-   ##
===========================================
- Coverage   87.94%   76.92%   -11.03%     
===========================================
  Files         737       51      -686     
  Lines       81709     1976    -79733     
  Branches     1253        0     -1253     
===========================================
- Hits        71863     1520    -70343     
+ Misses       9599      456     -9143     
+ Partials      247        0      -247
Impacted Files Coverage Δ
src/table.cpp 64.17% <0%> (-4.25%) ⬇️
src/array_from_vector.cpp 78.29% <0%> (-0.05%) ⬇️
R/write_arrow.R 96.29% <0%> (ø) ⬆️
R/feather.R 58.33% <0%> (ø) ⬆️
cpp/src/arrow/csv/chunker-test.cc
cpp/src/parquet/column_page.h
cpp/src/parquet/bloom_filter-test.cc
cpp/src/arrow/array/builder_decimal.cc
cpp/src/plasma/client.cc
cpp/src/arrow/io/test-common.h
... and 685 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5025126...ab0cd16. Read the comment docs.

@romainfrancois
Copy link
Contributor Author

proposal for the table() function, that might be renamed Table() perhaps. The idea is that the function can handle two cases:

  • a variable list of record batches, which then results to a call to arrow::Table::FromRecordBatches :
library(arrow, warn.conflicts = FALSE)
library(purrr)

batch <- record_batch(x = 1:2, y = letters[1:2])

# variable number of batches
tab <- table(batch, batch, batch)
tab
#> arrow::Table
as_tibble(tab)
#> # A tibble: 6 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     1 a    
#> 4     2 b    
#> 5     1 a    
#> 6     2 b

# splicing support
batches <- map(1:10, ~record_batch(x = ., y = letters[.]))
tab <- table(!!!batches)
tab
#> arrow::Table
as_tibble(tab)
#> # A tibble: 10 x 2
#>        x y    
#>    <int> <chr>
#>  1     1 a    
#>  2     2 b    
#>  3     3 c    
#>  4     4 d    
#>  5     5 e    
#>  6     6 f    
#>  7     7 g    
#>  8     8 h    
#>  9     9 i    
#> 10    10 j
  • a named list of R vectors, R arrays or chunked arrays, e.g.
library(arrow, warn.conflicts = FALSE)
a <- array(rnorm(10))
tab <- table(x = 1:10, y = letters[1:10], z = a)
tab$schema
#> arrow::Schema 
#> x: int32
#> y: string
#> z: double
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x y          z
#>    <int> <chr>  <dbl>
#>  1     1 a      1.68 
#>  2     2 b      1.61 
#>  3     3 c      0.879
#>  4     4 d      0.315
#>  5     5 e      0.877
#>  6     6 f     -1.28 
#>  7     7 g      0.827
#>  8     8 h      0.494
#>  9     9 i      1.60 
#> 10    10 j     -1.66

# supports splicing too, e.g. 
tab <- table(
  row.number = 1:150,       # R integer vector -> converted to an Array
  !!!iris,                  # columns of iris are spliced, each is converted to an array
  arr = array(rnorm(150))   # an Array already
)
tab$schema
#> arrow::Schema 
#> row.number: int32
#> Sepal.Length: double
#> Sepal.Width: double
#> Petal.Length: double
#> Petal.Width: double
#> Species: dictionary<values=string, indices=int8, ordered=0>
#> arr: double
as_tibble(tab)
#> # A tibble: 150 x 7
#>    row.number Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>         <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          1          5.1         3.5          1.4         0.2 setosa 
#>  2          2          4.9         3            1.4         0.2 setosa 
#>  3          3          4.7         3.2          1.3         0.2 setosa 
#>  4          4          4.6         3.1          1.5         0.2 setosa 
#>  5          5          5           3.6          1.4         0.2 setosa 
#>  6          6          5.4         3.9          1.7         0.4 setosa 
#>  7          7          4.6         3.4          1.4         0.3 setosa 
#>  8          8          5           3.4          1.5         0.2 setosa 
#>  9          9          4.4         2.9          1.4         0.2 setosa 
#> 10         10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows, and 1 more variable: arr <dbl>

@xhochy
Copy link
Member

xhochy commented Feb 8, 2019

Rebased.

@romainfrancois
Copy link
Contributor Author

@xhochy was there a special need to rebase ? Just curious.
I usually only need to rebase after an R specific PR is squashed

@xhochy
Copy link
Member

xhochy commented Feb 8, 2019

I rebased master on the release commit of the JavaScript 0.4.0 release.

We sadly only add the tagging commits to master once the release vote has passed. Then we need to rebase all PRs as we keep merging patches while the release vote runs.

@romainfrancois romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 046ce90 to cc30604 Compare February 13, 2019 08:59
@romainfrancois romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch 3 times, most recently from ef8b3ef to 0898dde Compare February 26, 2019 08:39
@romainfrancois romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 0898dde to 2326d85 Compare March 6, 2019 13:39
@romainfrancois romainfrancois requested a review from wesm March 6, 2019 16:19
@romainfrancois
Copy link
Contributor Author

I think this is ready now.

@wesm wesm removed the WIP PR is work in progress label Mar 6, 2019
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments but this looks like a nice improvement. I think this can be merged after some small fixes /c comments

r/R/feather.R Show resolved Hide resolved
r/R/write_arrow.R Show resolved Hide resolved
r/src/recordbatch.cpp Outdated Show resolved Hide resolved
@romainfrancois
Copy link
Contributor Author

Thanks. I’ll deal with those in the morning.

@wesm
Copy link
Member

wesm commented Mar 7, 2019

@romainfrancois this needs to be rebased now after the lint fixes, sorry about that

@romainfrancois romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 4891ac6 to 8283a94 Compare March 8, 2019 08:06
@@ -33,4 +33,4 @@ CPPLINT=$CPP_BUILD_SUPPORT/cpplint.py
$CPP_BUILD_SUPPORT/run_cpplint.py \
--cpplint_binary=$CPPLINT \
--exclude_glob=$CPP_BUILD_SUPPORT/lint_exclusions.txt \
--source_dir=$SOURCE_DIR/src --quiet $1
--source_dir=$SOURCE_DIR/src --quiet
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we need to pass $1 to run_cpp_lint.py too ?

Removing it here allows to use

./r/lint.sh --fix

and let the tool fix the format

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay

@romainfrancois romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch 3 times, most recently from d42da88 to f51801d Compare March 29, 2019 13:33
@wesm wesm force-pushed the ARROW-3814/record_batch_from_arrays branch from f51801d to 7f656ea Compare May 30, 2019 18:07
@romainfrancois romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 7f656ea to 74a1e6b Compare May 31, 2019 09:13
@romainfrancois
Copy link
Contributor Author

I think this is good to go, but I'd like to merge #4413 first because currently this pr would fail against the new Dictionary changes from #4316

@wesm
Copy link
Member

wesm commented May 31, 2019

Sounds good

@romainfrancois romainfrancois force-pushed the ARROW-3814/record_batch_from_arrays branch from 74a1e6b to ab0cd16 Compare June 3, 2019 07:26
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -33,4 +33,4 @@ CPPLINT=$CPP_BUILD_SUPPORT/cpplint.py
$CPP_BUILD_SUPPORT/run_cpplint.py \
--cpplint_binary=$CPPLINT \
--exclude_glob=$CPP_BUILD_SUPPORT/lint_exclusions.txt \
--source_dir=$SOURCE_DIR/src --quiet $1
--source_dir=$SOURCE_DIR/src --quiet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay

expect_equal(s, batch$schema)

s <- schema(x = int32(), y = utf8())
expect_error(record_batch(x = 1:10, y = 1:10, schema = s))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If schema were a column name I guess you would have to pass the arguments differently

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.92%. Comparing base (5025126) to head (ab0cd16).

❗ There is a different number of reports uploaded between BASE (5025126) and HEAD (ab0cd16). Click for more details.

HEAD has 4 uploads less than BASE
Flag BASE (5025126) HEAD (ab0cd16)
5 1
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3565       +/-   ##
===========================================
- Coverage   87.94%   76.92%   -11.03%     
===========================================
  Files         737       51      -686     
  Lines       81709     1976    -79733     
  Branches     1253        0     -1253     
===========================================
- Hits        71863     1520    -70343     
+ Misses       9599      456     -9143     
+ Partials      247        0      -247     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants