Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(r): Support matrix objects as fixed-size-list arrays #692

Merged
merged 10 commits into from
Dec 17, 2024

Conversation

paleolimbot
Copy link
Member

Still needs some testing on the stream case, and is unfortunately not very zero copy; however, gets the job done (and I think fixes some cases where we would have otherwise silently handled a matrix as the storage type).

Inspired by #691!

library(nanoarrow)

df <- data.frame(x = 1:10)
df$matrix_col <- matrix(letters[1:20], ncol = 2, byrow = TRUE)

array <- as_nanoarrow_array(df)

# Default comes back as list_of(character())
convert_array(array) |> tibble::as_tibble()
#> # A tibble: 10 × 2
#>        x  matrix_col
#>    <int> <list<chr>>
#>  1     1         [2]
#>  2     2         [2]
#>  3     3         [2]
#>  4     4         [2]
#>  5     5         [2]
#>  6     6         [2]
#>  7     7         [2]
#>  8     8         [2]
#>  9     9         [2]
#> 10    10         [2]

# But can specify matrix
convert_array(
  array,
  tibble::tibble(x = integer(), matrix_col = matrix(character(), ncol = 2))
)
#> # A tibble: 10 × 2
#>        x matrix_col[,1] [,2] 
#>    <int> <chr>          <chr>
#>  1     1 a              b    
#>  2     2 c              d    
#>  3     3 e              f    
#>  4     4 g              h    
#>  5     5 i              j    
#>  6     6 k              l    
#>  7     7 m              n    
#>  8     8 o              p    
#>  9     9 q              r    
#> 10    10 s              t

Created on 2024-12-12 with reprex v2.1.1

@eddelbuettel
Copy link
Contributor

eddelbuettel commented Dec 13, 2024

Do you foresee enhancing the C(++) side of things as well? Or maybe enough is in place -- I have not tried constructing or consuming '+w:$N' objects yet.

@eddelbuettel
Copy link
Contributor

This is nice. When running with numeric vectors it is even closer to zero-copy as it looks like as we retain the buffer:

> library(nanoarrow)
> df <- data.frame(x = 1:10)
> df$matrix_col <- matrix(1:20, ncol = 2, byrow = TRUE)
> array <- as_nanoarrow_array(df)
> array
<nanoarrow_array struct[10]>
 $ length    : int 10
 $ null_count: int 0
 $ offset    : int 0
 $ buffers   :List of 1
  ..$ :<nanoarrow_buffer validity<bool>[null] ``
 $ children  :List of 2
  ..$ x         :<nanoarrow_array int32[10]>
  .. ..$ length    : int 10
  .. ..$ null_count: int 0
  .. ..$ offset    : int 0
  .. ..$ buffers   :List of 2
  .. .. ..$ :<nanoarrow_buffer validity<bool>[null] ``
  .. .. ..$ :<nanoarrow_buffer data<int32>[10][40 b]> `1 2 3 4 5 6 7 8 9 10`
  .. ..$ dictionary: NULL
  .. ..$ children  : list()
  ..$ matrix_col:<nanoarrow_array fixed_size_list(2)[10]>
  .. ..$ length    : int 10
  .. ..$ null_count: int 0
  .. ..$ offset    : int 0
  .. ..$ buffers   :List of 1
  .. .. ..$ :<nanoarrow_buffer validity<bool>[null] ``
  .. ..$ children  :List of 1
  .. .. ..$ item:<nanoarrow_array int32[20]>
  .. .. .. ..$ length    : int 20
  .. .. .. ..$ null_count: int 0
  .. .. .. ..$ offset    : int 0
  .. .. .. ..$ buffers   :List of 2
  .. .. .. .. ..$ :<nanoarrow_buffer validity<bool>[null] ``
  .. .. .. .. ..$ :<nanoarrow_buffer data<int32>[20][80 b]> `1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20`
  .. .. .. ..$ dictionary: NULL
  .. .. .. ..$ children  : list()
  .. ..$ dictionary: NULL
 $ dictionary: NULL
> 
> convert_array(array, tibble::tibble(x = integer(), matrix_col = matrix(integer(), ncol = 2)))
# A tibble: 10 × 2
       x matrix_col[,1]  [,2]
   <int>          <int> <int>
 1     1              1     2
 2     2              3     4
 3     3              5     6
 4     4              7     8
 5     5              9    10
 6     6             11    12
 7     7             13    14
 8     8             15    16
 9     9             17    18
10    10             19    20
> 

@paleolimbot
Copy link
Member Author

This is nice. When running with numeric vectors it is even closer to zero-copy as it looks like as we retain the buffer:

Yes, it's so close! If R did row-major matrices it would be. I need one more iteration here to get this down to a single matrix() call as the source of the copy but it's relatively close!

Do you foresee enhancing the C(++) side of things as well?

Getting the nested offsets is still pretty tricky. I'd like to provide something like:

int ArrowArrayViewGetListElement(ArrowArrayView* view, int64_t offset, ArrowArrayView** view_out, int64_t* offset_out, int64_t* length_out);

...which I think will be rather helpful for people to ingest the wild combination of Arrow storage that could possible give you a list as element of an array (e.g., listview, dictionary of list, run end encoded of list, etc.).

@paleolimbot paleolimbot marked this pull request as ready for review December 16, 2024 16:34
@paleolimbot
Copy link
Member Author

meson-build CI failure is I think due to an Ubuntu runner upgrade.

Copy link
Member

@amoeba amoeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some high level manual testing and this is working as expected for me.

@paleolimbot
Copy link
Member Author

Thank you for taking a look!

@paleolimbot paleolimbot merged commit 1eaa8d5 into apache:main Dec 17, 2024
34 of 36 checks passed
@paleolimbot paleolimbot deleted the r-matrix-support branch December 17, 2024 03:14
@eddelbuettel eddelbuettel mentioned this pull request Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants