ARROW-15471: [R] ExtensionType support in R #12467

paleolimbot · 2022-02-18T22:23:22Z

This PR implements extension type support and registration in the R bindings (as has been possible in the Python bindings for some time). The details still need to be worked out, but we at least have a working pattern:

library(arrow, warn.conflicts = FALSE)
library(R6)

SomeExtensionTypeSubclass <- R6Class(
  "SomeExtensionTypeSubclass", inherit = arrow:::ExtensionType,
  public = list(
    some_custom_method = function() {
      private$some_custom_field
    },
    
    .Deserialize = function(storage_type, extension_name, extension_metadata) {
      private$some_custom_field <- head(extension_metadata, 5)
    }
  ),
  private = list(
    some_custom_field = NULL
  )
)

SomeExtensionArraySubclass <- R6Class(
  "SomeExtensionArraySubclass", inherit = arrow:::ExtensionArray,
  public = list(
    some_custom_method = function() {
      self$type$some_custom_method()
    }
  )
)

type <- arrow:::MakeExtensionType(
  int32(),
  "some_extension_subclass",
  charToRaw("some custom metadata"),
  type_class = SomeExtensionTypeSubclass,
  array_class = SomeExtensionArraySubclass
)


arrow:::RegisterExtensionType(type)

# survives the C API round trip
ptr_type <- arrow:::allocate_arrow_schema()
type$export_to_c(ptr_type)
type2 <- arrow:::DataType$import_from_c(ptr_type)


type2
#> SomeExtensionTypeSubclass
#> SomeExtensionTypeSubclass <some custom metadata>
type2$some_custom_method()
#> [1] 73 6f 6d 65 20

(array <- type$WrapArray(Array$create(1:10)))
#> SomeExtensionArraySubclass
#> <SomeExtensionTypeSubclass <some custom metadata>>
#> [
#>   1,
#>   2,
#>   3,
#>   4,
#>   5,
#>   6,
#>   7,
#>   8,
#>   9,
#>   10
#> ]
array$some_custom_method()
#> [1] 73 6f 6d 65 20

ptr_array <- arrow:::allocate_arrow_array()
array$export_to_c(ptr_array, ptr_type)
(array2 <- Array$import_from_c(ptr_array, ptr_type))
#> SomeExtensionArraySubclass
#> <SomeExtensionTypeSubclass <some custom metadata>>
#> [
#>   1,
#>   2,
#>   3,
#>   4,
#>   5,
#>   6,
#>   7,
#>   8,
#>   9,
#>   10
#> ]

arrow:::delete_arrow_schema(ptr_type)
arrow:::delete_arrow_array(ptr_array)

^{Created on 2022-02-18 by the reprex package (v2.0.1)}

github-actions · 2022-02-18T22:23:39Z

https://issues.apache.org/jira/browse/ARROW-15471

github-actions · 2022-02-18T22:23:40Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

jonkeane

This approach looks solid, I don't have any specific comments (aside from a tiny style nit).

Would it be possible to include in the tests a more realistic example? It doesn't have to be as complicated as geo objects, but something that defines some data structure + shows how one might interact with it might be helpful.

This might be helpful or not, but my first thought was something kinda like haven:: labelled_spss() Not that we want to or need to support that, or that we want to emulate spss. But it's a funny(ish) type that is clearly outside of the scope for arrow types, but could have some interesting custom bits about them that might make the tests a bit more demonstrative of what this could do?

r/R/extension.R

jonkeane

A few comments, I would also love to get an opinion from @romainfrancois if he has time to take a look.

I also will pull this and mess around with it locally today or on Monday — but wanted to get these comments out before then.

r/R/arrow-package.R

jonkeane · 2022-03-18T18:47:36Z

r/R/arrow-tabular.R

base here is a function that is either Table__to_dataframe or RecordBatch__to_dataframe yeah? Basically the constructor to be used if this isn't an extension type?

It might be nice to have a slightly more descriptive name for that?

r/R/arrow-tabular.R

r/R/extension.R

paleolimbot · 2022-03-17T13:57:08Z

r/R/arrow-tabular.R

I'm worried about this implementation because it's unintuitive...this gets used by Table$to_data_frame() and RecordBatch$to_data_frame() because both of those call into C++ to do their thing (but the C++ implementation doesn't know about extension types. Maybe it should?). Pretty much everwhere else we avoid looping over columns in R but that might be better than added complexity at the C++ level?

paleolimbot · 2022-03-17T13:59:09Z

r/src/extension.cpp

It's true that I need to use std::shared_ptr<cpp11::environment> to store a the r6_class_ field here instead of cpp11::environment to avoid a crash, but I'm not entirely sure I'm using std::shared_ptr correctly.

paleolimbot · 2022-03-17T14:09:46Z

r/src/extension.cpp

This is the main threading concern...the Deserialize() method gets called from other threads frequently but unless it's been passed through an R6 instance in R, we don't know if the metadata is valid or not.

paleolimbot · 2022-03-17T14:12:08Z

r/R/extension.R

I made up the "dot prefix means protected method" thing here...I don't know if there is a convention for "protected"-style methods in R6 but would be happy to use it if it exists.

paleolimbot · 2022-03-21T12:33:26Z

r/src/extension.cpp

This is another threading concern...the data types can't be checked for equality if the serialized data is not identical and the comparison occurs on another thread. I think this might happen when reading a multi-file dataset if some of the files were written differently.

r/R/extension.R

jonkeane · 2022-03-21T20:10:28Z

This is very exciting. Like I mentioned earlier, I wanted to try this out locally to see what this looks like. The example is a little contrived (and actually AFAIU, not totally accurate depending on the time of year!)

Is it expected that roundtripping without the {vctrs} class wouldn't work? (Or did I do something wrong here?

library(arrow, warn.conflicts = FALSE)

# Is this the minimal structure to create a custom class like this?
KoreanAge <- R6::R6Class(
  "KoreanAge", 
  inherit = ExtensionType,
  public = list(
    .array_as_vector = function(extension_array) {
      extension_array$storage()$as_vector() + 1
    }
  )
)

KoreanAge <- new_extension_type(
  int32(),
  "KoreanAge",
  charToRaw("Korean Age, but stored as the western age value"),
  type_class = KoreanAge
)

arr <- new_extension_array(c(0, 1, 2), KoreanAge)

# What we expect (storage + 1)
as.vector(arr)
#> [1] 1 2 3

# But roundtripping doesn't seem to work?
tf <- tempfile()
write_feather(arrow_table(col = arr), tf)

tab <- read_feather(tf, as_data_frame = FALSE)

type(tab$col)
#> Int32
#> int32

as.vector(tab$col)
#> [1] 0 1 2

Also, should we export ExtensionArray? It doesn't look like it is, but we do have Array etc. exported. The docs additions are really great + descriptive. But I do wonder if an example (or two) would be nice, even if they were pretty trivial extension type like this (or some of the vctrs examples with percentages and the like).

Do we have a follow on for what to do about printing the array? You'll see here you print the underlying storage type, which might be fine, but that has confused some folks before.

paleolimbot · 2022-03-22T12:33:59Z

The key step that was missing for the roundtrip was register_extension_type(), which is needed so that Arrow C++ knows not to discard the extension metadata when it encounters the type! (see details).

I should probably export ExtensionArray and use ExtensionArray$create() rather than new_extension_array() since it's more arrow-ish to do that. Maybe ExtensionType$create() instead of new_extension_type() is where extension type creation should go, too.

Printing is a good point...definitely confusing in the case of an extension type!

Details

library(arrow, warn.conflicts = FALSE)

KoreanAge <- R6::R6Class(
  "KoreanAge", 
  inherit = ExtensionType,
  public = list(
    .array_as_vector = function(extension_array) {
      extension_array$storage()$as_vector() + 1
    }
  )
)

# constructor helpers
korean_age <- function() {
  new_extension_type(
    int32(),
    "KoreanAge",
    charToRaw("Korean Age, but stored as the western age value"),
    type_class = KoreanAge
  )
}

korean_age_array <- function(age_korean) {
  new_extension_array(age_korean - 1, korean_age())
}

(arr <- korean_age_array(1:3))
#> ExtensionArray
#> <KoreanAge <Korean Age, but stored as the western age value>>
#> [
#>   0,
#>   1,
#>   2
#> ]
as.vector(arr)
#> [1] 1 2 3

# you need to register the type for Arrow C++ to keep the extension type
# slash metadata when it encounters it at the C++ level (import from C
# and reading files)
register_extension_type(korean_age())

tf <- tempfile()
write_feather(arrow_table(col = arr), tf)

tab <- read_feather(tf, as_data_frame = FALSE)

type(tab$col)
#> KoreanAge
#> KoreanAge <Korean Age, but stored as the western age value>

as.vector(tab$col)
#> [1] 0 1 2

jonkeane · 2022-03-22T13:08:00Z

Aaaah, yeah I totally missed register_extension_type() oops!

Though weirdly(?) the as.vector() here is still wrong, it should be 1 2 3 (since the ->R conversion on this Extension type should have 1 added to it)

# you need to register the type for Arrow C++ to keep the extension type
# slash metadata when it encounters it at the C++ level (import from C
# and reading files)
register_extension_type(korean_age())

tf <- tempfile()
write_feather(arrow_table(col = arr), tf)

tab <- read_feather(tf, as_data_frame = FALSE)

type(tab$col)
#> KoreanAge
#> KoreanAge <Korean Age, but stored as the western age value>

as.vector(tab$col)
#> [1] 0 1 2

I should probably export ExtensionArray and use ExtensionArray$create() rather than new_extension_array() since it's more arrow-ish to do that.

In other places we've exposed both, which I think isn't bad here (it's slightly more API we manage, but having the R6 stuff exposed makes it easier to extend, and having the foo_bar_baz looks nicer | is a bit more friendly to users|doesn't force someone to learn all the R6 if what they want or need is available there)...

Maybe ExtensionType$create() instead of new_extension_type() is where extension type creation should go, too.

Hmm, if we think it's ok to do the new_extension_type() inside of ExtensionType$create() I would agree that having a second step like that is unnecessary. Though with the proper docs having the R6 way of establishing these need to be slightly more verbose and you need to manage other things is inline with our other implementations of things like that (e.g. creating one's own filesystem, etc.)

paleolimbot · 2022-03-22T13:37:42Z

Ah yes, you clearly should have remembered to implement KoreanAge$.chunked_array_as_vector() (...I'll implement a better default method for that one...)

paleolimbot · 2022-03-22T13:55:50Z

Printing is a good point...definitely confusing in the case of an extension type!

I think we have to punt on the printing...there isn't a way to customize how other Array objects are printed (it all goes through ChunkedArray::ToString()/Array::ToString() at the C++ level and that is very specifically a PrettyPrint() C++ thing. We have some tools available in R to improve the printing of all ArrowTabular/ChunkedArray/Arrays, and maybe a PR with that as its scope would be more appropriate.

jorisvandenbossche

This is really cool!

Looking a little bit through the code and the test cases, I am wondering if what we call in python "parametrized" extension types are possible. There is a test case that ensures you can roundtrip an extension type with a different storage type as how it was registered. But those test examples only override Deserialize; is it also possible to override Serialize? (so you can store some custom field in there).

paleolimbot · 2022-03-22T16:28:36Z

I think so! This example is probably better than the example I have in there right now because the serializing/deserializing of the metadata is a big part of the picture and the current documentation example only implements the array-to-r conversion. Check to make sure it's what you meant though!

I didn't implement this quite in the same way as the Python one...I think in Python the workflow (and correct me if I'm wrong) is along the lines of

Create a ExtensionTypeSubclass(with, parameters, like, this) instance
C++ calls the __arrow_ext_serialize__ method of the Python instance when the serialized metadata is needed

In R it's totally bananas to call from C++ into R and we can't do it safely most of the time. So instead I wrote it like:

Create the extension metadata before either the R or C++ instance is created
Create a C++ instance of RExtensionType() that contains the definitive copy of the serialized extension metadata
Create the ExtensionTypeSubclass R6 instance and then call the R6 instance's .Deserialize() method to populate data fields.

It isn't all that straightforward to do (the way I've implemented it in R) and I'm not sure I like how it's implemented (but I'm also not sure how to make it better).

Details

library(arrow, warn.conflicts = FALSE)

QuantizedType <- R6::R6Class(
  "QuantizedType", 
  inherit = ExtensionType,
  public = list(
    center = function() private$.center,
    scale = function() private$.scale,
    
    .array_as_vector = function(extension_array) {
      as.vector(extension_array$storage() / private$.scale + private$.center)
    },
    
    .Deserialize = function(storage_type, extension_name, extension_metadata) {
      parsed <- jsonlite::fromJSON(self$extension_metadata_utf8())
      private$.center <- as.double(parsed$center)
      private$.scale <- as.double(parsed$scale)
    }
  ),
  private = list(
    .center = NULL,
    .scale = NULL
  )
)

quantized <- function(center = 0, scale = 1, storage_type = int32()) {
  new_extension_type(
    storage_type = storage_type,
    extension_name = "arrow.example.quantized",
    extension_metadata = jsonlite::toJSON(
      list(
        center = jsonlite::unbox(as.double(center)),
        scale = jsonlite::unbox(as.double(scale))
      )
    ),
    type_class = QuantizedType
  )
}

quantized_array <- function(x, center = 0, scale = 1, 
                            storage_type = int32()) {
  type <- quantized(center, scale, storage_type)
  new_extension_array(
    Array$create((x - center) * scale, type = storage_type),
    type
  )
}

reregister_extension_type(quantized())

(vals <- runif(5, min = 19, max = 21))
#> [1] 19.33526 19.47467 19.14288 20.39798 19.04523

(array <- quantized_array(
  vals,
  center = 20,
  scale = 2 ^ 15 - 1,
  storage_type = int16())
)
#> ExtensionArray
#> <QuantizedType <{"center":20,"scale":32767}>>
#> [
#>   -21781,
#>   -17213,
#>   -28085,
#>   13040,
#>   -31284
#> ]

array$type$center()
#> [1] 20
array$type$scale()
#> [1] 32767

as.vector(array)
#> [1] 19.33528 19.47468 19.14289 20.39796 19.04526

jonkeane · 2022-03-22T17:58:48Z

I haven't dug too deeply yet, but the QuantizedType example is fantastic. I really like that _ think it shows off what/how this is capable of doing in a way that the previous examples (both mine and the ones in the test) didn't quite fully get to

paleolimbot · 2022-03-22T18:28:43Z

I like it too! I like that it isn't R specific, that it needs parameterization, and that transforming it back into an R vector needs a calculation. I didn't do any explaining in the docs about why quantized types are cool...I've used them in raster GIS stuff because they enable really efficient storage (in exchange for some precision loss).

jorisvandenbossche · 2022-03-22T18:35:27Z

I think so! This example is probably better

That example indeed nicely shows that it is possible.
I suppose one thing that confused me is that new_extension_type(..) rather creates a new instance of a given type, and not a new type itself (and thus this allows you to create instances of a given extension type with different metadata).

It isn't all that straightforward to do (the way I've implemented it in R) and I'm not sure I like how it's implemented (but I'm also not sure how to make it better).

I think the fact that you already determine the serialized metadata upfront in R is fine / nice (the fact that in Python this is C++ calling back into python is kind of a complication, as the metadata could be known at the point when instantiating the python extension type instance).
One thing I was wondering: the reason you can't define a .Serialize method on the QuantizedType that can hold this logic to create the serialized metadata (instead of doing that inside new_extension_type( ..., extension_metadata=... )), is that this class instance is only created after the underlying C++ extension type, for which you already need the serialized metadata? It's not possible to first initialize the QuantizedType object, and let that call Serialize to then create the underlying C++ object? (disclaimer: I know nothing about R classes :))

paleolimbot · 2022-03-28T18:16:57Z

Sorry Joris for the late reply!

is that this class instance is only created after the underlying C++ extension type, for which you already need the serialized metadata?

That's a really good question...it's might possible to do this the other way around, which would make it a little more intuitive. Creating the R6 instance from C++ (usually via automatically generated wrapper code) is the default...the default constructor errors if there isn't a valid one:

arrow/r/R/arrow-package.R

Lines 310 to 322 in 1b796ec

    
           initialize = function(xp) self$set_pointer(xp), 
        
           pointer = function() get(".:xp:.", envir = self), 
        
           `.:xp:.` = NULL, 
        
           set_pointer = function(xp) { 
        
             if (!inherits(xp, "externalptr")) { 
        
               stop( 
        
                 class(self)[1], "$new() requires a pointer as input: ", 
        
                 "did you mean $create() instead?", 
        
                 call. = FALSE 
        
               ) 
        
             } 
        
             assign(".:xp:.", xp, envir = self) 
        
           },

The alternative would be to materialize the C++ object when it is required. I will play with that because I really don't like the dot prefix thing that I'm currently doing.

paleolimbot · 2022-03-29T13:48:56Z

A few more modifications:

I moved conversion of extension types to R objects into the Converter API and removed the modifications to ChunkedArray, Table, and RecordBatch (Romain suggested this). This feels a lot better and is more likely to "just work" in more places.
I implemented the geoarrow side of this to make sure it will work. It does (except for a bit in one of the compute kernels where Concatenate doesn't work for extension types)! See details and Implement arrow::ExtensionType geoarrow/geoarrow-r#7
I did play with reversing the order of instantiation of the C++ and R6 objects...I think that change is a kind of a big one with respect to how all objects get passed around in our current implementation. I did, however, rename the methods to match the C++ method names and it feels a lot better now.

Details

# remotes::install_github("apache/arrow#12467")
# remotes::install_github("paleolimbot/geoarrow@arrow-ext-type")
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(geoarrow)

places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
places <- open_dataset(places_folder)
places$schema$geometry$type
#> GeoArrowType
#> point GEOGCS["WGS 84",DATUM["WGS_...
places$schema$geometry$type$crs
#> [1] "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],AXIS[\"Latitude\",NORTH]]"

# works!
Scanner$create(places)$ToTable()
#> Table
#> 7255 rows x 6 columns
#> $osm_id <string>
#> $code <int32>
#> $population <double>
#> $name <string>
#> $geometry <point GEOGCS["WGS 84",DATUM["WGS_...>
#> $fclass <string>
#> 
#> See $metadata for additional Schema metadata

# works!
as.data.frame(Scanner$create(places)$ToTable())
#> # A tibble: 7,255 × 6
#>    osm_id      code population name           geometry                    fclass
#>    <chr>      <int>      <dbl> <chr>          <wk_wkb>                    <chr> 
#>  1 21040334    1001      50781 Roskilde       <POINT (12.08192 55.64335)> city  
#>  2 21040360    1001      72398 Esbjerg        <POINT (8.452075 55.46649)> city  
#>  3 26559154    1001      62687 Randers        <POINT (10.03715 56.46175)> city  
#>  4 26559170    1001      60508 Kolding        <POINT (9.47905 55.4895)>   city  
#>  5 26559198    1001      56567 Vejle          <POINT (9.533324 55.70001)> city  
#>  6 26559213    1001     273077 Aarhus         <POINT (10.2134 56.14963)>  city  
#>  7 26559274    1001     178210 Odense         <POINT (10.38521 55.39972)> city  
#>  8 1368129781  1001      58646 Horsens        <POINT (9.844477 55.86117)> city  
#>  9 2247730880  1001     114194 Aalborg        <POINT (9.921526 57.04626)> city  
#> 10 393558713   1030          0 Englebjerggård <POINT (11.77737 55.2004)>  farm  
#> # … with 7,245 more rows

# unfortunately, this fails...
places %>% 
  filter(population > 100000) %>% 
  select(name, population, fclass, geometry) %>% 
  arrange(desc(population)) %>% 
  collect()
#> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
#> ! NotImplemented: concatenation of extension<geoarrow.point>
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195  VisitTypeInline(*out_->type, this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590  ConcatenateImpl(data, pool).Concatenate(&out_data)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025  Concatenate(values.chunks(), ctx->memory_pool())
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084  TakeCA(*table.column(j), indices, options, ctx)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:375  impl_->DoFinish()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()

# ...unless we unregister the extension type and use geoarrow_collect()
arrow::unregister_extension_type("geoarrow.point")
open_dataset(places_folder) %>% 
  filter(population > 100000) %>% 
  select(name, population, fclass, geometry) %>% 
  arrange(desc(population)) %>% 
  geoarrow_collect()
#> # A tibble: 5 × 4
#>   name          population fclass           geometry                   
#>   <chr>              <dbl> <chr>            <wk_wkb>                   
#> 1 København         613288 national_capital <POINT (12.57007 55.68672)>
#> 2 Aarhus            273077 city             <POINT (10.2134 56.14963)> 
#> 3 Odense            178210 city             <POINT (10.38521 55.39972)>
#> 4 Aalborg           114194 city             <POINT (9.921526 57.04626)>
#> 5 Frederiksberg     102029 suburb           <POINT (12.53262 55.67802)>

^{Created on 2022-03-29 by the reprex package (v2.0.1)}

jonkeane

This looks great, and agree this structure is much easier to follow + cleaner. A few minor comments, but I'm happy to see this merged in (and we still have a bit of time if we need to clean it up before the release)

jonkeane · 2022-03-29T17:39:11Z

r/R/extension.R

I'm curious what the reason is for wrapping () here. I've seen it used to silence output, but both of these should already have no output, right?

I did it to explicitly print the output (that was more useful in the initial version, which was a reprex, than it is here in the example, although the printing part does affect the pkgdown output). Maybe best to remove it if it's confusing?

Oh, oh right it's the opposite of silencing 🤦. We can keep it in — in the actual examples on the pkgdown site it'll be obvious what's going on.

jonkeane · 2022-03-29T17:47:20Z

r/R/extension.R

This is minor (and we might also use this terminology like this elsewhere), but we might add something extra to this name so that it's clear that this is not serializing the array but rather just the metadata information (and same for Deserialize

I stole these names from the C++, but we can rename them to something else (an earlier version of this PR used extension_metadata() and extension_metadata_utf8(); the Python version also uses some alternative names so there's certainly precedent).

Yeah, those names read much more naturally to me

Any thoughts on the best name for Deserialize()? In Python this is __arrow_ext_deserialize__() (although does something slightly different)...maybe restore_extension()? restore()? It also could be left out (subclassers would be then forced to override initialize(xp), which might be more intuitive if the subclasser knows anything about R6).

deserialize_extension seems better than plain Deserialize (to me it reads more as being about the extension type itself and not the data...) Or maybe deserialize_ext_metadata?

Antoine suggested populate_instance(), although I think I like deserialize_extension() a bit better. I'll give it one more pass and see what feels the best.

I went with deserialize_instance() (we are deserializing, but maybe more clear that it's specific to this instance and not creating a new one?)

hydrate & dehydrate are common terms I've seen for this type of operation too: https://stackoverflow.com/questions/6991135/what-does-it-mean-to-hydrate-an-object

paleolimbot · 2022-03-29T19:20:51Z

I messaged Joris and Romain today asking for reviews...it's getting close but I think will benefit from their take!

…hod can be used elsewhere

…ector()

ursabot · 2022-04-09T10:30:57Z

Benchmark runs are scheduled for baseline = dd52b38 and contender = 489aada. 489aada is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.38% ⬆️0.0%] test-mac-arm
[Failed ⬇️2.5% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.68% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/477| 489aada5 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/462| 489aada5 test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/463| 489aada5 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/472| 489aada5 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/476| dd52b384 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/461| dd52b384 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/462| dd52b384 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/471| dd52b384 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the Component: R label Feb 18, 2022

paleolimbot mentioned this pull request Mar 1, 2022

ARROW-9235: [R] Support for connection class when reading and writing files #12323

Closed

jonkeane reviewed Mar 2, 2022

View reviewed changes

r/R/extension.R Outdated Show resolved Hide resolved

paleolimbot marked this pull request as ready for review March 17, 2022 14:12

jonkeane reviewed Mar 18, 2022

View reviewed changes

paleolimbot commented Mar 21, 2022

View reviewed changes

jonkeane reviewed Mar 21, 2022

View reviewed changes

r/R/extension.R Outdated Show resolved Hide resolved

jorisvandenbossche reviewed Mar 22, 2022

View reviewed changes

jonkeane approved these changes Mar 29, 2022

View reviewed changes

paleolimbot added 6 commits April 7, 2022 15:32

initial working extension type

df6c1c6

simplify implementation

d7ea064

test subclasses

7d69c71

with some array infrastructure

2055776

with registration of custom subclasses

eec9407

get array and type subclasses working better

6abc0db

paleolimbot added 23 commits April 7, 2022 15:32

actually test UTF-8 conversion and printing of non-utf-8 metadata

30172b6

better example

df25ad1

Less redundant .Deserialize() method

26fb363

try to fix windows test

685e00a

move extension type definition to extension.h so that the Convert met…

66772d3

…hod can be used elsewhere

use Converter for extension types

f2b2e45

add license

5711594

reomve unused extension type method

93a0326

.ToString -> ToString()

6ae66f6

extension_metadata -> Serialize

b03988a

.Deserialize -> Deserialize

743f657

.ExtensionEquals -> ExtensionEquals

396b38e

consolidate .array_as_vector() and .chunked_array_as_vector() to as_v…

9aa909c

…ector()

better testing of array conversion

68245a6

test with vctrs extension type

ef459ee

undo changes to Table/RecordBatch

2d7fa62

fix typos, remove unused field in Converter_Extension

a0d658b

move extension type logic to the extension type converter

1d71a03

Use old ChunedArray$as_vector() method that used the converter API

374a6ff

update comments to remind about the future SafeCallIntoR.

b45d181

Serialize -> extension_metadata

4130110

Deserialize -> populate_instance

c105112

populate_instance() -> deserialize_instance()

2e5aa01

paleolimbot force-pushed the r-extension-type branch from edd407d to 2e5aa01 Compare April 7, 2022 18:34

paleolimbot added 2 commits April 7, 2022 15:58

use SafeCallIntoR()

46386cb

punt on SafeCallIntoR for Deserialize()

5c847cb

jonkeane closed this in 489aada Apr 8, 2022

paleolimbot deleted the r-extension-type branch December 9, 2022 16:35

ARROW-15471: [R] ExtensionType support in R #12467

ARROW-15471: [R] ExtensionType support in R #12467

Uh oh!

Conversation

paleolimbot commented Feb 18, 2022

Uh oh!

github-actions bot commented Feb 18, 2022

Uh oh!

github-actions bot commented Feb 18, 2022

Uh oh!

jonkeane left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonkeane left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonkeane commented Mar 21, 2022

Uh oh!

paleolimbot commented Mar 22, 2022

Uh oh!

jonkeane commented Mar 22, 2022

Uh oh!

paleolimbot commented Mar 22, 2022

Uh oh!

paleolimbot commented Mar 22, 2022

Uh oh!

jorisvandenbossche left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

paleolimbot commented Mar 22, 2022

Uh oh!

jonkeane commented Mar 22, 2022

Uh oh!

paleolimbot commented Mar 22, 2022

Uh oh!

jorisvandenbossche commented Mar 22, 2022

Uh oh!

paleolimbot commented Mar 28, 2022

Uh oh!

paleolimbot commented Mar 29, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonkeane left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

jorisvandenbossche left a comment •

edited

Loading

paleolimbot commented Mar 29, 2022 •

edited

Loading