Skip to content

Conversation

@paleolimbot
Copy link
Member

This PR implements extension type support and registration in the R bindings (as has been possible in the Python bindings for some time). The details still need to be worked out, but we at least have a working pattern:

library(arrow, warn.conflicts = FALSE)
library(R6)

SomeExtensionTypeSubclass <- R6Class(
  "SomeExtensionTypeSubclass", inherit = arrow:::ExtensionType,
  public = list(
    some_custom_method = function() {
      private$some_custom_field
    },
    
    .Deserialize = function(storage_type, extension_name, extension_metadata) {
      private$some_custom_field <- head(extension_metadata, 5)
    }
  ),
  private = list(
    some_custom_field = NULL
  )
)

SomeExtensionArraySubclass <- R6Class(
  "SomeExtensionArraySubclass", inherit = arrow:::ExtensionArray,
  public = list(
    some_custom_method = function() {
      self$type$some_custom_method()
    }
  )
)

type <- arrow:::MakeExtensionType(
  int32(),
  "some_extension_subclass",
  charToRaw("some custom metadata"),
  type_class = SomeExtensionTypeSubclass,
  array_class = SomeExtensionArraySubclass
)


arrow:::RegisterExtensionType(type)

# survives the C API round trip
ptr_type <- arrow:::allocate_arrow_schema()
type$export_to_c(ptr_type)
type2 <- arrow:::DataType$import_from_c(ptr_type)


type2
#> SomeExtensionTypeSubclass
#> SomeExtensionTypeSubclass <some custom metadata>
type2$some_custom_method()
#> [1] 73 6f 6d 65 20

(array <- type$WrapArray(Array$create(1:10)))
#> SomeExtensionArraySubclass
#> <SomeExtensionTypeSubclass <some custom metadata>>
#> [
#>   1,
#>   2,
#>   3,
#>   4,
#>   5,
#>   6,
#>   7,
#>   8,
#>   9,
#>   10
#> ]
array$some_custom_method()
#> [1] 73 6f 6d 65 20

ptr_array <- arrow:::allocate_arrow_array()
array$export_to_c(ptr_array, ptr_type)
(array2 <- Array$import_from_c(ptr_array, ptr_type))
#> SomeExtensionArraySubclass
#> <SomeExtensionTypeSubclass <some custom metadata>>
#> [
#>   1,
#>   2,
#>   3,
#>   4,
#>   5,
#>   6,
#>   7,
#>   8,
#>   9,
#>   10
#> ]

arrow:::delete_arrow_schema(ptr_type)
arrow:::delete_arrow_array(ptr_array)

Created on 2022-02-18 by the reprex package (v2.0.1)

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach looks solid, I don't have any specific comments (aside from a tiny style nit).

Would it be possible to include in the tests a more realistic example? It doesn't have to be as complicated as geo objects, but something that defines some data structure + shows how one might interact with it might be helpful.

This might be helpful or not, but my first thought was something kinda like haven:: labelled_spss() Not that we want to or need to support that, or that we want to emulate spss. But it's a funny(ish) type that is clearly outside of the scope for arrow types, but could have some interesting custom bits about them that might make the tests a bit more demonstrative of what this could do?

@paleolimbot paleolimbot marked this pull request as ready for review March 17, 2022 14:12
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, I would also love to get an opinion from @romainfrancois if he has time to take a look.

I also will pull this and mess around with it locally today or on Monday — but wanted to get these comments out before then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base here is a function that is either Table__to_dataframe or RecordBatch__to_dataframe yeah? Basically the constructor to be used if this isn't an extension type?

It might be nice to have a slightly more descriptive name for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about this implementation because it's unintuitive...this gets used by Table$to_data_frame() and RecordBatch$to_data_frame() because both of those call into C++ to do their thing (but the C++ implementation doesn't know about extension types. Maybe it should?). Pretty much everwhere else we avoid looping over columns in R but that might be better than added complexity at the C++ level?

Comment on lines 33 to 36
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that I need to use std::shared_ptr<cpp11::environment> to store a the r6_class_ field here instead of cpp11::environment to avoid a crash, but I'm not entirely sure I'm using std::shared_ptr correctly.

Comment on lines 136 to 146
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main threading concern...the Deserialize() method gets called from other threads frequently but unless it's been passed through an R6 instance in R, we don't know if the metadata is valid or not.

r/R/extension.R Outdated
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made up the "dot prefix means protected method" thing here...I don't know if there is a convention for "protected"-style methods in R6 but would be happy to use it if it exists.

Comment on lines 103 to 108
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another threading concern...the data types can't be checked for equality if the serialized data is not identical and the comparison occurs on another thread. I think this might happen when reading a multi-file dataset if some of the files were written differently.

@jonkeane
Copy link
Member

This is very exciting. Like I mentioned earlier, I wanted to try this out locally to see what this looks like. The example is a little contrived (and actually AFAIU, not totally accurate depending on the time of year!)

Is it expected that roundtripping without the {vctrs} class wouldn't work? (Or did I do something wrong here?

library(arrow, warn.conflicts = FALSE)

# Is this the minimal structure to create a custom class like this?
KoreanAge <- R6::R6Class(
  "KoreanAge", 
  inherit = ExtensionType,
  public = list(
    .array_as_vector = function(extension_array) {
      extension_array$storage()$as_vector() + 1
    }
  )
)

KoreanAge <- new_extension_type(
  int32(),
  "KoreanAge",
  charToRaw("Korean Age, but stored as the western age value"),
  type_class = KoreanAge
)

arr <- new_extension_array(c(0, 1, 2), KoreanAge)

# What we expect (storage + 1)
as.vector(arr)
#> [1] 1 2 3

# But roundtripping doesn't seem to work?
tf <- tempfile()
write_feather(arrow_table(col = arr), tf)

tab <- read_feather(tf, as_data_frame = FALSE)

type(tab$col)
#> Int32
#> int32

as.vector(tab$col)
#> [1] 0 1 2

Also, should we export ExtensionArray? It doesn't look like it is, but we do have Array etc. exported. The docs additions are really great + descriptive. But I do wonder if an example (or two) would be nice, even if they were pretty trivial extension type like this (or some of the vctrs examples with percentages and the like).

Do we have a follow on for what to do about printing the array? You'll see here you print the underlying storage type, which might be fine, but that has confused some folks before.

@paleolimbot
Copy link
Member Author

The key step that was missing for the roundtrip was register_extension_type(), which is needed so that Arrow C++ knows not to discard the extension metadata when it encounters the type! (see details).

I should probably export ExtensionArray and use ExtensionArray$create() rather than new_extension_array() since it's more arrow-ish to do that. Maybe ExtensionType$create() instead of new_extension_type() is where extension type creation should go, too.

Printing is a good point...definitely confusing in the case of an extension type!

Details
library(arrow, warn.conflicts = FALSE)

KoreanAge <- R6::R6Class(
  "KoreanAge", 
  inherit = ExtensionType,
  public = list(
    .array_as_vector = function(extension_array) {
      extension_array$storage()$as_vector() + 1
    }
  )
)

# constructor helpers
korean_age <- function() {
  new_extension_type(
    int32(),
    "KoreanAge",
    charToRaw("Korean Age, but stored as the western age value"),
    type_class = KoreanAge
  )
}

korean_age_array <- function(age_korean) {
  new_extension_array(age_korean - 1, korean_age())
}

(arr <- korean_age_array(1:3))
#> ExtensionArray
#> <KoreanAge <Korean Age, but stored as the western age value>>
#> [
#>   0,
#>   1,
#>   2
#> ]
as.vector(arr)
#> [1] 1 2 3

# you need to register the type for Arrow C++ to keep the extension type
# slash metadata when it encounters it at the C++ level (import from C
# and reading files)
register_extension_type(korean_age())

tf <- tempfile()
write_feather(arrow_table(col = arr), tf)

tab <- read_feather(tf, as_data_frame = FALSE)

type(tab$col)
#> KoreanAge
#> KoreanAge <Korean Age, but stored as the western age value>

as.vector(tab$col)
#> [1] 0 1 2

@jonkeane
Copy link
Member

Aaaah, yeah I totally missed register_extension_type() oops!

Though weirdly(?) the as.vector() here is still wrong, it should be 1 2 3 (since the ->R conversion on this Extension type should have 1 added to it)

# you need to register the type for Arrow C++ to keep the extension type
# slash metadata when it encounters it at the C++ level (import from C
# and reading files)
register_extension_type(korean_age())

tf <- tempfile()
write_feather(arrow_table(col = arr), tf)

tab <- read_feather(tf, as_data_frame = FALSE)

type(tab$col)
#> KoreanAge
#> KoreanAge <Korean Age, but stored as the western age value>

as.vector(tab$col)
#> [1] 0 1 2

I should probably export ExtensionArray and use ExtensionArray$create() rather than new_extension_array() since it's more arrow-ish to do that.

In other places we've exposed both, which I think isn't bad here (it's slightly more API we manage, but having the R6 stuff exposed makes it easier to extend, and having the foo_bar_baz looks nicer | is a bit more friendly to users|doesn't force someone to learn all the R6 if what they want or need is available there)...

Maybe ExtensionType$create() instead of new_extension_type() is where extension type creation should go, too.

Hmm, if we think it's ok to do the new_extension_type() inside of ExtensionType$create() I would agree that having a second step like that is unnecessary. Though with the proper docs having the R6 way of establishing these need to be slightly more verbose and you need to manage other things is inline with our other implementations of things like that (e.g. creating one's own filesystem, etc.)

@paleolimbot
Copy link
Member Author

Ah yes, you clearly should have remembered to implement KoreanAge$.chunked_array_as_vector() (...I'll implement a better default method for that one...)

@paleolimbot
Copy link
Member Author

Printing is a good point...definitely confusing in the case of an extension type!

I think we have to punt on the printing...there isn't a way to customize how other Array objects are printed (it all goes through ChunkedArray::ToString()/Array::ToString() at the C++ level and that is very specifically a PrettyPrint() C++ thing. We have some tools available in R to improve the printing of all ArrowTabular/ChunkedArray/Arrays, and maybe a PR with that as its scope would be more appropriate.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool!

Looking a little bit through the code and the test cases, I am wondering if what we call in python "parametrized" extension types are possible. There is a test case that ensures you can roundtrip an extension type with a different storage type as how it was registered. But those test examples only override Deserialize; is it also possible to override Serialize? (so you can store some custom field in there).

@paleolimbot
Copy link
Member Author

I think so! This example is probably better than the example I have in there right now because the serializing/deserializing of the metadata is a big part of the picture and the current documentation example only implements the array-to-r conversion. Check to make sure it's what you meant though!

I didn't implement this quite in the same way as the Python one...I think in Python the workflow (and correct me if I'm wrong) is along the lines of

  • Create a ExtensionTypeSubclass(with, parameters, like, this) instance
  • C++ calls the __arrow_ext_serialize__ method of the Python instance when the serialized metadata is needed

In R it's totally bananas to call from C++ into R and we can't do it safely most of the time. So instead I wrote it like:

  • Create the extension metadata before either the R or C++ instance is created
  • Create a C++ instance of RExtensionType() that contains the definitive copy of the serialized extension metadata
  • Create the ExtensionTypeSubclass R6 instance and then call the R6 instance's .Deserialize() method to populate data fields.

It isn't all that straightforward to do (the way I've implemented it in R) and I'm not sure I like how it's implemented (but I'm also not sure how to make it better).

Details
library(arrow, warn.conflicts = FALSE)

QuantizedType <- R6::R6Class(
  "QuantizedType", 
  inherit = ExtensionType,
  public = list(
    center = function() private$.center,
    scale = function() private$.scale,
    
    .array_as_vector = function(extension_array) {
      as.vector(extension_array$storage() / private$.scale + private$.center)
    },
    
    .Deserialize = function(storage_type, extension_name, extension_metadata) {
      parsed <- jsonlite::fromJSON(self$extension_metadata_utf8())
      private$.center <- as.double(parsed$center)
      private$.scale <- as.double(parsed$scale)
    }
  ),
  private = list(
    .center = NULL,
    .scale = NULL
  )
)

quantized <- function(center = 0, scale = 1, storage_type = int32()) {
  new_extension_type(
    storage_type = storage_type,
    extension_name = "arrow.example.quantized",
    extension_metadata = jsonlite::toJSON(
      list(
        center = jsonlite::unbox(as.double(center)),
        scale = jsonlite::unbox(as.double(scale))
      )
    ),
    type_class = QuantizedType
  )
}

quantized_array <- function(x, center = 0, scale = 1, 
                            storage_type = int32()) {
  type <- quantized(center, scale, storage_type)
  new_extension_array(
    Array$create((x - center) * scale, type = storage_type),
    type
  )
}

reregister_extension_type(quantized())

(vals <- runif(5, min = 19, max = 21))
#> [1] 19.33526 19.47467 19.14288 20.39798 19.04523

(array <- quantized_array(
  vals,
  center = 20,
  scale = 2 ^ 15 - 1,
  storage_type = int16())
)
#> ExtensionArray
#> <QuantizedType <{"center":20,"scale":32767}>>
#> [
#>   -21781,
#>   -17213,
#>   -28085,
#>   13040,
#>   -31284
#> ]

array$type$center()
#> [1] 20
array$type$scale()
#> [1] 32767

as.vector(array)
#> [1] 19.33528 19.47468 19.14289 20.39796 19.04526

@jonkeane
Copy link
Member

I haven't dug too deeply yet, but the QuantizedType example is fantastic. I really like that _ think it shows off what/how this is capable of doing in a way that the previous examples (both mine and the ones in the test) didn't quite fully get to

@paleolimbot
Copy link
Member Author

I like it too! I like that it isn't R specific, that it needs parameterization, and that transforming it back into an R vector needs a calculation. I didn't do any explaining in the docs about why quantized types are cool...I've used them in raster GIS stuff because they enable really efficient storage (in exchange for some precision loss).

@jorisvandenbossche
Copy link
Member

I think so! This example is probably better

That example indeed nicely shows that it is possible.
I suppose one thing that confused me is that new_extension_type(..) rather creates a new instance of a given type, and not a new type itself (and thus this allows you to create instances of a given extension type with different metadata).

It isn't all that straightforward to do (the way I've implemented it in R) and I'm not sure I like how it's implemented (but I'm also not sure how to make it better).

I think the fact that you already determine the serialized metadata upfront in R is fine / nice (the fact that in Python this is C++ calling back into python is kind of a complication, as the metadata could be known at the point when instantiating the python extension type instance).
One thing I was wondering: the reason you can't define a .Serialize method on the QuantizedType that can hold this logic to create the serialized metadata (instead of doing that inside new_extension_type( ..., extension_metadata=... )), is that this class instance is only created after the underlying C++ extension type, for which you already need the serialized metadata? It's not possible to first initialize the QuantizedType object, and let that call Serialize to then create the underlying C++ object? (disclaimer: I know nothing about R classes :))

@paleolimbot
Copy link
Member Author

Sorry Joris for the late reply!

is that this class instance is only created after the underlying C++ extension type, for which you already need the serialized metadata?

That's a really good question...it's might possible to do this the other way around, which would make it a little more intuitive. Creating the R6 instance from C++ (usually via automatically generated wrapper code) is the default...the default constructor errors if there isn't a valid one:

arrow/r/R/arrow-package.R

Lines 310 to 322 in 1b796ec

initialize = function(xp) self$set_pointer(xp),
pointer = function() get(".:xp:.", envir = self),
`.:xp:.` = NULL,
set_pointer = function(xp) {
if (!inherits(xp, "externalptr")) {
stop(
class(self)[1], "$new() requires a pointer as input: ",
"did you mean $create() instead?",
call. = FALSE
)
}
assign(".:xp:.", xp, envir = self)
},

The alternative would be to materialize the C++ object when it is required. I will play with that because I really don't like the dot prefix thing that I'm currently doing.

@paleolimbot
Copy link
Member Author

paleolimbot commented Mar 29, 2022

A few more modifications:

  • I moved conversion of extension types to R objects into the Converter API and removed the modifications to ChunkedArray, Table, and RecordBatch (Romain suggested this). This feels a lot better and is more likely to "just work" in more places.
  • I implemented the geoarrow side of this to make sure it will work. It does (except for a bit in one of the compute kernels where Concatenate doesn't work for extension types)! See details and Implement arrow::ExtensionType geoarrow/geoarrow-r#7
  • I did play with reversing the order of instantiation of the C++ and R6 objects...I think that change is a kind of a big one with respect to how all objects get passed around in our current implementation. I did, however, rename the methods to match the C++ method names and it feels a lot better now.
Details
# remotes::install_github("apache/arrow#12467")
# remotes::install_github("paleolimbot/geoarrow@arrow-ext-type")
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(geoarrow)

places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
places <- open_dataset(places_folder)
places$schema$geometry$type
#> GeoArrowType
#> point GEOGCS["WGS 84",DATUM["WGS_...
places$schema$geometry$type$crs
#> [1] "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],AXIS[\"Latitude\",NORTH]]"

# works!
Scanner$create(places)$ToTable()
#> Table
#> 7255 rows x 6 columns
#> $osm_id <string>
#> $code <int32>
#> $population <double>
#> $name <string>
#> $geometry <point GEOGCS["WGS 84",DATUM["WGS_...>
#> $fclass <string>
#> 
#> See $metadata for additional Schema metadata

# works!
as.data.frame(Scanner$create(places)$ToTable())
#> # A tibble: 7,255 × 6
#>    osm_id      code population name           geometry                    fclass
#>    <chr>      <int>      <dbl> <chr>          <wk_wkb>                    <chr> 
#>  1 21040334    1001      50781 Roskilde       <POINT (12.08192 55.64335)> city  
#>  2 21040360    1001      72398 Esbjerg        <POINT (8.452075 55.46649)> city  
#>  3 26559154    1001      62687 Randers        <POINT (10.03715 56.46175)> city  
#>  4 26559170    1001      60508 Kolding        <POINT (9.47905 55.4895)>   city  
#>  5 26559198    1001      56567 Vejle          <POINT (9.533324 55.70001)> city  
#>  6 26559213    1001     273077 Aarhus         <POINT (10.2134 56.14963)>  city  
#>  7 26559274    1001     178210 Odense         <POINT (10.38521 55.39972)> city  
#>  8 1368129781  1001      58646 Horsens        <POINT (9.844477 55.86117)> city  
#>  9 2247730880  1001     114194 Aalborg        <POINT (9.921526 57.04626)> city  
#> 10 393558713   1030          0 Englebjerggård <POINT (11.77737 55.2004)>  farm  
#> # … with 7,245 more rows

# unfortunately, this fails...
places %>% 
  filter(population > 100000) %>% 
  select(name, population, fclass, geometry) %>% 
  arrange(desc(population)) %>% 
  collect()
#> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
#> ! NotImplemented: concatenation of extension<geoarrow.point>
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195  VisitTypeInline(*out_->type, this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590  ConcatenateImpl(data, pool).Concatenate(&out_data)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025  Concatenate(values.chunks(), ctx->memory_pool())
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084  TakeCA(*table.column(j), indices, options, ctx)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:375  impl_->DoFinish()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()

# ...unless we unregister the extension type and use geoarrow_collect()
arrow::unregister_extension_type("geoarrow.point")
open_dataset(places_folder) %>% 
  filter(population > 100000) %>% 
  select(name, population, fclass, geometry) %>% 
  arrange(desc(population)) %>% 
  geoarrow_collect()
#> # A tibble: 5 × 4
#>   name          population fclass           geometry                   
#>   <chr>              <dbl> <chr>            <wk_wkb>                   
#> 1 København         613288 national_capital <POINT (12.57007 55.68672)>
#> 2 Aarhus            273077 city             <POINT (10.2134 56.14963)> 
#> 3 Odense            178210 city             <POINT (10.38521 55.39972)>
#> 4 Aalborg           114194 city             <POINT (9.921526 57.04626)>
#> 5 Frederiksberg     102029 suburb           <POINT (12.53262 55.67802)>

Created on 2022-03-29 by the reprex package (v2.0.1)

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, and agree this structure is much easier to follow + cleaner. A few minor comments, but I'm happy to see this merged in (and we still have a bit of time if we need to clean it up before the release)

r/R/extension.R Outdated
Comment on lines +374 to +383
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what the reason is for wrapping () here. I've seen it used to silence output, but both of these should already have no output, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it to explicitly print the output (that was more useful in the initial version, which was a reprex, than it is here in the example, although the printing part does affect the pkgdown output). Maybe best to remove it if it's confusing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, oh right it's the opposite of silencing 🤦. We can keep it in — in the actual examples on the pkgdown site it'll be obvious what's going on.

r/R/extension.R Outdated
Comment on lines 139 to 143
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is minor (and we might also use this terminology like this elsewhere), but we might add something extra to this name so that it's clear that this is not serializing the array but rather just the metadata information (and same for Deserialize

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stole these names from the C++, but we can rename them to something else (an earlier version of this PR used extension_metadata() and extension_metadata_utf8(); the Python version also uses some alternative names so there's certainly precedent).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, those names read much more naturally to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on the best name for Deserialize()? In Python this is __arrow_ext_deserialize__() (although does something slightly different)...maybe restore_extension()? restore()? It also could be left out (subclassers would be then forced to override initialize(xp), which might be more intuitive if the subclasser knows anything about R6).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deserialize_extension seems better than plain Deserialize (to me it reads more as being about the extension type itself and not the data...) Or maybe deserialize_ext_metadata?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Antoine suggested populate_instance(), although I think I like deserialize_extension() a bit better. I'll give it one more pass and see what feels the best.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with deserialize_instance() (we are deserializing, but maybe more clear that it's specific to this instance and not creating a new one?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hydrate & dehydrate are common terms I've seen for this type of operation too: https://stackoverflow.com/questions/6991135/what-does-it-mean-to-hydrate-an-object

@paleolimbot
Copy link
Member Author

I messaged Joris and Romain today asking for reviews...it's getting close but I think will benefit from their take!

@jonkeane jonkeane closed this in 489aada Apr 8, 2022
@ursabot
Copy link

ursabot commented Apr 9, 2022

Benchmark runs are scheduled for baseline = dd52b38 and contender = 489aada. 489aada is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.38% ⬆️0.0%] test-mac-arm
[Failed ⬇️2.5% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.68% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/477| 489aada5 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/462| 489aada5 test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/463| 489aada5 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/472| 489aada5 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/476| dd52b384 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/461| dd52b384 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/462| dd52b384 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/471| dd52b384 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@paleolimbot paleolimbot deleted the r-extension-type branch December 9, 2022 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants