ARROW-15471: [R] ExtensionType support in R #12467
jonkeane
left a comment
This approach looks solid, I don't have any specific comments (aside from a tiny style nit).
Would it be possible to include in the tests a more realistic example? It doesn't have to be as complicated as geo objects, but something that defines some data structure + shows how one might interact with it might be helpful.
This might be helpful or not, but my first thought was something kinda like haven::labelled_spss(). Not that we want or need to support that, or that we want to emulate SPSS. But it's a funny(ish) type that is clearly outside the scope of Arrow types, and it could have some interesting custom bits that might make the tests a bit more demonstrative of what this could do?
jonkeane
left a comment
A few comments, I would also love to get an opinion from @romainfrancois if he has time to take a look.
I also will pull this and mess around with it locally today or on Monday — but wanted to get these comments out before then.
r/R/arrow-tabular.R
Outdated
base here is a function that is either Table__to_dataframe or RecordBatch__to_dataframe yeah? Basically the constructor to be used if this isn't an extension type?
It might be nice to have a slightly more descriptive name for that?
r/R/arrow-tabular.R
Outdated
I'm worried about this implementation because it's unintuitive...this gets used by Table$to_data_frame() and RecordBatch$to_data_frame() because both of those call into C++ to do their thing (but the C++ implementation doesn't know about extension types. Maybe it should?). Pretty much everywhere else we avoid looping over columns in R, but that might be better than adding complexity at the C++ level?
r/src/extension.cpp
Outdated
It's true that I need to use std::shared_ptr<cpp11::environment> to store the r6_class_ field here instead of cpp11::environment to avoid a crash, but I'm not entirely sure I'm using std::shared_ptr correctly.
r/src/extension.cpp
Outdated
This is the main threading concern...the Deserialize() method gets called from other threads frequently but unless it's been passed through an R6 instance in R, we don't know if the metadata is valid or not.
r/R/extension.R
Outdated
I made up the "dot prefix means protected method" thing here...I don't know if there is a convention for "protected"-style methods in R6 but would be happy to use it if it exists.
r/src/extension.cpp
Outdated
This is another threading concern...the data types can't be checked for equality if the serialized data is not identical and the comparison occurs on another thread. I think this might happen when reading a multi-file dataset if some of the files were written differently.
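To illustrate the equality concern above, here is a minimal base-R sketch (not the PR's implementation). It assumes a made-up `key=value;key=value` metadata format and a hypothetical `parse_metadata()` helper. Two serializations can describe the same parameters yet differ byte for byte, so a comparison that cannot call back into R to parse them would treat the types as different.

```r
# Hypothetical helper: parse "key=value;key=value" metadata into a named,
# key-sorted numeric vector, so field order and number formatting
# no longer matter. The metadata format itself is an assumption.
parse_metadata <- function(raw_metadata) {
  fields <- strsplit(rawToChar(raw_metadata), ";", fixed = TRUE)[[1]]
  parts <- strsplit(fields, "=", fixed = TRUE)
  values <- vapply(parts, function(p) as.numeric(p[2]), numeric(1))
  names(values) <- vapply(parts, function(p) p[1], character(1))
  values[order(names(values))]
}

# The same parameters, written by two hypothetical writers
meta_a <- charToRaw("center=20;scale=32767")
meta_b <- charToRaw("scale=32767.0;center=20")

# Byte-wise comparison (all that is possible without parsing in R)
bytes_equal <- identical(meta_a, meta_b)

# Comparison after parsing (requires running R code)
params_equal <- identical(parse_metadata(meta_a), parse_metadata(meta_b))
```

Here `bytes_equal` is `FALSE` while `params_equal` is `TRUE`, which is the situation described for a multi-file dataset whose files were written differently.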
|
This is very exciting. Like I mentioned earlier, I wanted to try this out locally to see what this looks like. The example is a little contrived (and actually AFAIU, not totally accurate depending on the time of year!) Is it expected that roundtripping without the {vctrs} class wouldn't work? (Or did I do something wrong here?)
library(arrow, warn.conflicts = FALSE)
# Is this the minimal structure to create a custom class like this?
KoreanAge <- R6::R6Class(
"KoreanAge",
inherit = ExtensionType,
public = list(
.array_as_vector = function(extension_array) {
extension_array$storage()$as_vector() + 1
}
)
)
KoreanAge <- new_extension_type(
int32(),
"KoreanAge",
charToRaw("Korean Age, but stored as the western age value"),
type_class = KoreanAge
)
arr <- new_extension_array(c(0, 1, 2), KoreanAge)
# What we expect (storage + 1)
as.vector(arr)
#> [1] 1 2 3
# But roundtripping doesn't seem to work?
tf <- tempfile()
write_feather(arrow_table(col = arr), tf)
tab <- read_feather(tf, as_data_frame = FALSE)
type(tab$col)
#> Int32
#> int32
as.vector(tab$col)
#> [1] 0 1 2
Also, should we export Do we have a follow on for what to do about printing the array? You'll see here you print the underlying storage type, which might be fine, but that has confused some folks before.
|
The key step that was missing for the roundtrip was I should probably export Printing is a good point...definitely confusing in the case of an extension type!
library(arrow, warn.conflicts = FALSE)
KoreanAge <- R6::R6Class(
"KoreanAge",
inherit = ExtensionType,
public = list(
.array_as_vector = function(extension_array) {
extension_array$storage()$as_vector() + 1
}
)
)
# constructor helpers
korean_age <- function() {
new_extension_type(
int32(),
"KoreanAge",
charToRaw("Korean Age, but stored as the western age value"),
type_class = KoreanAge
)
}
korean_age_array <- function(age_korean) {
new_extension_array(age_korean - 1, korean_age())
}
(arr <- korean_age_array(1:3))
#> ExtensionArray
#> <KoreanAge <Korean Age, but stored as the western age value>>
#> [
#> 0,
#> 1,
#> 2
#> ]
as.vector(arr)
#> [1] 1 2 3
# you need to register the type for Arrow C++ to keep the extension type
# slash metadata when it encounters it at the C++ level (import from C
# and reading files)
register_extension_type(korean_age())
tf <- tempfile()
write_feather(arrow_table(col = arr), tf)
tab <- read_feather(tf, as_data_frame = FALSE)
type(tab$col)
#> KoreanAge
#> KoreanAge <Korean Age, but stored as the western age value>
as.vector(tab$col)
#> [1] 0 1 2
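One way to picture why registration changes the type reported after the roundtrip is a lookup table keyed by extension name. The base-R sketch below is a deliberately simplified stand-in, not how Arrow C++ is implemented. The names `registry`, `register_type()` and `read_type()` are hypothetical.

```r
# Hypothetical, simplified stand-in for an extension type registry:
# a lookup table keyed by extension name.
registry <- new.env()

register_type <- function(extension_name) {
  assign(extension_name, TRUE, envir = registry)
}

# "Reading" a column: a registered name is resolved to the extension type,
# otherwise only the storage type is reported.
read_type <- function(storage_type, extension_name) {
  if (exists(extension_name, envir = registry, inherits = FALSE)) {
    extension_name
  } else {
    storage_type
  }
}

before <- read_type("int32", "KoreanAge")  # type name not known yet
register_type("KoreanAge")
after <- read_type("int32", "KoreanAge")   # type name is now known
```

In this sketch `before` is the storage type and `after` is the extension name, mirroring the `int32` versus `KoreanAge` results of `type(tab$col)` in the two reprexes above.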
|
Aaaah, yeah I totally missed Though weirdly(?) the
In other places we've exposed both, which I think isn't bad here (it's slightly more API we manage, but having the R6 stuff exposed makes it easier to extend, and having the
Hmm, if we think it's ok to do the |
|
Ah yes, you clearly should have remembered to implement |
I think we have to punt on the printing...there isn't a way to customize how other Array objects are printed (it all goes through |
This is really cool!
Looking a little bit through the code and the test cases, I am wondering if what we call in python "parametrized" extension types are possible. There is a test case that ensures you can roundtrip an extension type with a different storage type as how it was registered. But those test examples only override Deserialize; is it also possible to override Serialize? (so you can store some custom field in there).
|
I think so! This example is probably better than the example I have in there right now because the serializing/deserializing of the metadata is a big part of the picture and the current documentation example only implements the array-to-r conversion. Check to make sure it's what you meant though! I didn't implement this quite in the same way as the Python one...I think in Python the workflow (and correct me if I'm wrong) is along the lines of
In R it's totally bananas to call from C++ into R and we can't do it safely most of the time. So instead I wrote it like:
It isn't all that straightforward to do (the way I've implemented it in R) and I'm not sure I like how it's implemented (but I'm also not sure how to make it better).
library(arrow, warn.conflicts = FALSE)
QuantizedType <- R6::R6Class(
"QuantizedType",
inherit = ExtensionType,
public = list(
center = function() private$.center,
scale = function() private$.scale,
.array_as_vector = function(extension_array) {
as.vector(extension_array$storage() / private$.scale + private$.center)
},
.Deserialize = function(storage_type, extension_name, extension_metadata) {
parsed <- jsonlite::fromJSON(self$extension_metadata_utf8())
private$.center <- as.double(parsed$center)
private$.scale <- as.double(parsed$scale)
}
),
private = list(
.center = NULL,
.scale = NULL
)
)
quantized <- function(center = 0, scale = 1, storage_type = int32()) {
new_extension_type(
storage_type = storage_type,
extension_name = "arrow.example.quantized",
extension_metadata = jsonlite::toJSON(
list(
center = jsonlite::unbox(as.double(center)),
scale = jsonlite::unbox(as.double(scale))
)
),
type_class = QuantizedType
)
}
quantized_array <- function(x, center = 0, scale = 1,
storage_type = int32()) {
type <- quantized(center, scale, storage_type)
new_extension_array(
Array$create((x - center) * scale, type = storage_type),
type
)
}
reregister_extension_type(quantized())
(vals <- runif(5, min = 19, max = 21))
#> [1] 19.33526 19.47467 19.14288 20.39798 19.04523
(array <- quantized_array(
vals,
center = 20,
scale = 2 ^ 15 - 1,
storage_type = int16())
)
#> ExtensionArray
#> <QuantizedType <{"center":20,"scale":32767}>>
#> [
#> -21781,
#> -17213,
#> -28085,
#> 13040,
#> -31284
#> ]
array$type$center()
#> [1] 20
array$type$scale()
#> [1] 32767
as.vector(array)
#> [1] 19.33528 19.47468 19.14289 20.39796 19.04526
|
I haven't dug too deeply yet, but the |
|
I like it too! I like that it isn't R specific, that it needs parameterization, and that transforming it back into an R vector needs a calculation. I didn't do any explaining in the docs about why quantized types are cool...I've used them in raster GIS stuff because they enable really efficient storage (in exchange for some precision loss). |
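The storage-versus-precision trade-off mentioned above can be sketched in base R without arrow. This uses the same arithmetic as the reprex: stored value `(x - center) * scale`, recovered value `stored / scale + center`. It assumes the cast to integer storage truncates toward zero, which is an assumption here and not something the thread confirms.

```r
# Values and parameters copied from the reprex above
vals <- c(19.33526, 19.47467, 19.14288, 20.39798, 19.04523)
center <- 20
scale <- 2^15 - 1

# Quantize: shift and scale, then drop the fractional part.
# trunc() is an assumption about how the integer cast behaves.
stored <- trunc((vals - center) * scale)

# De-quantize: undo the scale and the shift
recovered <- stored / scale + center

# Truncation discards less than one storage unit, so the absolute
# error per value stays below 1 / scale.
max_error <- max(abs(recovered - vals))
error_bound <- 1 / scale
```

Each value fits in a 16-bit integer because `abs(vals - center) <= 1`, so the storage is a quarter the size of a double at the cost of an error below `1 / scale`.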
That example indeed nicely shows that it is possible.
I think the fact that you already determine the serialized metadata upfront in R is fine / nice (the fact that in Python this is C++ calling back into python is kind of a complication, as the metadata could be known at the point when instantiating the python extension type instance). |
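The "metadata determined upfront" workflow can be illustrated with a small base-R roundtrip. The `center;scale` text format and the helpers `serialize_params()` and `deserialize_params()` are assumptions for illustration only. The reprex above uses JSON via jsonlite instead.

```r
# Hypothetical serializer: the parameters are turned into bytes once,
# in R, at the moment the type is constructed.
serialize_params <- function(center, scale) {
  charToRaw(paste(
    format(center, digits = 17),
    format(scale, digits = 17),
    sep = ";"
  ))
}

# Hypothetical counterpart: rebuild the fields from the bytes alone,
# as an instance would when it is restored from a file.
deserialize_params <- function(raw_metadata) {
  vals <- as.numeric(strsplit(rawToChar(raw_metadata), ";", fixed = TRUE)[[1]])
  list(center = vals[1], scale = vals[2])
}

metadata <- serialize_params(center = 20, scale = 2^15 - 1)
params <- deserialize_params(metadata)
```

Because the bytes are fixed at construction time, nothing needs to call from C++ back into R to produce them later. R code only runs when an instance parses its own metadata.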
|
Sorry Joris for the late reply!
That's a really good question...it might be possible to do this the other way around, which would make it a little more intuitive. Creating the R6 instance from C++ (usually via automatically generated wrapper code) is the default...the default constructor errors if there isn't a valid one: Lines 310 to 322 in 1b796ec
The alternative would be to materialize the C++ object when it is required. I will play with that because I really don't like the dot prefix thing that I'm currently doing. |
|
A few more modifications:
# remotes::install_github("apache/arrow#12467")
# remotes::install_github("paleolimbot/geoarrow@arrow-ext-type")
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(geoarrow)
places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
places <- open_dataset(places_folder)
places$schema$geometry$type
#> GeoArrowType
#> point GEOGCS["WGS 84",DATUM["WGS_...
places$schema$geometry$type$crs
#> [1] "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],AXIS[\"Latitude\",NORTH]]"
# works!
Scanner$create(places)$ToTable()
#> Table
#> 7255 rows x 6 columns
#> $osm_id <string>
#> $code <int32>
#> $population <double>
#> $name <string>
#> $geometry <point GEOGCS["WGS 84",DATUM["WGS_...>
#> $fclass <string>
#>
#> See $metadata for additional Schema metadata
# works!
as.data.frame(Scanner$create(places)$ToTable())
#> # A tibble: 7,255 × 6
#> osm_id code population name geometry fclass
#> <chr> <int> <dbl> <chr> <wk_wkb> <chr>
#> 1 21040334 1001 50781 Roskilde <POINT (12.08192 55.64335)> city
#> 2 21040360 1001 72398 Esbjerg <POINT (8.452075 55.46649)> city
#> 3 26559154 1001 62687 Randers <POINT (10.03715 56.46175)> city
#> 4 26559170 1001 60508 Kolding <POINT (9.47905 55.4895)> city
#> 5 26559198 1001 56567 Vejle <POINT (9.533324 55.70001)> city
#> 6 26559213 1001 273077 Aarhus <POINT (10.2134 56.14963)> city
#> 7 26559274 1001 178210 Odense <POINT (10.38521 55.39972)> city
#> 8 1368129781 1001 58646 Horsens <POINT (9.844477 55.86117)> city
#> 9 2247730880 1001 114194 Aalborg <POINT (9.921526 57.04626)> city
#> 10 393558713 1030 0 Englebjerggård <POINT (11.77737 55.2004)> farm
#> # … with 7,245 more rows
# unfortunately, this fails...
places %>%
filter(population > 100000) %>%
select(name, population, fclass, geometry) %>%
arrange(desc(population)) %>%
collect()
#> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
#> ! NotImplemented: concatenation of extension<geoarrow.point>
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195 VisitTypeInline(*out_->type, this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590 ConcatenateImpl(data, pool).Concatenate(&out_data)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025 Concatenate(values.chunks(), ctx->memory_pool())
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084 TakeCA(*table.column(j), indices, options, ctx)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:375 impl_->DoFinish()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484 iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337 ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351 ToRecordBatches()
# ...unless we unregister the extension type and use geoarrow_collect()
arrow::unregister_extension_type("geoarrow.point")
open_dataset(places_folder) %>%
filter(population > 100000) %>%
select(name, population, fclass, geometry) %>%
arrange(desc(population)) %>%
geoarrow_collect()
#> # A tibble: 5 × 4
#> name population fclass geometry
#> <chr> <dbl> <chr> <wk_wkb>
#> 1 København 613288 national_capital <POINT (12.57007 55.68672)>
#> 2 Aarhus 273077 city <POINT (10.2134 56.14963)>
#> 3 Odense 178210 city <POINT (10.38521 55.39972)>
#> 4 Aalborg 114194 city <POINT (9.921526 57.04626)>
#> 5 Frederiksberg 102029 suburb <POINT (12.53262 55.67802)>
Created on 2022-03-29 by the reprex package (v2.0.1)
jonkeane
left a comment
This looks great, and I agree this structure is much easier to follow and cleaner. A few minor comments, but I'm happy to see this merged in (and we still have a bit of time if we need to clean it up before the release).
r/R/extension.R
Outdated
I'm curious what the reason is for wrapping () here. I've seen it used to silence output, but both of these should already have no output, right?
I did it to explicitly print the output (that was more useful in the initial version, which was a reprex, than it is here in the example, although the printing part does affect the pkgdown output). Maybe best to remove it if it's confusing?
Oh, oh right it's the opposite of silencing 🤦. We can keep it in — in the actual examples on the pkgdown site it'll be obvious what's going on.
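For readers unfamiliar with the idiom discussed in this thread, the base-R sketch below shows the effect of wrapping an assignment in `()`. The variable names are illustrative only.

```r
# An assignment returns its value invisibly, so nothing is auto-printed
plain <- withVisible(x <- 1:3)

# Wrapping the same expression in parentheses makes the value visible,
# so it is printed at top level (and in reprex/pkgdown output)
wrapped <- withVisible((y <- 1:3))

# capture.output() records what would have been printed
printed <- capture.output((z <- 1:3))
```

`plain$visible` is `FALSE` and `wrapped$visible` is `TRUE`, so the parentheses do the opposite of silencing: they assign and print in one step.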
r/R/extension.R
Outdated
This is minor (and we might also use this terminology elsewhere), but we might add something extra to this name so that it's clear that this is not serializing the array but rather just the metadata information (and the same for Deserialize).
I stole these names from the C++, but we can rename them to something else (an earlier version of this PR used extension_metadata() and extension_metadata_utf8(); the Python version also uses some alternative names so there's certainly precedent).
Yeah, those names read much more naturally to me
Any thoughts on the best name for Deserialize()? In Python this is __arrow_ext_deserialize__() (although it does something slightly different)...maybe restore_extension()? restore()? It also could be left out (subclassers would then be forced to override initialize(xp), which might be more intuitive if the subclasser knows anything about R6).
deserialize_extension seems better than plain Deserialize (to me it reads more as being about the extension type itself and not the data...) Or maybe deserialize_ext_metadata?
Antoine suggested populate_instance(), although I think I like deserialize_extension() a bit better. I'll give it one more pass and see what feels the best.
I went with deserialize_instance() (we are deserializing, but maybe more clear that it's specific to this instance and not creating a new one?)
hydrate & dehydrate are common terms I've seen for this type of operation too: https://stackoverflow.com/questions/6991135/what-does-it-mean-to-hydrate-an-object
|
I messaged Joris and Romain today asking for reviews...it's getting close but I think will benefit from their take! |
|
Benchmark runs are scheduled for baseline = dd52b38 and contender = 489aada. 489aada is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR implements extension type support and registration in the R bindings (as has been possible in the Python bindings for some time). The details still need to be worked out, but we at least have a working pattern:
Created on 2022-02-18 by the reprex package (v2.0.1)