ARROW-8374 [R]: Table to vector of DictionaryType will error when Arrays don't have the same Dictionary per array #7645
Any reason why

library(arrow, warn.conflicts = FALSE)
f1 <- factor(c("a", "a"), levels = c("a", "b"))
f2 <- factor(c("c"), levels = c("c", "d"))
f3 <- factor(NA, levels = c("d"))
ca <- ChunkedArray$create(f1, f2, f3)
ca
#> ChunkedArray
#> <dictionary<values=string, indices=int8>>
#>
#> -- dictionary:
#> [
#> "a",
#> "b"
#> ]
#> -- indices:
#> [
#> 0,
#> 0
#> ]

Created on 2020-07-07 by the reprex package (v0.3.0.9001)

I have a stashed commit that makes this:

library(arrow, warn.conflicts = FALSE)
f1 <- factor(c("a", "a"), levels = c("a", "b"))
f2 <- factor(c("c"), levels = c("c", "d"))
f3 <- factor(NA, levels = c("d"))
ca <- ChunkedArray$create(f1, f2, f3)
ca
#> ChunkedArray
#> [
#>
#> -- dictionary:
#> [
#> "a",
#> "b"
#> ]
#> -- indices:
#> [
#> 0,
#> 0
#> ],
#>
#> -- dictionary:
#> [
#> "c",
#> "d"
#> ]
#> -- indices:
#> [
#> 0
#> ],
#>
#> -- dictionary:
#> [
#> "d"
#> ]
#> -- indices:
#> [
#> null
#> ]
#> ]

Created on 2020-07-07 by the reprex package (v0.3.0.9001)

I can put this on another Jira/PR though. Independently, should the printed dictionary be the unified one? For now, this PR only unifies on conversion back to R, but that does not seem right?

In other words, when creating a chunked array from a list of factors from R, should the dictionary be unified and shared across the arrays of the chunked array?

I think we can leave this for a follow up: as this will require some additional

Re: ChunkedArray print method, Re: dictionariesAsFactors, that's ARROW-7657, fine to keep it out of scope here.

Fair enough, I now remember that levels (or whatever they are called in arrow) used to be part of the type, but not anymore, so no unification until they need to be converted back to a single R factor, which is then absolutely needed and the business of this PR. I will apply my stash, and so let arrow deal with the printing. Perhaps at some point we can have something similar to
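
To make the "unify only when converting back to a single R factor" idea concrete, here is a minimal standalone sketch. It does not use the Arrow or Rcpp APIs and is not the code in this PR: `DictChunk`, `UnifiedFactor`, `UnifyChunks` and the `kNull` sentinel are hypothetical names chosen for illustration, and levels are assumed to be collected in order of first appearance. Each chunk keeps its own dictionary; at conversion time a per-chunk transpose map (local index to unified index) is built and applied to that chunk's indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sentinel for a null (NA) index; not an Arrow constant.
constexpr int32_t kNull = -1;

// One dictionary-encoded chunk: `indices` point into `dictionary`.
struct DictChunk {
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
};

// Result of unification: one shared set of levels plus a single index
// vector spanning all chunks (0-based here; R factor codes are 1-based).
struct UnifiedFactor {
  std::vector<std::string> levels;
  std::vector<int32_t> indices;
};

UnifiedFactor UnifyChunks(const std::vector<DictChunk>& chunks) {
  UnifiedFactor out;
  std::unordered_map<std::string, int32_t> position;  // value -> unified index

  for (const DictChunk& chunk : chunks) {
    // Transpose map for this chunk: local index -> unified index.
    std::vector<int32_t> transpose;
    transpose.reserve(chunk.dictionary.size());
    for (const std::string& value : chunk.dictionary) {
      auto it = position.find(value);
      if (it == position.end()) {
        // First time this value is seen: append it to the unified levels.
        it = position.emplace(value, static_cast<int32_t>(out.levels.size())).first;
        out.levels.push_back(value);
      }
      transpose.push_back(it->second);
    }

    // Remap this chunk's indices; nulls stay null.
    for (int32_t index : chunk.indices) {
      if (index == kNull) {
        out.indices.push_back(kNull);
        continue;
      }
      assert(index >= 0 && static_cast<std::size_t>(index) < transpose.size());
      out.indices.push_back(transpose[static_cast<std::size_t>(index)]);
    }
  }
  return out;
}
```

Fed the three factors from the reprex above (levels `a, b`, then `c, d`, then `d` with a single NA), this sketch yields the levels `a, b, c, d` and the indices `0, 0, 2, null`. The chunks themselves are left untouched, which matches the point that a ChunkedArray can keep per-chunk dictionaries.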
nealrichardson
left a comment
Tests make sense to me but I'm not sure I can properly review the Rcpp changes. Would also be good for @wesm to review and make sure that this covers all of the requirements.
r/src/array_to_vector.cpp
Outdated
      Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
    -                          R_xlen_t start, R_xlen_t n) const {
    +                          R_xlen_t start, R_xlen_t n, size_t array_index) const {
Why do these unrelated converters need an additional arg?
Because this only gets called from Converter::Ingest_some_nulls(), they all need to have the same interface.
Perhaps the dispatch could instead be done by arrow::VisitTypeInline(), but I'd rather do this in a follow-up.
I think it's better to have a uniform interface. I would like a comment indicating that most implementations will ignore the chunk_index and that (IIUC) only the dictionary conversion path currently uses it.
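
A rough illustration of the uniform-interface point, as a standalone sketch rather than the actual converters in r/src/array_to_vector.cpp: `Chunk`, `Converter`, `PlainConverter`, `DictionaryConverter`, `IngestChunk` and `IngestAll` are hypothetical names, and the transpose maps are assumed to have been computed beforehand. Every converter receives `chunk_index` because there is a single call site, but only the dictionary path reads it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one chunk of a chunked array.
struct Chunk {
  std::vector<int32_t> values;
};

class Converter {
 public:
  virtual ~Converter() = default;

  // Uniform signature: every converter gets chunk_index. Most implementations
  // ignore it; only the dictionary conversion path currently uses it.
  virtual void IngestChunk(std::vector<int32_t>& out, const Chunk& chunk,
                           std::size_t start, std::size_t chunk_index) const = 0;

  // Single call site, which is why all converters share one interface.
  std::vector<int32_t> IngestAll(const std::vector<Chunk>& chunks) const {
    std::size_t total = 0;
    for (const Chunk& chunk : chunks) total += chunk.values.size();

    std::vector<int32_t> out(total);
    std::size_t start = 0;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
      IngestChunk(out, chunks[i], start, i);
      start += chunks[i].values.size();
    }
    return out;
  }
};

// Copies values as they are; the chunk index is deliberately unused.
class PlainConverter : public Converter {
 public:
  void IngestChunk(std::vector<int32_t>& out, const Chunk& chunk,
                   std::size_t start, std::size_t /*chunk_index*/) const override {
    assert(start + chunk.values.size() <= out.size());
    for (std::size_t j = 0; j < chunk.values.size(); ++j) {
      out[start + j] = chunk.values[j];
    }
  }
};

// Remaps per-chunk dictionary indices through that chunk's transpose map.
class DictionaryConverter : public Converter {
 public:
  explicit DictionaryConverter(std::vector<std::vector<int32_t>> transpose_maps)
      : transpose_maps_(std::move(transpose_maps)) {}

  void IngestChunk(std::vector<int32_t>& out, const Chunk& chunk,
                   std::size_t start, std::size_t chunk_index) const override {
    assert(start + chunk.values.size() <= out.size());
    // at() throws std::out_of_range on a bad chunk or dictionary index.
    const std::vector<int32_t>& map = transpose_maps_.at(chunk_index);
    for (std::size_t j = 0; j < chunk.values.size(); ++j) {
      out[start + j] = map.at(static_cast<std::size_t>(chunk.values[j]));
    }
  }

 private:
  std::vector<std::vector<int32_t>> transpose_maps_;  // one map per chunk
};
```

The trade-off is the one raised in this thread: an unused parameter in most converters in exchange for a single dispatch point. Dispatching by type instead would avoid the extra argument but changes the calling structure, which is why it was suggested as a follow-up.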
bkietz
left a comment
This approach makes sense to me; we only unify dictionaries when necessary due to conversion to R factors while ChunkedArrays maintain the capacity to store per-chunk dictionaries. A few changes for clarity:
I think it's better to have a uniform interface. I would like a comment indicating that most implementations will ignore the chunk_index and that (IIUC) only the dictionary conversion path currently uses it.
Thanks!
This needs some testing:
Created on 2020-07-06 by the reprex package (v0.3.0.9001)