Skip to content

Conversation

@romainfrancois
Copy link
Contributor

This needs some testing:

library(arrow, warn.conflicts = FALSE)

f1 <- factor(c("a"), levels = c("a", "b"))
f2 <- factor(c("c"), levels = c("c", "d"))

ca <- ChunkedArray$create(f1, f2)
ca$as_vector()
#> [1] a c
#> Levels: a b c d
ca$type
#> DictionaryType
#> dictionary<values=string, indices=int8>

tab <- Table$create(
  record_batch(f = f1), 
  record_batch(f = f2)
)
tab
#> Table
#> 2 rows x 1 columns
#> $f <dictionary<values=string, indices=int8>>
df <- as.data.frame(tab)
df
#> # A tibble: 2 x 1
#>   f    
#>   <fct>
#> 1 a    
#> 2 c
df$f
#> [1] a c
#> Levels: a b c d

Created on 2020-07-06 by the reprex package (v0.3.0.9001)

@github-actions
Copy link

github-actions bot commented Jul 6, 2020

@nealrichardson nealrichardson requested a review from wesm July 6, 2020 22:43
@romainfrancois
Copy link
Contributor Author

Any reason why ChunkedArray$print() does not use the ToString() C++ method ? @nealrichardson

library(arrow, warn.conflicts = FALSE)

f1 <- factor(c("a", "a"), levels = c("a", "b"))
f2 <- factor(c("c"), levels = c("c", "d"))
f3 <- factor(NA, levels = c("d"))

ca <- ChunkedArray$create(f1, f2, f3)
ca
#> ChunkedArray
#> <dictionary<values=string, indices=int8>>
#> 
#> -- dictionary:
#>   [
#>     "a",
#>     "b"
#>   ]
#> -- indices:
#>   [
#>     0,
#>     0
#>   ]

Created on 2020-07-07 by the reprex package (v0.3.0.9001)

I have a stashed commit that makes this:

library(arrow, warn.conflicts = FALSE)

f1 <- factor(c("a", "a"), levels = c("a", "b"))
f2 <- factor(c("c"), levels = c("c", "d"))
f3 <- factor(NA, levels = c("d"))

ca <- ChunkedArray$create(f1, f2, f3)
ca
#> ChunkedArray
#> [
#> 
#>   -- dictionary:
#>     [
#>       "a",
#>       "b"
#>     ]
#>   -- indices:
#>     [
#>       0,
#>       0
#>     ],
#> 
#>   -- dictionary:
#>     [
#>       "c",
#>       "d"
#>     ]
#>   -- indices:
#>     [
#>       0
#>     ],
#> 
#>   -- dictionary:
#>     [
#>       "d"
#>     ]
#>   -- indices:
#>     [
#>       null
#>     ]
#> ]

Created on 2020-07-07 by the reprex package (v0.3.0.9001)

I can put this on another jira/pr though.

Independently, should the printed dictionary be the unified one ? For now, this PR only unifies on conversion back to R, but that does not seem right ?

@romainfrancois
Copy link
Contributor Author

In other words, when creating a chunked array from a list of factors from R, should the dictionary be unified and shared across the arrays of the chunked array ?

@romainfrancois
Copy link
Contributor Author

I think we can leave this for a follow up:

    // R factor levels must be type "character" so coerce `dict` to STRSXP
    // TODO (npr): this coercion should be optional, "dictionariesAsFactors" ;)
    // Alternative: preserve the logical type of the dictionary values
    // (e.g. if dict is timestamp, return a POSIXt R vector, not factor)

as this will require some additional vctrs effort to implement some sort of generic R representation for dictionaries that are not factors

@nealrichardson
Copy link
Member

Re: ChunkedArray print method, git blame says it was introduced in #5492. I would guess that I added a custom method so that the printing wouldn't explode off the screen if you have a big array. Or maybe so that the internal chunking details weren't exposed since that's not always helpful. Ok with me if you want to change it, but I wouldn't unify the dictionaries in the print method--if you're trying to show more about the internals of what's in the array, show what's actually there.

Re: dictionariesAsFactors, that's ARROW-7657, fine to keep it out of scope here.

@romainfrancois
Copy link
Contributor Author

Fair enough, I now remember that levels (or whatever they are called in arrow) used to be part of the type, but not anymore, so no unification until they need to be converted back to a single R factor, which is then absolutely needed and the business of this PR.

I will apply my stash, and so let arrow deal with the printing.

Perhaps at some point we can have something similar to dplyr::glimpse() to print things more succinctly ...

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests make sense to me but I'm not sure I can properly review the Rcpp changes. Would also be good for @wesm to review and make sure that this covers all of the requirements.


Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
R_xlen_t start, R_xlen_t n) const {
R_xlen_t start, R_xlen_t n, size_t array_index) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these unrelated converters need an additional arg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this only gets called from Converter::Ingest_some_nulls() so they all need to have the same interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the dispatch could rather be done by arrow::VisitTypeInline() but I'd rather do this in a follow up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to have a uniform interface. I would like a comment indicating that most implementations will ignore the chunk_index and that (IIUC) only the dictionary conversion path currently uses it.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach makes sense to me; we only unify dictionaries when necessary due to conversion to R factors while ChunkedArrays maintain the capacity to store per-chunk dictionaries. A few changes for clarity:


Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
R_xlen_t start, R_xlen_t n) const {
R_xlen_t start, R_xlen_t n, size_t array_index) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to have a uniform interface. I would like a comment indicating that most implementations will ignore the chunk_index and that (IIUC) only the dictionary conversion path currently uses it.

@nealrichardson
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants