ARROW-8374 [R]: Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array #7645

romainfrancois · 2020-07-06T15:35:14Z

This needs some testing:

library(arrow, warn.conflicts = FALSE)

f1 <- factor(c("a"), levels = c("a", "b"))
f2 <- factor(c("c"), levels = c("c", "d"))

ca <- ChunkedArray$create(f1, f2)
ca$as_vector()
#> [1] a c
#> Levels: a b c d
ca$type
#> DictionaryType
#> dictionary<values=string, indices=int8>

tab <- Table$create(
  record_batch(f = f1), 
  record_batch(f = f2)
)
tab
#> Table
#> 2 rows x 1 columns
#> $f <dictionary<values=string, indices=int8>>
df <- as.data.frame(tab)
df
#> # A tibble: 2 x 1
#>   f    
#>   <fct>
#> 1 a    
#> 2 c
df$f
#> [1] a c
#> Levels: a b c d

Created on 2020-07-06 by the reprex package (v0.3.0.9001)

github-actions · 2020-07-06T15:41:46Z

https://issues.apache.org/jira/browse/ARROW-8374

romainfrancois · 2020-07-07T08:06:46Z

Is there any reason why ChunkedArray$print() does not use the ToString() C++ method? It currently prints only the first chunk. @nealrichardson

library(arrow, warn.conflicts = FALSE)

f1 <- factor(c("a", "a"), levels = c("a", "b"))
f2 <- factor(c("c"), levels = c("c", "d"))
f3 <- factor(NA, levels = c("d"))

ca <- ChunkedArray$create(f1, f2, f3)
ca
#> ChunkedArray
#> <dictionary<values=string, indices=int8>>
#> 
#> -- dictionary:
#>   [
#>     "a",
#>     "b"
#>   ]
#> -- indices:
#>   [
#>     0,
#>     0
#>   ]

Created on 2020-07-07 by the reprex package (v0.3.0.9001)

I have a stashed commit that prints every chunk instead:

library(arrow, warn.conflicts = FALSE)

f1 <- factor(c("a", "a"), levels = c("a", "b"))
f2 <- factor(c("c"), levels = c("c", "d"))
f3 <- factor(NA, levels = c("d"))

ca <- ChunkedArray$create(f1, f2, f3)
ca
#> ChunkedArray
#> [
#> 
#>   -- dictionary:
#>     [
#>       "a",
#>       "b"
#>     ]
#>   -- indices:
#>     [
#>       0,
#>       0
#>     ],
#> 
#>   -- dictionary:
#>     [
#>       "c",
#>       "d"
#>     ]
#>   -- indices:
#>     [
#>       0
#>     ],
#> 
#>   -- dictionary:
#>     [
#>       "d"
#>     ]
#>   -- indices:
#>     [
#>       null
#>     ]
#> ]

Created on 2020-07-07 by the reprex package (v0.3.0.9001)

I can move this to a separate Jira issue and PR, though.

Independently, should the printed dictionary be the unified one? For now, this PR only unifies on conversion back to R, but I am not sure that is right.

romainfrancois · 2020-07-07T09:41:49Z

In other words, when creating a chunked array from a list of R factors, should the dictionary be unified and shared across the arrays of the chunked array?

romainfrancois · 2020-07-07T10:03:25Z

I think we can leave this for a follow-up:

    // R factor levels must be type "character" so coerce `dict` to STRSXP
    // TODO (npr): this coercion should be optional, "dictionariesAsFactors" ;)
    // Alternative: preserve the logical type of the dictionary values
    // (e.g. if dict is timestamp, return a POSIXt R vector, not factor)

as this will require some additional vctrs work to implement a generic R representation for dictionaries that are not factors.

nealrichardson · 2020-07-07T17:20:26Z

Re: ChunkedArray print method, git blame says it was introduced in #5492. I would guess that I added a custom method so that the printing wouldn't explode off the screen if you have a big array. Or maybe so that the internal chunking details weren't exposed since that's not always helpful. Ok with me if you want to change it, but I wouldn't unify the dictionaries in the print method--if you're trying to show more about the internals of what's in the array, show what's actually there.

Re: dictionariesAsFactors, that's ARROW-7657, fine to keep it out of scope here.

romainfrancois · 2020-07-08T08:08:21Z

Fair enough. I now remember that levels (or whatever they are called in Arrow) used to be part of the type but no longer are, so there is no unification until the chunks need to be converted back to a single R factor. At that point unification is required, and that is the business of this PR.

I will apply my stash and let Arrow handle the printing.
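To make the "unify on conversion" step concrete, here is a minimal standalone sketch of what it amounts to. It is not the code in this PR: `DictChunk`, `UnifiedFactor` and `UnifyChunks` are hypothetical names, it uses only the C++ standard library rather than Arrow's own types, and it assumes string dictionaries with -1 standing in for a null index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one chunk of a dictionary-encoded column:
// indices[i] points into `dictionary`; -1 marks a null.
struct DictChunk {
  std::vector<std::string> dictionary;
  std::vector<int> indices;
};

// Result of unification: one shared set of levels and the concatenated
// indices rewritten against it (0-based here; R factor codes are 1-based).
struct UnifiedFactor {
  std::vector<std::string> levels;
  std::vector<int> indices;
};

UnifiedFactor UnifyChunks(const std::vector<DictChunk>& chunks) {
  UnifiedFactor out;
  std::unordered_map<std::string, int> position;  // level -> unified index

  for (const DictChunk& chunk : chunks) {
    // Transpose map for this chunk: old index -> unified index.
    std::vector<int> transpose;
    transpose.reserve(chunk.dictionary.size());
    for (const std::string& value : chunk.dictionary) {
      auto it = position.find(value);
      if (it == position.end()) {
        // First time this level is seen: append it to the unified levels.
        it = position.emplace(value, static_cast<int>(out.levels.size())).first;
        out.levels.push_back(value);
      }
      transpose.push_back(it->second);
    }
    // Rewrite this chunk's indices through its transpose map.
    for (int index : chunk.indices) {
      if (index < 0) {
        out.indices.push_back(-1);  // nulls stay null
      } else {
        assert(static_cast<std::size_t>(index) < transpose.size());
        out.indices.push_back(transpose[static_cast<std::size_t>(index)]);
      }
    }
  }
  return out;
}
```

With the three factors from the reprex above (dictionaries `a, b`, `c, d` and `d`), this yields the levels `a, b, c, d`, which matches the levels R reports after conversion. The chunks themselves are left untouched, so each one keeps its own dictionary until a single factor is actually needed.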

Perhaps at some point we can have something similar to dplyr::glimpse() to print things more succinctly ...

nealrichardson

Tests make sense to me but I'm not sure I can properly review the Rcpp changes. Would also be good for @wesm to review and make sure that this covers all of the requirements.

nealrichardson · 2020-07-08T17:05:55Z

r/src/array_to_vector.cpp


  Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                           R_xlen_t start, R_xlen_t n) const {
+                           R_xlen_t start, R_xlen_t n, size_t array_index) const {


Why do these unrelated converters need an additional arg?

Because this is only called from Converter::Ingest_some_nulls(), so they all need to have the same interface.

Perhaps the dispatch could instead be done by arrow::VisitTypeInline(), but I'd rather do that in a follow-up.

I think it's better to have a uniform interface. I would like a comment indicating that most implementations will ignore the chunk_index and that (IIUC) only the dictionary conversion path currently uses it.
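As an illustration of the uniform interface discussed in this thread, here is a simplified standalone sketch. The `SketchConverter` classes are hypothetical and are not the converters in `array_to_vector.cpp`; they only show the shape: every implementation accepts `chunk_index`, most ignore it, and the dictionary path uses it to pick the transpose map for the chunk being read. Null handling is left out for brevity.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified converter interface. Every implementation takes
// chunk_index so the caller can invoke any of them with the same call.
class SketchConverter {
 public:
  virtual ~SketchConverter() = default;
  virtual void Ingest(const std::vector<int>& chunk, std::size_t chunk_index,
                      std::vector<int>* out) const = 0;
};

// Typical case: values are copied as-is and chunk_index is ignored.
class SketchIntConverter : public SketchConverter {
 public:
  void Ingest(const std::vector<int>& chunk, std::size_t /*chunk_index*/,
              std::vector<int>* out) const override {
    out->insert(out->end(), chunk.begin(), chunk.end());
  }
};

// Dictionary case: chunk_index selects the per-chunk transpose map that
// rewrites this chunk's indices against the unified dictionary.
class SketchDictConverter : public SketchConverter {
 public:
  explicit SketchDictConverter(std::vector<std::vector<int>> transposes)
      : transposes_(std::move(transposes)) {}

  void Ingest(const std::vector<int>& chunk, std::size_t chunk_index,
              std::vector<int>* out) const override {
    // at() throws std::out_of_range on a bad chunk or dictionary index.
    const std::vector<int>& transpose = transposes_.at(chunk_index);
    for (int index : chunk) {
      out->push_back(transpose.at(static_cast<std::size_t>(index)));
    }
  }

 private:
  std::vector<std::vector<int>> transposes_;
};
```

The cost of this design is an unused parameter in most implementations; the benefit is that the calling loop does not need to know which converter it is driving.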

bkietz

This approach makes sense to me; we only unify dictionaries when necessary due to conversion to R factors while ChunkedArrays maintain the capacity to store per-chunk dictionaries. A few changes for clarity:

bkietz · 2020-07-09T16:00:15Z

r/src/array_to_vector.cpp


  Status Ingest_some_nulls(SEXP data, const std::shared_ptr<arrow::Array>& array,
-                           R_xlen_t start, R_xlen_t n) const {
+                           R_xlen_t start, R_xlen_t n, size_t array_index) const {


I think it's better to have a uniform interface. I would like a comment indicating that most implementations will ignore the chunk_index and that (IIUC) only the dictionary conversion path currently uses it.

Co-authored-by: Benjamin Kietzman

nealrichardson · 2020-07-10T15:13:17Z

Thanks!

romainfrancois added 4 commits July 6, 2020 10:42

simplifying symcols.cpp with preserved_strings()

cd76211

cache factor and ordered class vectors

463a6db

+ ConverterDictionary::NeedUnification()

788ea41

unifying dictionary arrays

b6b4f1b

nealrichardson requested a review from wesm July 6, 2020 22:43

+ tests

0599190

romainfrancois added 3 commits July 8, 2020 10:37

using internal ChunkedArray::ToString() method

a26a106

using testthat::verify_output() for checking ChunkedArray printed output

af1c6d8

exclude testthat/test-*.txt files

5ed17d8

nealrichardson approved these changes Jul 8, 2020

bkietz requested changes Jul 9, 2020

romainfrancois and others added 5 commits July 10, 2020 10:55

Update r/src/array_to_vector.cpp

29aa4ce

Co-authored-by: Benjamin Kietzman

s/array_index/chunk_index/

55fc842

Merge branch 'master' into ARROW-8374/Dictionary_unification

9219a23

comment about chunk_index being only relevant for dictionary arrays

78e4905

lint

61c7a65

nealrichardson closed this in c02ea96 Jul 10, 2020

asfimport mentioned this pull request Jul 10, 2020

[R] Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array #24558

Closed
