Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unnest_tree() #1386

Closed
wants to merge 8 commits into from
Closed

Add unnest_tree() #1386

wants to merge 8 commits into from

Conversation

mgirlich
Copy link
Contributor

@mgirlich mgirlich commented Aug 25, 2022

Closes #1384.

@DavisVaughan Would you be interested in adding this to tidyr? If so, what about the inverse operation nest_tree()?

# unclass to avoid slow `[[.tbl_df` and `[[<-.tbl_df`
x <- unclass(x)
child_children <- x[[child_col]]
if (inherits(child_children, "vctrs_list_of")) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the vctrs_list_of class has huge impact on performance:

After removing:

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
test_tree <- readr::read_rds("~/GitHub/tidyr/test-tree.rds")
bench::mark(
  unnest_tree = unnest_tree(
    test_tree,
    id,
    children
  )
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest_tree    170ms    180ms      5.27    3.97MB     7.02

Created on 2022-08-25 with reprex v2.0.2

Before removing:

#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest_tree    26.9s    26.9s    0.0372     508MB     7.62

Created on 2022-08-25 with reprex v2.0.2

}

# unclass to avoid slow `[[.tbl_df` and `[[<-.tbl_df`
x <- unclass(x)
Copy link
Contributor Author

@mgirlich mgirlich Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not quite as bad as the list_of performance hit but still a lot. Without unclassing this takes

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
test_tree <- readr::read_rds("~/GitHub/tidyr/test-tree.rds")
bench::mark(
  unnest_tree = unnest_tree(
    test_tree,
    id,
    children
  )
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest_tree    425ms    428ms      2.34    3.89MB     7.01

Created on 2022-08-25 with reprex v2.0.2

@DavisVaughan
Copy link
Member

I won't be able to take a look for a while, but it seems interesting at first glance

BTW I just merged #1387 so you should be able to rebase/merge-main so the checks can pass (except some weirdness with the pkgdown one right now)

@mgirlich
Copy link
Contributor Author

Thanks for the quick feedback. I'm fine with waiting for a review, already good to know that you consider it interesting. Some questions for the future:

  • Must all children have the same columns?
  • Should it really be an error if column children does not exist at top level?
  • What if child has column level_to, parent_to, or ancestors_to?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rectangling tools for tree like data frames
3 participants