group_split #97

B0ydT · 2023-10-15T09:13:43Z

I've got a group_split implementation working for row and column data. I can work on group_by if you're happy with it. If not, let me know where it needs work.

I was a bit stuck overthinking some decisions but just decided to go for it, so I'm very happy to change things up. For example, I wasn't sure whether it would be preferable to specify grouping by row or column explicitly, or to autodetect it as I ended up doing.

Issue #71

stemangiola · 2023-10-15T13:41:23Z

Does the function work with an arbitrary set of variables?

stemangiola · 2023-10-16T16:29:23Z

tests/testthat/test-dplyr_methods.R

+    expect_equal(length(fd), length(unique(df$groups)))
+
+    fd <- df |> 
+    group_split("vst.variable")


One question: what is vst.variable? I cannot find it in the object metadata.

It's in the rowData. I realise now that code wasn't doing what I thought it did and that it wouldn't have been helpful if it had!

B0ydT · 2023-10-21T09:57:28Z

Does the function work with an arbitrary set of variables?

It does now!

I seem to be doing something that the SCE version of unite doesn't like. Maybe I could avoid adding a column in the first place.

B0ydT · 2023-10-22T05:09:53Z

I refactored the code in an attempt to fix the error. The function works when I run it myself and during unit tests, but fails when I rebuild and run R CMD check. It throws an error saying unite doesn't like tidySCE objects, but I have explicitly converted the object to a tibble, so I don't understand why that is still happening.

stemangiola

Does dplyr allow arbitrary tidy-select functionality here, e.g. contains, starts_with, etc.? If so, you might want to create a split data frame using select(...).

stemangiola · 2023-10-22T07:13:51Z

R/dplyr_methods.R

-        filter(group_col == group_list[[i]], )
-
-      v[[i]] <- select(v[[i]], !group_col)
+      v[[i]] <- .data[,group_list == groups[[i]]]


Does group_col get eliminated after the splitting?

I have simplified this to make it more obvious, but the group_col object never gets added to the table itself. I have, however, added the .keep option to drop the original columns.

stemangiola · 2023-11-10T23:49:20Z

Hello @B0ydT, any news about this PR? Let me know if you need help/more explanation.

stemangiola · 2023-12-04T05:28:42Z

ping

B0ydT · 2023-12-07T09:17:02Z

Thanks for your feedback. It was very clear and I addressed most of it. I'm not 100% sure where you want to use select, though.

I'm still getting an R CMD check error for the call to unite. The function works just fine when run directly, but generates an error when checked.

Edit: My example was pbmc_small |> group_split(pbmc_small, groups) 🤦‍♂️

stemangiola

Amazing, thanks.

The function works well for simple cases. But please notice that with dplyr you can do this

tibble(a=1:10) |> group_split(a>5)

tibble(a=1:10) |> group_split(a==5)

With SingleCellExperiment, I would like to be able to do

pbmc_small |> group_split(PC_1>0)

or

pbmc_small |> group_split(groups=="g1")

This is easy to achieve while preserving your variable query as tidy select. Also, look for "special_column" in the package, and you will see how I adapt all queries to all columns displayed in the tibble representation. Ideally, each function is completely general. For example

pbmc_small |> group_split(PC_1>0 & groups=="g1")

We are close!
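
For reference, a minimal sketch of the plain dplyr behaviour described above, using only a tibble. The object names (`df`, `parts`, `parts_dropped`) are made up for illustration; nothing here touches the SingleCellExperiment method itself.

```r
library(dplyr)

df <- tibble(a = 1:10)

# Grouping by an expression gives one tibble per distinct value (FALSE, TRUE).
parts <- df |> group_split(a > 5)
length(parts)

# By default the computed grouping column is kept, named after the expression.
names(parts[[1]])

# .keep = FALSE drops the grouping column from every piece.
parts_dropped <- df |> group_split(a > 5, .keep = FALSE)
names(parts_dropped[[1]])
```

The SCE method would ideally mirror both the split and the naming of the computed column.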

B0ydT · 2023-12-09T09:07:11Z

I'm not sure it's exactly the method you had in mind, but I think I've fixed it. I totally forgot those logical statements in group_by exist, as I never really use them myself.

I saw a lot of your other methods make use of the original dplyr functions. I had given up on group_split, because I didn't see how a list of tbls would help me split the SCE object, but I just came across group_rows, which allows you to extract the indices from a grouped tbl!

It does not add those "PC_1>0" type columns with logical values yet, but I should be able to add those shortly.
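
A rough sketch of the group_rows idea, assuming a plain matrix as a stand-in for the SCE object so the example runs without Bioconductor. The names `counts`, `cell_data`, `idx`, and `pieces` are hypothetical, and the actual method in this PR may differ in its details.

```r
library(dplyr)

# Stand-in for the SCE: one column per cell, plus per-cell metadata.
counts <- matrix(
  seq_len(12), nrow = 2,
  dimnames = list(c("gene1", "gene2"), paste0("cell", 1:6))
)
cell_data <- tibble(
  cell   = colnames(counts),
  groups = c("g1", "g2", "g1", "g2", "g1", "g1"),
  PC_1   = c(-1.2, 0.4, 2.1, -0.3, 0.9, -0.5)
)

# Let dplyr evaluate the grouping expression on the metadata,
# then pull out the row indices belonging to each group.
idx <- cell_data |>
  group_by(PC_1 > 0 & groups == "g1") |>
  group_rows()

# Use those indices to subset the columns (cells) of the data object.
pieces <- lapply(idx, function(i) counts[, i, drop = FALSE])
```

The appeal of this route is that all expression handling is delegated to dplyr, and the object itself is only ever subset by integer index.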

stemangiola

Amazing. Please add my tests above as unit tests, and then I think we might be done!

B0ydT · 2023-12-10T05:15:25Z

Have added the tests.

I am trying to sidestep all of the name corrections so that the names of new columns are consistent with what you'd get from the dplyr functions, i.e. groups=="g1". This is also necessary so that they can be dropped when .keep = FALSE.

The closest I've gotten is

  colData(.tbl) <- .tbl |> 
    colData() |> 
    as_tibble() |> 
    dplyr::mutate(!!!var_list) |> 
    DataFrame(check.names = FALSE)

I can use colData(.tbl) to see that the names made it in unchanged, but any of the dplyr methods for SCEs 'fix' the names. I could, of course, add the columns after the split, but I don't think anyone wants to see groups.....g1. in their column names, and I'd still need to rethink the .keep method. Of course, if the underlying package heavily relies on the assumption that names will all be correct, then my ideal solution may not be viable.
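
To illustrate the kind of name 'fixing' described here, a small sketch using a base R data.frame as a stand-in. That the SCE methods mangle names through the same mechanism is an assumption; the example only shows where a name like groups.....g1. can come from.

```r
# The column name dplyr would produce for the expression groups == "g1".
expr_name <- 'groups == "g1"'

# make.names() replaces every character that is invalid in a syntactic name with ".".
make.names(expr_name)   # "groups.....g1."

# check.names = FALSE keeps the name as given ...
kept <- data.frame(`groups == "g1"` = c(TRUE, FALSE, TRUE), check.names = FALSE)
names(kept)

# ... while check.names = TRUE passes it through make.names().
fixed <- data.frame(`groups == "g1"` = c(TRUE, FALSE, TRUE), check.names = TRUE)
names(fixed)
```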

stemangiola · 2023-12-10T05:23:18Z

I don't see the problem. This looks good to me:

pbmc_small |> group_split(PC_1>0 & groups == "g2") %>% .[[1]] |> select(groups)
tidySingleCellExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
# A tibble: 75 × 1
  groups
  <chr> 
1 g2    
2 g1    
3 g2    
4 g2    
5 g2    
6 g1    
7 g1    
8 g1    
9 g1    
10 g1

If it behaves well enough for the vast majority of use cases, I would say let's go with this, and we can improve it in the future.

It would be good to translate this to tidyseurat, and the more complicated tidySummarizedExperiment.

stemangiola · 2023-12-10T05:41:29Z

Congrats @B0ydT !

Let me know what you think about repurposing your PR.

B0ydT added 6 commits October 15, 2023 17:18

group_split by column

9dfd97e

group_split by column

b87b8dc

issue #35975561

Merge branch 'boyd' of https://github.com/B0ydT/tidySingleCellExperiment

ef56220

into boyd

split groups row data

ea64437

split groups row data

2da015c

Issue #35975561

Merge branch 'boyd' of https://github.com/B0ydT/tidySingleCellExperiment

5b42673

into boyd

stemangiola reviewed Oct 16, 2023

B0ydT added 2 commits October 21, 2023 13:34

tidy up code and docs

1ac4299

refactor group_split

3c9648b

Issue #97

B0ydT added 3 commits October 22, 2023 11:30

docs

c12ec7a

fix warning, error persists

9f22be8

I seem to be doing something that the SCE version of unite doesn't like. Maybe I could avoid adding a column in the first place.

should fix error but doesn't

dee22e5

stemangiola reviewed Oct 22, 2023

B0ydT and others added 7 commits December 7, 2023 18:22

group_column___

4b7565e

consistency with dplyr

5a55e7a

simplify

1de8c85

Merge branch 'stemangiola:master' into boyd

81a3ee8

quotes to fix global binding issue

9c1cbeb

fix .keep

1ea5829

Think I accidentally reverted this

check dots

c7585e3

B0ydT added 2 commits December 7, 2023 19:35

fix example code

2d52d99

consistency

9347b5d

stemangiola self-requested a review December 9, 2023 07:07

stemangiola requested changes Dec 9, 2023

B0ydT added 2 commits December 9, 2023 19:04

drop tests with qoutes

332e30e

use dplyr functions

0fb669a

stemangiola requested changes Dec 9, 2023

tests

b2a7e3d

stemangiola approved these changes Dec 10, 2023

stemangiola merged commit ecc9b3a into stemangiola:master Dec 10, 2023
2 of 3 checks passed

B0ydT deleted the boyd branch January 25, 2024 06:47
