Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group_split #97

Merged
merged 23 commits into from
Dec 10, 2023
Merged

group_split #97

merged 23 commits into from
Dec 10, 2023

Conversation

B0ydT
Copy link
Contributor

@B0ydT B0ydT commented Oct 15, 2023

I've got a group_split implementation working for row and column data. I can work on group_by if you're happy with it. If not, let me know where it needs work.

I was a bit stuck overthinking some decisions but just decided to go for it, so I'm very happy to change things up. For example, I wasn't sure if it would be preferable to specify grouping by row/column or autodetecting like I ended up doing.

Issue #71

@stemangiola
Copy link
Owner

Does the function works with an arbitrary set of variables?

expect_equal(length(fd), length(unique(df$groups)))

fd <- df |>
group_split("vst.variable")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question: what is vst.variable? I cannot find it in the object metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the rowData. I realise now that code wasn't doing what I thought it did and that it wouldn't have been helpful if it had!

@B0ydT
Copy link
Contributor Author

B0ydT commented Oct 21, 2023

Does the function works with an arbitrary set of variables?

It does now!

B0ydT added 3 commits October 22, 2023 11:30
I seem to be doing something that the SCE version of unite doesn't like. Maybe could avoiding adding a column in the first place.
@B0ydT
Copy link
Contributor Author

B0ydT commented Oct 22, 2023

I refactored the code in an attempt to fix the error. The function works when I run it myself and during unit tests but fails when I rebuild and run R CMD Check. It throws an error saying unite doesn't like tidySCE objects, but I have explicitly converted the object to a tibble, so I don't understand why that is still happening.

Copy link
Owner

@stemangiola stemangiola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does dplyr allows for arbitrary tidy select functionalities? e.g. contains, starts_with, etc? If so, you might want to create a split data frame using select(...).

R/dplyr_methods.R Outdated Show resolved Hide resolved
filter(group_col == group_list[[i]], )

v[[i]] <- select(v[[i]], !group_col)
v[[i]] <- .data[,group_list == groups[[i]]]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does group_col get eliminated after the splitting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have simplified this to make it more obvious, but group_col the object never gets added to the table itself. I have added the .keep option to drop the original columns, however.

R/dplyr_methods.R Show resolved Hide resolved
@stemangiola
Copy link
Owner

Hello @B0ydT, any news about this PR? Let me know if you need help/more explanation.

@stemangiola
Copy link
Owner

ping

@B0ydT
Copy link
Contributor Author

B0ydT commented Dec 7, 2023

Thanks for your feedback. It was very clear and I addressed most of it. I'm not 100% sure where you want to use select, though.

I'm still getting an R CMD Check error for the call to unite. The function works just fine, but generates an error when checked.

Edit: My example was pbmc_small |> group_split(pbmc_small, groups) 🤦‍♂️

@stemangiola stemangiola self-requested a review December 9, 2023 07:07
Copy link
Owner

@stemangiola stemangiola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thanks.

The function works well for simple cases. But please notice that with dplyr you can do this

tibble(a=1:10) |> group_split(a>5)
tibble(a=1:10) |> group_split(a==5)

With SingleCellExperiment, I would like to be able to do

pbmc_small |> group_split(PC_1>0)

or

pbmc_small |> group_split(groups=="g1")

This is easy to achieve, preserving your variable query as tidy select. Also, look for "special_column" in the package, and you will see how I adapt all queries to all columns displayed in the Tibble representation. Ideally, each function is completely general. For example

pbmc_small |> group_split(PC_1>0 & groups=="g1")

We are close!

@B0ydT
Copy link
Contributor Author

B0ydT commented Dec 9, 2023

I'm not sure it's exactly the method you had in mind, but I think I've fixed it. I totally forgot those logical statements in group_by exist, as I never really use them myself.

I saw a lot of your other methods make use of the original dplyr functions. I had given up on group_split, because I didn't see how a list of tbls would help me split the SCE object, but I just came across group_rows, which allows you to extract the indices from a grouped tbl!

It does not add those "PC_1>0" type columns with logical values yet, but I should be able to add those shortly.

Copy link
Owner

@stemangiola stemangiola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing. Please add my tests above as unit tests, and then I think we might be done!

@B0ydT
Copy link
Contributor Author

B0ydT commented Dec 10, 2023

Have added the tests.

I am trying to sidestep all of the name corrections so that the names of new columns are consistent with what you'd get from the dplyr functions, i.e. groups=="g1". This is also necessary so that they can be dropped when .keep = FALSE.

The closest I've gotten is

  colData(.tbl) <- .tbl |> 
    colData() |> 
    as_tibble() |> 
    dplyr::mutate(!!!var_list) |> 
    DataFrame(check.names = FALSE)

I can use colData(.tbl) to see that the names made it in unchanged, but any of the dplyr methods for SCEs 'fix' the names. I could, of course, add the columns after the split, but I don't think anyone wants to see groups.....g1. in their column names, and I'd still need to rethink the .keep method. Of course, if the underlying package heavily relies on the assumption that names will all be correct, then my ideal solution may not be viable.

@stemangiola
Copy link
Owner

Have added the tests.

I am trying to sidestep all of the name corrections so that the names of new columns are consistent with what you'd get from the dplyr functions, i.e. groups=="g1". This is also necessary so that they can be dropped when .keep = FALSE.

The closest I've gotten is

  colData(.tbl) <- .tbl |> 
    colData() |> 
    as_tibble() |> 
    dplyr::mutate(!!!var_list) |> 
    DataFrame(check.names = FALSE)

I can use colData(.tbl) to see that the names made it in unchanged, but any of the dplyr methods for SCEs 'fix' the names. I could, of course, add the columns after the split, but I don't think anyone wants to see groups.....g1. in their column names, and I'd still need to rethink the .keep method. Of course, if the underlying package heavily relies on the assumption that names will all be correct, then my ideal solution may not be viable.

don't see the problem. this looks good to me

pbmc_small |> group_split(PC_1>0 & groups == "g2") %>% .[[1]] |> select(groups)
tidySingleCellExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
# A tibble: 75 × 1
  groups
  <chr> 
1 g2    
2 g1    
3 g2    
4 g2    
5 g2    
6 g1    
7 g1    
8 g1    
9 g1    
10 g1    

If it behaves well enough for the vast majority of use cases, I would say let's go with this, and we can improve it in the future.

It would be good to translate this to tidyseurat, and the more complicated tidySummarizedExperiment.

@stemangiola stemangiola merged commit ecc9b3a into stemangiola:master Dec 10, 2023
2 of 3 checks passed
@stemangiola
Copy link
Owner

Congrats @B0ydT !

Let me know what you think about repurposing your PR.

@B0ydT B0ydT deleted the boyd branch January 25, 2024 06:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants