Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch integration - Create separate solution file #218

Merged
merged 13 commits into from
Aug 30, 2023

Conversation

rcannood
Copy link
Member

@rcannood rcannood commented Aug 23, 2023

Describe your changes

  • Batch integration: Split the unintegrated file into a solution and a dataset file

Issue ticket number and link

Closes #xxxx (Replace xxxx with the GitHub issue number)

Checklist before requesting a review

  • I have performed a self-review of my code

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI Tests succeed and look good!

@rcannood rcannood changed the title wip commit Batch integration - Create separate solution file Aug 23, 2023
Copy link
Member Author

@rcannood rcannood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumichae I added some comments for your review

@rcannood rcannood requested a review from mumichae August 23, 2023 12:59
Copy link
Collaborator

@mumichae mumichae left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Just a general comment on how we use the solution adata. It's probably a question of preference, but for metrics that don't compare integrated and unintegrated objects, is it necessary to transfer the integrated representations to the solution object? I would treat the solution more as the unintegrated object from which I can pull additional metadata if needed (which for metrics we have here is not really the case, but there are other metrics (existing and potentially future metrics) that would need that information. Personally, I find it more confusing to transfer the integrated representations to the solution object instead of doing it the other way arround (selectively extracting metadata from the solution for the integrated object)

description: The organism of the sample in the dataset.
required: false
- type: object
name: knn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same information as in .uns['neighbors'] that scanpy.neighbors generates?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yepyep! Should I rename it to knn_neighbors?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'neighbors' is the default and what most methods expect out of the box, so I'd go with that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the obsp is called knn_distances and knn_connectivities, so then we should rename that to distances and connectivities as well. WDYT?

Comment on lines +18 to +25
input_solution = ad.read_h5ad(par['input_solution'])
input_integrated = ad.read_h5ad(par['input_integrated'])

input_solution.obsp["connectivities"] = input_integrated.obsp["connectivities"]
input_solution.obsp["distances"] = input_integrated.obsp["distances"]

# TODO: if we don't copy neighbors over, the metric doesn't work
input_solution.uns["neighbors"] = input_integrated.uns["neighbors"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to transfer the kNN graph to the solution? Wouldn't be sufficient to run on the integrated input only?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to compare the integrated object's KNN graph to the solution's label, no? Because if we only compute the ARI and NMI on the data that is in the integrated object, then we don't know for certain that this anndata still has the same dimensionality as the solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add the metadata to the integrated object instead. That way we can check that the dimensions match, but are essentially evaluating the integrated object. I think it's a matter of taste and would find it more intuitive to base things around the integrated object

Comment on lines +15 to +22
input_solution = ad.read_h5ad(par['input_solution'])
input_integrated = ad.read_h5ad(par['input_integrated'])

input_solution.obsp["connectivities"] = input_integrated.obsp["connectivities"]
input_solution.obsp["distances"] = input_integrated.obsp["distances"]

# TODO: if we don't copy neighbors over, the metric doesn't work
input_solution.uns["neighbors"] = input_integrated.uns["neighbors"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to transfer the kNN graph to the solution? Wouldn't be sufficient to run on the integrated input only?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather compare the integrated graph to the solution's label, to make sure that the integrated object has the same dimensionality.

Of course, we could also just put an assertion here.

@rcannood
Copy link
Member Author

rcannood commented Aug 24, 2023

Looks good overall. Just a general comment on how we use the solution adata. It's probably a question of preference, but for metrics that don't compare integrated and unintegrated objects, is it necessary to transfer the integrated representations to the solution object?

I'm transferring the integrated representations to the solution object to ensure that the method output didn't make any changes to the labels and such. Alternatively we could also do a lot of assertions instead of copying the data over, although it seems easier to me to just copy the method output over to the solution object instead.

I would treat the solution more as the unintegrated object from which I can pull additional metadata if needed (which for metrics we have here is not really the case, but there are other metrics (existing and potentially future metrics) that would need that information.

👍

Personally, I find it more confusing to transfer the integrated representations to the solution object instead of doing it the other way arround (selectively extracting metadata from the solution for the integrated object)

I guess it's more or less the same thing, as long as we're sure that there were no changes to the method output which could affect the metric scores in some way.

@rcannood rcannood merged commit 0ec105d into main Aug 30, 2023
5 checks passed
@rcannood rcannood deleted the batch_integration/create_separate_solution_file branch August 30, 2023 12:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants