Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add obsm, etc to ExperimentAxisQuery #179

Merged
merged 22 commits into from
Dec 4, 2023
Merged

Add obsm, etc to ExperimentAxisQuery #179

merged 22 commits into from
Dec 4, 2023

Conversation

ebezzi
Copy link
Member

@ebezzi ebezzi commented Nov 6, 2023

Adds obsm, obsp, varm, varp to ExperimentAxisQuery._read and .to_anndata.

Unit tests can be found here: single-cell-data/TileDB-SOMA#1934

@ebezzi ebezzi requested a review from johnkerl November 6, 2023 23:44
joinids = getattr(self._joinids, axis.value)
return axism[layer].read((joinids, col_joinids))

def _axism_inner_csr(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method exists purely to be passed to _read, which loads everything in memory. This works for now but I am not sure this is the right format we want for obsm. In principle, for obsm it would be better to use a csc matrix since it has way more rows than column. Most embeddings though are dense, so maybe we could just use a numpy array here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest looking at the AnnData documentation. obsm/varm/obsp/varp are ndarray in typical use, and I don't see any reason they would not be here... Using a sparse matrix is likely to create incompatibility with downstream user assumptions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, for some reason I thought AnnData didn't enforce the typing here, but it looks like it does. I'll convert to ndarray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its more than "enforce typing" - it is conceptually a dense array in the AnnData data model (but not in SOMA)

def to_anndata(
self,
X_name: str,
*,
column_names: Optional[AxisColumnNames] = None,
X_layers: Sequence[str] = (),
obsm_keys: Sequence[str] = [],
Copy link
Member

@bkmartinjr bkmartinjr Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to use consistent names for X and obsm:

  • X_layers
  • obsm_keys

Suggest simply using _layers

Also, use a tuple, not a list, as the default, a la X_layers. I'm surprised this passed mypy linting...

Copy link
Member

@bkmartinjr bkmartinjr Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, missing the rest of the layers (varm, etc). Any reason we should not add all of it in one pass?

It is probably worth thinking about how to do this, e.g.,

  • simply add all four to the function as optional args,
  • or, add a dict-like structure a la column_names.

@thetorpedodog - style suggestions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I add the other layers, I'd like the approach to be validated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, it looks like AnnData uses "keys" to define the members of obsm, etc. layers seems to specifically refer to X. I'll leave the naming as is right now, but open to more discussion.

Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

several concerns:

  • this doesn't produce a "legal" AnnData, which specifies that obsm/obsp/varm/varp are ndarray
  • the purpose/signature of query._read seems to morphed over time (not just from this PR), such that the docstrings and sig don't match the original "produces pure Arrow objects" intent. We should make an intentional decision to either a) dedicate this code path to the to_anndata use case exclusively, or b) refactor it back to generating Arrow objects, with the conversion to AnnData wrapped around it. I'm OK with either.

Copy link
Contributor

@thetorpedodog thetorpedodog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m liking what I see in general. I have a few points of feedback, both API design–related and otherwise.

python-spec/src/somacore/query/query.py Show resolved Hide resolved
python-spec/src/somacore/query/query.py Outdated Show resolved Hide resolved
Comment on lines 385 to 387
obsm = dict()
for key in obsm_keys:
obsm[key] = self._axism_inner_ndarray(_Axis.OBS, key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non–API style note: this can be made into a comprehension:

obsm = {
  key: self._axism...(..., key)
}

joinids = getattr(self._joinids, axis.value)
return axism[layer].read((joinids, slice(None)))

def _convert_to_ndarray(self, is_obs: bool, T: pa.Table, n_row: int, n_col: int) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More non–API design feedback: to avoid boolean parameters, I would make this either the _Axis you’re using, or pass in the indexer function (looking like self._convert_to_ndarray(self.indexer.by_obs, ...)). Also, to avoid confusion with type names or generic parameters, prefer lowercase for parameter / variable names.

Comment on lines 535 to 536
is_obs = axis is _Axis.OBS
n_row = n_col = len(self._joinids.obs) if is_obs else len(self._joinids.var)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not something for you to do; just thinking aloud here:

it might make sense to add a function to _Axis that gets the given attribute, so that this could look something like:

n_row = n_col = axis.getattr_from(self._joinids)

so it does self._joinids.obs and self._joinids.var by itself without you having to specify (and thus avoids potential obs/var switcheroos)

(maybe also have it take a suffix, so you could say axis.getattr_from(something, "m") to get something.obsm/something.varm)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decided to see what this would look like here #183

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be great - I approved #183. If you want to merge it, I can accommodate the changes here, otherwise we can do it later.

thetorpedodog and others added 4 commits November 30, 2023 12:00
This aims to eliminate the annoyance (and potential error) of writing

    thing = whatever.obs if axis is _Axis.OBS else whatever.var

by letting you say

    thing = axis.getattr_from(whatever)

instead.
@ebezzi ebezzi requested a review from bkmartinjr November 30, 2023 19:48
@ebezzi ebezzi requested a review from bkmartinjr November 30, 2023 23:10
Copy link
Contributor

@thetorpedodog thetorpedodog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor style suggestions; nothing critical. Looks good!

Comment on lines 402 to 405
obsm = obsm_ft.result()
obsp = obsp_ft.result()
varm = varm_ft.result()
varp = varp_ft.result()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like these could be passed as the named arguments to _AxisQueryResult without the need for a temporary variable.

Comment on lines 524 to 529
if key not in self._ms:
raise ValueError(f"Measurement does not contain {key} data")

axism = axis.getitem_from(self._ms, suf="m")
if not (layer and layer in axism):
raise ValueError(f"Must specify '{key}' layer")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In both of these cases, you can use EAFP style:

try:
  axism = axis.getitem_from(...)
except KeyError as ke:
  raise ValueError(...) from ke  # or maybe `from None`
try:
  axism_layer = axism[layer]
except KeyError as ke:
  ...

if not isinstance(axism[layer], data.SparseNDArray):
raise TypeError(f"Unexpected SOMA type stored in '{key}' layer")

joinids = getattr(self._joinids, axis.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

axis.getitem_from(self._joinids)

) -> np.ndarray:
indexer: pd.Index = axis.getattr_from(self.indexer, pre="by_")
idx = indexer(table["soma_dim_0"])
Z = np.zeros(n_row * n_col, dtype=np.float32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: recommend making z lowercase

Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

_read_axis_mappings, self._axism_inner_ndarray, _Axis.OBS, obsm_keys
)
obsp_ft = self._threadpool.submit(
_read_axis_mappings, self._axisp_inner_ndarray, _Axis.OBS, obsp_keys
Copy link
Member

@pablo-gar pablo-gar Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkmartinjr and @ebezzi I believe you had a conversation about this offline (or maybe here but I don't see it). We know that the most common uses case of axisp arrays are numerically sparse, and even though anndata's schema says they should be numpy dense arrays, scanpy's methods fill it in with scipy sparse matrices.

Ultimately we would like to move to a world where either AnnData schema takes both (dense and sparse) or only sparse. Given that the ecosystem already violates the schema in favor of better numerical representation I lean towards tiledbsoma also violating the schema, and then we request AnnData's schema to be relaxed.

What are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle I have no issues. You might want to ping Isaac V. and see what he thinks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I wanted to make sure I didn't miss anything important. I don't see it as a blocker for now, I will file a few issues (here and in AnnData) to move towards the support of sparse matrices in the axisp arrays.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also look at this page, where they claim that obsm etc can be sparse. This is relative to the on-disk format, but I don't believe there is anything that converts them when loading in memory.

Copy link
Member

@pablo-gar pablo-gar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from an API signature viewpoint!

Copy link
Member

@aaronwolen aaronwolen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏🏻

@ebezzi ebezzi merged commit 10dc344 into main Dec 4, 2023
6 checks passed
@ebezzi ebezzi deleted the ebezzi/add-obsm-etc branch December 4, 2023 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants