RFC on `soma_joinid`: identifier or index? #164

mlin · 2023-05-10T07:58:16Z

Preview link: https://github.com/single-cell-data/SOMA/blob/mlin/rfc-on-r-soma-joinid/rfcs/soma_joinid_id_or_index.md

johnkerl · 2023-05-10T12:19:03Z

rfcs/soma_joinid_id_or_index.md

+
+Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame.
+
+**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes. 


Suggested change

-**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.

+**However**, beyond this primary definition, `soma_joinid` is also used as the row and column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.

If X is sparse then row and column numbers are not necessarily indices.

Only if it is dense do they necessarily become indices.
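
A minimal sketch of the distinction, in plain Python with made-up values (the `sparse_x` dictionary, the joinids, and both helper functions are illustrative assumptions, not SOMA API). A sparse matrix stored as coordinate entries treats row and column numbers purely as labels, whereas a dense layout addressed by those numbers has to span the whole range up to the largest one.

```python
# Illustrative only: a sparse X stored as {(obs_joinid, var_joinid): value}.
# The joinids below are arbitrary, non-contiguous, and unordered.
sparse_x = {
    (1007, 42): 3.0,
    (12, 900): 1.5,
    (1007, 900): 2.0,
}


def sparse_get(x, obs_joinid, var_joinid):
    """Look up a cell by its coordinates; the coordinates act as keys only."""
    # Simplification: any pair that is not stored reads as zero here.
    return x.get((obs_joinid, var_joinid), 0.0)


def dense_shape_if_indexed_by_joinid(x):
    """Shape a dense matrix would need if joinids were used as offsets."""
    n_rows = max(r for r, _ in x) + 1
    n_cols = max(c for _, c in x) + 1
    return n_rows, n_cols


value = sparse_get(sparse_x, 12, 900)
shape = dense_shape_if_indexed_by_joinid(sparse_x)
```

With only two distinct rows and two distinct columns stored, the dense shape is still driven by the largest joinid on each axis.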

bkmartinjr · 2023-05-10T17:32:27Z

rfcs/soma_joinid_id_or_index.md

+
+> Every `SOMADataFrame` must contain a column called `soma_joinid`, of type `int64` and domain `[0, 2^63-1]`. The `soma_joinid` column contains a unique value for each row in the `SOMADataFrame`, and intended to act as a join key for other objects, such as `SOMASparseNDArray`.
+
+Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame.


I find the definition of index confusing - there is nothing about an index that requires it to be contiguous, starting at zero/one, etc.

I think you want to separate the concept of an "index" (aka key) from "offset" (in Pandas terminology, you could have a string index, but you always have an integer iloc (offset)).

Using index in this way confuses the DB organization with the in-memory representation.
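
To make the key/offset distinction concrete, here is a small sketch in plain Python (the row data and the `offset_of` helper are hypothetical; Pandas' `.loc` and `.iloc` play the analogous roles). The joinid is a key that survives subsetting, while the offset is simply a row's position in whatever table currently holds it.

```python
# Hypothetical var table: each row carries its soma_joinid (the key).
var_rows = [
    {"soma_joinid": 10, "gene": "A"},
    {"soma_joinid": 11, "gene": "B"},
    {"soma_joinid": 12, "gene": "C"},
    {"soma_joinid": 13, "gene": "D"},
]


def offset_of(rows, joinid):
    """Return the positional offset of the row with the given key."""
    for offset, row in enumerate(rows):
        if row["soma_joinid"] == joinid:
            return offset
    raise KeyError(joinid)


# Subset the table, preserving the original soma_joinid values.
subset = [row for row in var_rows if row["gene"] in {"B", "D"}]

offset_before = offset_of(var_rows, 13)  # position in the full table
offset_after = offset_of(subset, 13)     # position in the subset
```

The same key resolves to different offsets in the two tables, which is why the two concepts are worth naming separately.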

bkmartinjr · 2023-05-10T17:38:31Z

rfcs/soma_joinid_id_or_index.md

+
+Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame.
+
+**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes. 


I disagree with this. For sparse arrays (`SOMADataFrame`, `SOMASparseNDArray`), `soma_dim_N` is not an index in the way you define it above (i.e. a physical offset). It is an identifier, just like `soma_joinid`. The only constraint on `soma_joinid`/`soma_dim_N` is that the values are non-negative `int64`. They can be in any order, be non-contiguous, and span any range.

Common usage will use a contiguous range, but the spec absolutely (and intentionally) doesn't require that.

The one nuance here -- if you have a sparse X in an Experiment, you will only have implicit zeros where there is a defined soma_joinid pair in the obs/var DataFrame. In other words, any soma_dim_N not in the sparse array does not exist in the X index domain unless the joinid is defined in the corresponding obs/var dataframe.

Dense is another matter, and still has unresolved design issues due to the fact that we allow "sparse" joinid domains in the obs/var dataframes.
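
The nuance about implicit zeros can be sketched as follows, in plain Python with hypothetical names (`x_value`, `obs_ids`, `var_ids`, and `stored` are illustrative, not part of any SOMA API). A coordinate pair has a value, possibly an implicit zero, only when both joinids are defined in the axis dataframes.

```python
# Hypothetical axis joinids; note that they are non-contiguous.
obs_ids = {5, 9, 200}
var_ids = {1, 70}

# Stored (nonzero) entries of a sparse X keyed by (obs_joinid, var_joinid).
stored = {(5, 1): 4.0, (200, 70): 2.5}


def x_value(obs_joinid, var_joinid):
    """Return the stored value, an implicit zero, or None if outside the domain."""
    if obs_joinid not in obs_ids or var_joinid not in var_ids:
        return None  # the pair does not exist in X's index domain
    return stored.get((obs_joinid, var_joinid), 0.0)


explicit = x_value(5, 1)        # stored entry
implicit_zero = x_value(9, 70)  # both joinids defined, nothing stored
outside = x_value(6, 1)         # obs joinid 6 is not defined in obs
```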

mlin · 2023-05-10

@bkmartinjr @johnkerl We need to find precise terminology to convey the relationship between `soma_joinid` (not an index) and the row/column numbers of a matrix (obviously indexed sequences; whether the matrix is represented sparsely is an implementation detail, in my view). The original draft is admittedly too strong in equating them, but we need to convey that treating `soma_joinid` as matrix row/column numbers in a certain context imposes certain constraints on its initial, broad definition -- that is probably the most important point of the whole essay.

bkmartinjr · 2023-05-10T17:40:07Z

rfcs/soma_joinid_id_or_index.md

+
+## 2. Dense array indexing
+
+The definition of `soma_joinid` as an arbitrary integer identifier is problematic for constructing dense matrices indexed by this value, since the values needn't start at zero/one and may be much larger than the actual number of entries. [The role of dense matrices remains unresolved](https://github.com/single-cell-data/TileDB-SOMA/issues/1245) as of this writing, and sparse matrices have been suitable for our initial applications. To fully support dense representations in the future, we may need to introduce another column in the `obs`/`var` data frames that is explicitly the index (row number), and use that for dense indexing instead of `soma_joinid`.


It is also problematic for other reasons (see above). By not requiring the axis dataframes to define contiguous joinid domains, we implicitly make them incompatible with dense matrices.
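
One way to picture the extra column floated in the RFC text, as a sketch only (the name `row_number` and the sample joinids are assumptions, not a proposal for the spec): assign each joinid a contiguous offset and use that, rather than the joinid itself, to address a dense matrix.

```python
# Hypothetical obs joinids: a sparse domain far larger than the row count.
obs_joinids = [1_000_000, 17, 52_000]

# Assumed extra column: a contiguous row number per joinid, in table order.
row_number = {joinid: i for i, joinid in enumerate(obs_joinids)}

# Rows needed if the dense matrix is indexed by joinid vs. by row number.
rows_by_joinid = max(obs_joinids) + 1
rows_by_row_number = len(obs_joinids)

# A dense matrix (list of lists) addressed through the row-number mapping.
n_vars = 2
dense_x = [[0.0] * n_vars for _ in range(rows_by_row_number)]
dense_x[row_number[52_000]][1] = 7.0
```

Indexing by joinid would need a million-row allocation for three observations, while the mapping keeps the dense matrix at three rows.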

bkmartinjr · 2023-05-10T18:10:21Z

P.S. @mlin - re-reading my comments, this is a complex enough topic that it may warrant a real-time conversation. I'm available if you think this would be useful.

thetorpedodog · 2023-05-10T19:43:45Z

Two non–content-related notes from me (these apply to both open RFCs right now, so I am pasting this note in both places):

1. Rebase so that you get the pre-commit format verification check.
2. (Optional) Consider formatting to one sentence per line. I did this in 3b5891c, but I didn't formally write it down, so it's not a hard rule, but it is a format I have found useful for editing.

add rfcs/soma_joinid_id_or_index.md

466f0ed

mlin requested review from Shelnutt2, aaronwolen, johnkerl, bkmartinjr, maniarathi, mojaveazure, pablo-gar and atolopko-czi May 10, 2023 07:58

polish

e4e9c67

johnkerl approved these changes May 10, 2023

bkmartinjr reviewed May 10, 2023
