Standardize naming of strains/sequences/records #877

victorlin · 2022-03-29T18:00:55Z

from #750

Maybe for later PRs: Standardize naming of "strains", "sequences", and "records"

This is something that's mildly bugged me while going through the codebase. I'd love to standardize, and this can probably be done without changing the experience for users, since it's mostly about internal variable names and documentation.

If I had to pick from the 3, my vote is on "record" for the following reasons:

- I think we want this term to refer to a data point that is either metadata, sequence, or a combination of both.
- "strain" seems too specific to the virology domain, and within that domain there's already some nuance in strain/mutant/variant (?)
- "sequence" is already a term for the actual DNA/RNA sequence.
- Biopython has a SeqRecord, meant for sequence (+ optional metadata). This sounds close to our use case. In practice, SeqRecord instances are often variables named record.
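To make the first reason concrete, here is a minimal sketch of a "record" as a data point that carries metadata, a sequence, or both. The `Record` class, its field names, and the sample values are hypothetical, made up for illustration; they are not taken from the codebase or from Biopython.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical container: a named data point with metadata, a sequence, or both."""
    name: str                                     # identifier (what existing columns call "strain")
    metadata: dict = field(default_factory=dict)  # e.g. {"region": "Europe"}
    sequence: Optional[str] = None                # the actual DNA/RNA sequence, if any


# Made-up sample data covering all three cases.
records = [
    Record("A/1", metadata={"region": "Europe"}, sequence="ACGT"),  # both
    Record("A/2", metadata={"region": "Asia"}),                     # metadata only
    Record("A/3", sequence="ACGA"),                                 # sequence only
]

# A sequence-based filter only makes sense for records that have a sequence.
with_sequence = [record.name for record in records if record.sequence is not None]

# A metadata-based filter works on any record, with or without a sequence.
non_europe = [record.name for record in records if record.metadata.get("region") != "Europe"]
```

Under this assumed shape, the loop variable is naturally named `record` whether or not a sequence is attached.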


emmahodcroft · 2022-03-30T09:28:12Z

I agree it would be nice to have some standardization there, but we may want to retain using a mix of the three as they can convey slightly different things. For me, intuitively, 'record' would likely imply metadata and not necessarily the sequence, whereas 'sequence' implies more that this is referencing the actual DNA sequence. For example, saying "we'll exclude all records with more than 10 mutations" would, to me, read very strangely. The reverse is a little less strict to my ear (eye?): "We'll exclude all sequences with region 'Europe'" would not raise my eyebrows.
This may come from past experience - I'm not unused to working with data where I have more clinical records or diagnostic reports, etc, and only some of these have attached sequences. Given what I do, I generally am only concerned with the ones with sequences, though often might do basic summary/manipulation on the entire set to give an overview.

While I think it's probably not worth changing strain in general in our internal coding/columns (this would likely break a lot of things for a lot of people), I would be happy to avoid this somewhat in documentation to be more precise, since strain can also mean a distinct pathogen group ("a new strain of X").

tsibley · 2022-04-01T18:42:32Z

Broadly agree with @emmahodcroft here.

"record" is also as generic as it gets, second only to "data", so I'm reluctant to prefer it.

If we do choose to further converge on terms (whatever the terms are), I recommend we do it over time as we touch parts of the codebase for other work rather than try to make a sweeping change all at once. That is, make the preferred terms a part of our (informal) codebase "policy" for new/changed code. This avoids creating new work, is less effort since it's incremental, and is less likely to introduce accidental breakage since it's integrated into related work that would be getting tested/reviewed more closely than a big find/replace would.

genehack · 2025-01-16T03:52:55Z

Based on Tom's last comment, the lack of movement here, and the general trend in the group towards the progressive/over-time approach Tom advocated, I think this issue can be closed.

nextstrain-bot added this to Nextstrain planning (archived) Mar 30, 2022

nextstrain-bot moved this to New in Nextstrain planning (archived) Mar 30, 2022

victorlin moved this from New to Backlog in Nextstrain planning (archived) Mar 30, 2022

genehack closed this as not planned Jan 16, 2025
