Standardize naming of strains/sequences/records #877

Closed
victorlin opened this issue Mar 29, 2022 · 3 comments

Comments

@victorlin
Member

victorlin commented Mar 29, 2022

from #750

Maybe for later PRs: Standardize naming of "strains", "sequences", and "records"

This is something that's mildly bugged me while going through the codebase. Would love to standardize and this can probably be done without changing the experience for users, since it's mostly about internal variable names and documentation.

If I had to pick from the 3, my vote is on record for the following reasons:

  1. I think we want this term to refer to a data point that is either metadata, sequence, or a combination of both.
  2. strain seems too specific to the virology domain, and within that domain there's already some nuance in strain/mutant/variant (?)
  3. sequence is already a term for the actual DNA/RNA sequence.
  4. Biopython has a SeqRecord, meant for sequence (+ optional metadata). This sounds close to our use case. In practice SeqRecord instances are often variables named record.
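
To make point 1 concrete, here is a minimal sketch using only the standard library. The `Record` class, its field names, and the sample values are hypothetical illustrations, not existing code from this project; the sketch only shows how one term could cover a data point that has metadata, a sequence, or both.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical container: one data point with metadata, a sequence, or both."""

    name: str  # identifier for the data point
    metadata: dict = field(default_factory=dict)  # e.g. {"region": "Europe"}
    sequence: Optional[str] = None  # raw DNA/RNA string, if one is attached

    @property
    def has_sequence(self) -> bool:
        # True only when a sequence is attached to this record
        return self.sequence is not None


# Made-up sample data for illustration only
records = [
    Record("sample_A", metadata={"region": "Europe"}, sequence="ACGT"),
    Record("sample_B", metadata={"region": "Asia"}),  # metadata only
]

# Names of the records that carry a sequence
with_sequences = [r.name for r in records if r.has_sequence]
```

Under this assumption, "sequence" stays reserved for the DNA/RNA string itself, while "record" names the whole data point.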
@emmahodcroft
Member

emmahodcroft commented Mar 30, 2022

I agree it would be nice to have some standardization there, but we may want to retain using a mix of the three as they can convey slightly different things. For me, intuitively, 'record' would likely imply metadata and not necessarily the sequence, whereas 'sequence' implies more that this is referencing the actual DNA sequence. For example, saying "we'll exclude all records with more than 10 mutations" would, to me, read very strangely. The reverse is a little less strict to my ear (eye?): "We'll exclude all sequences with region 'Europe'" would not raise my eyebrows.
This may come from past experience - I'm not unused to working with data where I have more clinical records or diagnostic reports, etc, and only some of these have attached sequences. Given what I do, I generally am only concerned with the ones with sequences, though often might do basic summary/manipulation on the entire set to give an overview.

While I think it's probably not worth changing strain in general in our internal coding/columns (this would likely break a lot of things for a lot of people), I would be happy to avoid this somewhat in documentation to be more precise, since strain can also mean a distinct pathogen group ("a new strain of X").

@tsibley
Member

tsibley commented Apr 1, 2022

Broadly agree with @emmahodcroft here.

record is also as generic as it gets, second only to data, so I'm reluctant to prefer that.

If we do choose to further converge on terms (whatever the terms are), I recommend we do it over time as we touch parts of the codebase for other work rather than try to make a sweeping change all at once. That is, make the preferred terms a part of our (informal) codebase "policy" for new/changed code. This avoids creating new work, is less effort since it's incremental, and is less likely to introduce accidental breakage since it's integrated into related work that would be getting tested/reviewed more closely than a big find/replace would.

@genehack
Contributor

Based on Tom's last comment, the lack of movement here, and the general trend in the group towards the progressive/over-time approach Tom advocated, I think this issue can be closed.

genehack closed this as not planned on Jan 16, 2025