Mask gap characters #1048
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #1048 +/- ##
==========================================
+ Coverage 61.68% 61.81% +0.13%
==========================================
Files 52 52
Lines 6287 6309 +22
Branches 1583 1587 +4
==========================================
+ Hits 3878 3900 +22
Misses 2138 2138
Partials 271 271
☔ View full report in Codecov by Sentry.
Revisiting this today, I think there are two different ways to expose the (command line) arguments:
I prefer (1). Thoughts?
I'd suggest (2) (or similarly worded options) as the more conventional approach here.
Maybe the verbose but very clear
In theory I like 1 (simpler), but I agree with @tsibley that it's not necessarily easy to understand how the arguments interplay. For me I think the options are clear enough - but
This extends `augur mask` to be able to
(a) mask all gaps in all sequences
(b) mask all gaps in sequences which have gaps but no Ns
(c) mask terminal gaps
A quick scan of 100k SARS-CoV-2 samples (samples over time & geography) reveals that around a third have gap characters but no Ns. This indicates bioinformatics pipelines using gap characters to represent missing data are widespread. Option (b) above is intended to fix this.
The current implementation results in no noticeable slowdown. Testing on the above 100k data set with our common nCoV pipeline masking parameters takes ~13s. Masking gap characters doesn't change this.
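The three behaviours listed above can be illustrated on plain strings. This is only a minimal sketch, not the code from this PR or from `augur mask` itself: the function names are hypothetical, and it assumes `-` is the gap character and `N` the masking character.

```python
# Hypothetical helpers illustrating options (a), (b) and (c).
# Assumptions: "-" is the gap character, "N" is the mask character.

def mask_all_gaps(seq: str) -> str:
    """(a) Replace every gap character with N."""
    return seq.replace("-", "N")


def mask_gaps_if_no_ns(seq: str) -> str:
    """(b) Replace gaps only when the sequence has gaps but no Ns."""
    if "-" in seq and "N" not in seq.upper():
        return seq.replace("-", "N")
    return seq


def mask_terminal_gaps(seq: str) -> str:
    """(c) Replace only the leading and trailing runs of gaps."""
    lead = len(seq) - len(seq.lstrip("-"))
    if lead == len(seq):
        # The sequence is empty or consists solely of gaps.
        return "N" * len(seq)
    trail = len(seq) - len(seq.rstrip("-"))
    return "N" * lead + seq[lead:len(seq) - trail] + "N" * trail
```

Option (b) leaves a sequence untouched as soon as it contains a single N, on the reasoning that such a sequence already distinguishes missing data from real deletions.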
e4c5924 to a14d495
Update: I've implemented the ability to mask terminal gaps by reusing the code from https://github.com/nextstrain/ncov/blob/master/scripts/mask-alignment.py and added tests. This will allow our ncov pipeline to use
The argument interface remains as (1) from above; however, I'm more than happy to change to (2).
@rneher points out that:
So do we need the script that we currently use in
Also I felt like I understood this better earlier this week but now I've got myself a bit mixed up - the
note that
so the question is whether this should be in mask rather than align?
Sorry, I was behind on Slack but just caught up and I think I have been interpreting this wrongly: I assumed this was a column-based approach but now I think it might be a 'per sequence' approach. I'm a little less sure about this - outside of nCoV, how confident would one be that a sequence that uses only gaps is 'wrong'? On the other hand, I'm not sure I see another good solution. However, this may raise the question of whether this should be a script for nCoV or something more widely usable in augur?
This documentation was motivated by and should close nextstrain/augur#1043 There was a PR (now closed) related to this issue which expanded the functionality of augur mask: nextstrain/augur#1048
Revisiting this with fresh eyes, I'm going to close this PR -- the introduced functionality is either available in existing augur commands (
I thought about re-writing this PR to only implement
This extends `augur mask` to be able to
(a) mask all gaps in all sequences
(b) mask all gaps in sequences which have gaps but no Ns.
A quick scan of 100k SARS-CoV-2 samples (samples over time & geography)
reveals that around a third have gap characters but no Ns. This
indicates bioinformatics pipelines using gap characters to represent
missing data are widespread. Option (b) above is intended to fix this.
The current implementation results in no noticeable slowdown. Testing on
the above 100k data set with our common nCoV pipeline masking parameters
takes ~13s. Masking gap characters doesn't change this.
Closes #1043
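For anyone wanting to repeat the kind of scan described above on their own data, here is a rough sketch. It is not the script used for the 100k-sample figure (that script is not part of this conversation); `read_fasta` and `fraction_gaps_without_ns` are hypothetical names, and the parser assumes plain FASTA text with `>` header lines.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fraction_gaps_without_ns(handle):
    """Return the fraction of sequences that contain '-' but no 'N'."""
    total = hits = 0
    for _, seq in read_fasta(handle):
        total += 1
        upper = seq.upper()
        if "-" in upper and "N" not in upper:
            hits += 1
    return hits / total if total else 0.0


# Toy input: only the first of the three records has gaps but no Ns.
example = StringIO(">a\nAC-GT\n>b\nACNGT\n>c\nAC-NT\n")
print(fraction_gaps_without_ns(example))
```

The scan streams one record at a time, so memory use stays flat regardless of how many sequences the file holds.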
Note that our nCoV pipeline uses a script instead of `augur mask`. Looking at this I think the only extra functionality is the ability to mask terminal gaps. This would be trivial to add to `augur mask` (but if we are going to mask all gaps in all sequences then we don't need it). I would suggest adding this as another flag `--mask-terminal-gaps` as it doesn't fit well with `--mask-gaps` in this PR. An alternative would be to change this PR to use `--mask-all-gaps` + `--mask-gaps-if-no-Ns` flags. 🤔
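To make the alternative layout concrete, here is a sketch of how the three flag names proposed above could be declared with `argparse`. It is an illustration only, not the interface that exists in augur: the parser name `mask-sketch` is made up, and treating `--mask-all-gaps` and `--mask-gaps-if-no-Ns` as mutually exclusive is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-alone parser with the flag names discussed above."""
    # "mask-sketch" is a hypothetical program name, not an augur command.
    parser = argparse.ArgumentParser(prog="mask-sketch")

    # Assumption: masking all gaps and masking gaps only when no Ns are
    # present are alternatives, so they are placed in an exclusive group.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--mask-all-gaps", action="store_true",
                       help="mask every gap character in every sequence")
    group.add_argument("--mask-gaps-if-no-Ns", action="store_true",
                       help="mask gaps only in sequences without any N")

    # Terminal-gap masking is kept as an independent flag.
    parser.add_argument("--mask-terminal-gaps", action="store_true",
                        help="mask leading and trailing gaps only")
    return parser


args = build_parser().parse_args(["--mask-gaps-if-no-Ns", "--mask-terminal-gaps"])
print(args.mask_gaps_if_no_Ns, args.mask_all_gaps, args.mask_terminal_gaps)
```

Separate boolean flags keep each behaviour self-describing on the command line, at the cost of needing an explicit rule for combinations that do not make sense together.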