Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mask gap characters #1048

Closed
wants to merge 1 commit into from
Closed

Mask gap characters #1048

wants to merge 1 commit into from

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Sep 19, 2022

This extends augur mask to be able to
(a) mask all gaps in all sequences
(b) mask all gaps in sequences which have gaps but no Ns.

A quick scan of 100k SARS-CoV-2 samples (samples over time & geography)
reveals that around a third have gap characters but no Ns. This
indicates bioinformatics pipelines using gap characters to represent
missing data are widespread. Option (b) above is intended to fix this.

The current implementation results in no noticeable slowdown. Testing on
the above 100k data set with our common nCoV pipeline masking parameters
takes ~13s. Masking gap characters doesn't change this.

Closes #1043

Note that our nCoV pipeline uses a script instead of augur mask. Looking at this I think the only extra functionality is the ability to mask terminal gaps. This would be trivial to add to augur mask (but if we are going to mask all gaps in all sequences then we don't need it). I would suggest adding this as another flag --mask-terminal-gaps as it doesn't fit well with --mask-gaps in this PR. An alternative would be to change this PR to use --mask-all-gaps + --mask-gaps-if-no-Ns flags. 🤔

@codecov
Copy link

codecov bot commented Sep 19, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.81%. Comparing base (862bbb6) to head (a14d495).
Report is 1221 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1048      +/-   ##
==========================================
+ Coverage   61.68%   61.81%   +0.13%     
==========================================
  Files          52       52              
  Lines        6287     6309      +22     
  Branches     1583     1587       +4     
==========================================
+ Hits         3878     3900      +22     
  Misses       2138     2138              
  Partials      271      271              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jameshadfield
Copy link
Member Author

Revisiting this today, I think there are two different ways to expose the (command line) arguments:

  1. --mask-gaps {all, if-only-gaps, terminal} and allow >1 values, e.g. --mask-gaps if-only-gaps terminal
  2. Three flags: --mask-all-gaps, --mask-gaps-if-no-ns, --mask-terminal-gaps

I prefer (1). Thoughts?

@tsibley
Copy link
Member

tsibley commented Sep 19, 2022

--mask-gaps if-only-gaps terminal to me reads as the terminal option conditioned by if-only-gaps or something. It doesn't parse to me as the union of their effects (e.g. always mask terminal gaps, and mask non-terminal gaps if the sequence doesn't contain any Ns).

I'd suggest (2) (or similarly worded options) as the more conventional approach here.

--mask-gaps-if-no-ns--mask-all-gaps-if-no-Ns?

@tsibley
Copy link
Member

tsibley commented Sep 19, 2022

Maybe the verbose but very clear --mask-all-gaps-in-sequences-without-Ns?

@emmahodcroft
Copy link
Member

emmahodcroft commented Sep 20, 2022

In theory I like 1 (simpler), but I agree with @tsibley that it's not necessarily easy to understand how the arguments interplay. For me I think the options are clear enough - but --mask-all-gaps-if-no-Ns is more confusing for me (I think because it references the 'all gaps' option) - I'd say it needs to be either the short original or the very verbose third option.

This extends `augur mask` to be able to
(a) mask all gaps in all sequences
(b) mask all gaps in sequences which have gaps but no Ns
(c) mask terminal gaps

A quick scan of 100k SARS-CoV-2 samples (samples over time & geography)
reveals that around a third have gap characters but no Ns. This
indicates bioinformatics pipelines using gap characters to represent
missing data are widespread. Option (b) above is intended to fix this.

The current implementation results in no noticeable slowdown. Testing on
the above 100k data set with our common nCoV pipeline masking parameters
takes ~13s. Masking gap characters doesn't change this.
@jameshadfield
Copy link
Member Author

Update: I've implemented the ability to mask terminal gaps by reusing the code from https://github.com/nextstrain/ncov/blob/master/scripts/mask-alignment.py & added tests. This will allow our ncov pipeline to use augur mask again.

The argument interface remains as (1) from above, however I'm more than happy to change to (2).

@emmahodcroft
Copy link
Member

@rneher points out that:

augur ancestral treats terminal gaps as missing data by default.

So do we need the script that we currently use in ncov to mask terminal gaps (and/or the propose new flag)?

Also I felt like I understood this better earlier this week but now I've got myself a bit mixed up - the --mask-gaps-if-no-ns flag would turn a column of gaps to Ns only if it consists only of gaps? (So not if it has Ns or bases?) Or only not if it has Ns?
If the latter, is there a risk here that we fill in true but well-defined deletions (where nobody is calling them as N)? Or is that near-impossible?

@rneher
Copy link
Member

rneher commented Sep 21, 2022

note that augur align has this functionality.

augur align -h 
[...]
 --fill-gaps           If gaps represent missing data rather than true indels, replace by N after aligning. (default: False)

so the question is whether this should be in mask rather than align?

@emmahodcroft
Copy link
Member

Sorry, I was behind on Slack but just caught up and I think I have been interpreting this wrongly: I assumed this was a column-based approach but now I think it might be a 'per sequence' approach. I'm a little less sure about this - it seems like outside of nCoV, how confident would one be that a sequence that uses only gaps is 'wrong'? But on the other hand, not sure I see another good solution. However, may raise the question of whether this should be a script for nCoV or something more widely useable in augur?

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request May 25, 2023
This documentation was motivated by and should close
nextstrain/augur#1043

There was a PR (now closed) related to this issue which expanded the
functionality of augur mask: nextstrain/augur#1048
@jameshadfield
Copy link
Member Author

jameshadfield commented May 25, 2023

Revisiting this with fresh eyes I'm going to close this PR -- the introduced functionality is either available in existing augur commands (augur align --fill-gaps) or is pathogen-specific and can be implemented in script. I've instead documented our current behaviour in this PR.

I though about re-writing this PR to only implement augur mask --mask-gaps, which would do largely the same thing as augur align --fill-gaps but would be useful in the cases where we are calling the aligner directly (e.g. using nextalign). Happy to do this if there's interest.

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request May 25, 2023
This documentation was motivated by and should close
nextstrain/augur#1043

There was a PR (now closed) related to this issue which expanded the
functionality of augur mask: nextstrain/augur#1048
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request May 25, 2023
This documentation was motivated by and should close
nextstrain/augur#1043

There was a PR (now closed) related to this issue which expanded the
functionality of augur mask: nextstrain/augur#1048
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request May 26, 2023
This documentation was motivated by and should close
nextstrain/augur#1043

There was a PR (now closed) related to this issue which expanded the
functionality of augur mask: nextstrain/augur#1048
@victorlin victorlin deleted the feat/mask-gaps branch June 30, 2023 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Add support for masking gap characters
4 participants