Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify default values for inference of ambiguous bases #613

Merged
merged 2 commits into from
Aug 27, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 9 additions & 4 deletions augur/ancestral.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,9 +111,9 @@ def register_arguments(parser):
parser.add_argument('--vcf-reference', type=str, help='fasta file of the sequence the VCF was mapped to')
parser.add_argument('--output-vcf', type=str, help='name of output VCF file which will include ancestral seqs')
ambiguous = parser.add_mutually_exclusive_group()
ambiguous.add_argument('--keep-ambiguous', action="store_false", dest='infer_ambiguous',
ambiguous.add_argument('--keep-ambiguous', action="store_true",
help='do not infer nucleotides at ambiguous (N) sites on tip sequences (leave as N).')
ambiguous.add_argument('--infer-ambiguous', action="store_true",
ambiguous.add_argument('--infer-ambiguous', action="store_true", default=True,
help='infer nucleotides at ambiguous (N,W,R,..) sites on tip sequences and replace with most likely state.')
parser.add_argument('--keep-overhangs', action="store_true", default=False,
help='do not infer nucleotides for gaps (-) on either side of the alignment')
Expand Down Expand Up @@ -159,9 +159,14 @@ def run(args):
print("ERROR: this version of augur requires TreeTime 0.7 or later.")
return 1

# Infer ambiguous bases if the user has requested that we infer them (either
# explicitly or by default) and the user has not explicitly requested that
# we keep them.
infer_ambiguous = args.infer_ambiguous and not args.keep_ambiguous

tt = ancestral_sequence_inference(tree=T, aln=aln, ref=ref, marginal=args.inference,
fill_overhangs = not(args.keep_overhangs),
infer_tips = args.infer_ambiguous)
infer_tips = infer_ambiguous)

character_map = {}
for x in tt.gtr.profile_map:
Expand All @@ -173,7 +178,7 @@ def run(args):
character_map[x] = x

anc_seqs['nodes'] = collect_mutations_and_sequences(tt, full_sequences=not is_vcf,
infer_tips=args.infer_ambiguous, character_map=character_map)
infer_tips=infer_ambiguous, character_map=character_map)
# add reference sequence to json structure. This is the sequence with
# respect to which mutations on the tree are defined.
if is_vcf:
Expand Down
58 changes: 58 additions & 0 deletions tests/functional/ancestral.t
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
Integration tests for augur ancestral.

$ pushd "$TESTDIR" > /dev/null
$ export AUGUR="../../bin/augur"

Infer ancestral sequences for the given tree and alignment.
The default is to infer ambiguous bases, so there should not be N bases in the inferred output sequences.

$ ${AUGUR} ancestral \
> --tree ancestral/tree_raw.nwk \
> --alignment ancestral/aligned.fasta \
> --output-node-data "$TMP/ancestral_mutations.json" \
> --output-sequences "$TMP/ancestral_sequences.fasta" > /dev/null
*/treetime/aa_models.py:108: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray (glob)
[0.800038509951648, 1.20274751778601, 1.55207513886163, 1.46600946033173, 0.830022143283238, 1.5416250309563, 1.53255698189437, 1.41208067821187, 1.47469999960758, 0.351200119909572, 0.570542199221932, 1.21378822764856, 0.609532859331199, 0.692733248746636, 1.40887880416009, 1.02015839286433, 0.807404666228614, 1.268589159299, 0.933095433689795]

$ grep N "$TMP/ancestral_sequences.fasta"
>NODE_0000000

Infer ancestral sequences for the given tree and alignment, explicitly requesting that ambiguous bases are inferred.
There should not be N bases in the inferred output sequences.

$ ${AUGUR} ancestral \
> --tree ancestral/tree_raw.nwk \
> --alignment ancestral/aligned.fasta \
> --infer-ambiguous \
> --output-node-data "$TMP/ancestral_mutations.json" \
> --output-sequences "$TMP/ancestral_sequences.fasta" > /dev/null
*/treetime/aa_models.py:108: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray (glob)
[0.800038509951648, 1.20274751778601, 1.55207513886163, 1.46600946033173, 0.830022143283238, 1.5416250309563, 1.53255698189437, 1.41208067821187, 1.47469999960758, 0.351200119909572, 0.570542199221932, 1.21378822764856, 0.609532859331199, 0.692733248746636, 1.40887880416009, 1.02015839286433, 0.807404666228614, 1.268589159299, 0.933095433689795]

$ grep N "$TMP/ancestral_sequences.fasta"
>NODE_0000000

Infer ancestral sequences for the given tree and alignment, explicitly requesting that ambiguous bases are NOT inferred.
There be N bases in the inferred output sequences.

$ ${AUGUR} ancestral \
> --tree ancestral/tree_raw.nwk \
> --alignment ancestral/aligned.fasta \
> --keep-ambiguous \
> --output-node-data "$TMP/ancestral_mutations.json" \
> --output-sequences "$TMP/ancestral_sequences.fasta" > /dev/null
*/treetime/aa_models.py:108: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray (glob)
[0.800038509951648, 1.20274751778601, 1.55207513886163, 1.46600946033173, 0.830022143283238, 1.5416250309563, 1.53255698189437, 1.41208067821187, 1.47469999960758, 0.351200119909572, 0.570542199221932, 1.21378822764856, 0.609532859331199, 0.692733248746636, 1.40887880416009, 1.02015839286433, 0.807404666228614, 1.268589159299, 0.933095433689795]

$ grep N "$TMP/ancestral_sequences.fasta"
>NODE_0000000
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGTATCAACAGGTTTT
AGCTGTGGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGTATCAACAGGTTTT
AGCTGTGGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNN

$ popd > /dev/null
Loading