In preparation for refactoring Augur's logic to read/write sequences and then add support for compressed sequences, the following section documents which subcommands read or write sequences by their input argument and how those sequences are read/written.
Places where Augur reads/writes sequences
parse
sequences
BioPython SeqIO.parse, iterate through all sequences
output-sequences
BioPython SeqIO.write one sequence at a time to a file handle
filter
sequences
BioPython SeqIO.index, random access of specific sequences
output
BioPython SeqIO.write, all sequences at once with an iterator to a filename
mask
sequences — called “sequences” but the expectation in the code is that this input is an alignment?
Multiple OS-level checks for whether the input file exists and is non-zero in size
BioPython SeqIO.parse, iterate through all sequences
output
BioPython SeqIO.write, one sequence at a time to a file handle
align
sequences
read_sequences function that accepts one or more input filenames, reads each file with SeqIO.parse, and returns a list of distinct sequence records. Raises an AlignmentError exception the first time it encounters a duplicate strain name with a different sequence (implicitly de-duplicates records with matching sequences and names).
write_seqs function writes an iterable all at once to a filename prior to running the alignment command. This function is currently a redundant wrapper around BioPython’s SeqIO.write that catches any FileNotFoundError exceptions and re-raises them as AlignmentError exceptions.
reference-sequence — expected to be a GenBank file with a name field instead of an id field?
read_reference function reads a single sequence from a GenBank or FASTA file (using filename extensions to guess format) using BioPython SeqIO.read
existing-alignment
read_alignment function that redundantly wraps BioPython AlignIO.read and catches all exceptions just to re-raise them as AlignmentError exceptions.
debug — implicit alignment outputs in FASTA format
shutil.copyfile to make copies of the input and/or output FASTA files
output
write_seqs
tree
alignment
FASTA input with mask sites
BioPython SeqIO.parse to loop through input alignment one record at a time
BioPython SeqIO.write, one sequence at a time to a file handle
VCF input: variable FASTA created with write_out_informative_fasta function that uses BioPython SeqIO.write to write a list of sequence records to a filename
refine
alignment
Passed as a filename to TreeTime and TreeAnc classes
vcf-reference
Passed as a filename to treetime.vcf_utils.read_vcf
ancestral
alignment
Passed as a filename to TreeAnc class
vcf-reference
Passed as a filename to treetime.vcf_utils.read_vcf
output-sequences
BioPython SeqIO.write a list of all sequences at once to a filename
translate
reference-sequence — GenBank or GFF file with annotations
BCBio GFF.parse for filename with .gff extension.
BioPython SeqIO.read for all other filenames but assumes the input is in GenBank format (FASTA will not work).
reconstruct-sequences
vcf-aa-reference
BioPython SeqIO.parse, looping through each record from a file handle where sequences are expected to be (but not verified to be) amino acid sequences
clades
reference
Not used.
sequence-traits
vcf-reference
Passed as a filename to treetime.vcf_utils.read_vcf
distance
alignment
reconstruct_sequences.load_alignments function that accepts one or more input FASTA filenames and corresponding gene names, reads each FASTA file with BioPython AlignIO.read, and returns a dictionary of multiple sequence alignment objects indexed by gene name. Strangely, load_alignments is never used in the reconstruct_sequences.py module where it is defined.
titers sub
alignment
Uses reconstruct_sequences.load_alignments function as in distance.py
frequencies
alignments
Uses BioPython AlignIO.read to loop through each sequence and create a new MultipleSeqAlignment instance without internal nodes.
Iterates over one or more alignment input files by gene name (analogous to load_alignments but without loading all alignments in memory at once).
export v1
reference
Calls BioPython SeqIO.read from get_root_sequence function to load reference sequence.
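To make the de-duplication behavior described for align's read_sequences concrete, here is a minimal sketch of that logic. It is not Augur's code: parse_fasta is a hypothetical stdlib-only stand-in for BioPython's SeqIO.parse, the function takes open handles rather than filenames, records are plain (name, sequence) tuples, and the error message is invented for illustration.

```python
from io import StringIO


class AlignmentError(Exception):
    """Stand-in for the AlignmentError exception named in the issue."""


def parse_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA handle.

    Hypothetical stand-in for SeqIO.parse so the sketch has no third-party
    dependencies; it keeps only the first word of each header line.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_sequences(*handles):
    """Return distinct (name, sequence) records across all inputs."""
    seen = {}
    for handle in handles:
        for name, sequence in parse_fasta(handle):
            if name not in seen:
                seen[name] = sequence
            elif seen[name] != sequence:
                # Same strain name, different sequence: fail on first conflict.
                raise AlignmentError(
                    f"Duplicate strain {name!r} with a different sequence"
                )
            # Same name and same sequence: silently skip the duplicate.
    return list(seen.items())


first = StringIO(">A\nACGT\n>B\nGGCC\n")
second = StringIO(">A\nACGT\n>C\nTTAA\n")
records = read_sequences(first, second)
```

Here strain A appears in both inputs with an identical sequence, so it is kept once. Changing either copy of A would raise AlignmentError instead.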
Other notes
15 of 20 commands read or write sequences!
“FASTA” is inconsistently written throughout our code and docs as “FASTA”, “fasta”, and “Fasta”
Commands most frequently identify sequence file type by extension or not at all (assuming that a given file is the correct format).
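Since most commands guess the file type from the extension or not at all, one option for the planned compressed-sequence support is a single shared helper that inspects the first bytes of a file instead. The sketch below is only an illustration under that assumption: open_sequences is a hypothetical name, not an existing Augur function, and it handles reading only, for gzip and xz inputs.

```python
import gzip
import lzma
import os
import tempfile

# Leading "magic" bytes of each supported compressed format.
MAGIC_OPENERS = {
    b"\x1f\x8b": gzip.open,       # gzip
    b"\xfd7zXZ\x00": lzma.open,   # xz
}


def open_sequences(path):
    """Open a sequence file for reading as text, decompressing if needed.

    Hypothetical helper: detects compression from the file's leading bytes
    rather than trusting the filename extension.
    """
    with open(path, "rb") as raw:
        head = raw.read(6)
    for magic, opener in MAGIC_OPENERS.items():
        if head.startswith(magic):
            return opener(path, "rt")
    return open(path, "rt")


with tempfile.TemporaryDirectory() as tmp:
    # The extension deliberately hides the fact that this file is gzipped.
    path = os.path.join(tmp, "sequences.fasta")
    with gzip.open(path, "wt") as handle:
        handle.write(">A\nACGT\n")
    with open_sequences(path) as handle:
        text = handle.read()
```

The returned handle could then be passed to any reader that accepts a file handle, which would keep format detection in one place rather than in each command.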