runFaidx is a wrapper for the Samtools faidx tool to improve ease of use.
Samtools faidx extracts a sequence from a FASTA file (faa, fna, etc.) using the sequence ID. Because the command takes the sequence IDs as a list of arguments, it becomes unwieldy when extracting a large number of sequences. Further, a separate command is required for each group of extracted sequences.
This tool allows users to extract any number of sequences and group the output into separate files. Sequences can be specified by sequence ID or gene name on the command line, or listed in an input file.
- Samtools v1.5
- python
runFaidx.py can extract sequences using four modes.
runFaidx.py is currently unable to search multiple files for the sequence IDs. If the sequences are spread across multiple files, first join the files into one using the cat command.
cat *faa > all.faa
Each sequence element in the fasta file should have its own sequence ID.
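Because sequences are looked up by ID, it can help to confirm that the concatenated file has no repeated IDs before running the tool. The following is a minimal sketch of such a check, not part of runFaidx.py; the function name duplicate_ids is hypothetical, and the sketch assumes the ID is the first whitespace-separated token after the ">" in each header line.

```python
from collections import Counter


def duplicate_ids(fasta_lines):
    """Return the sequence IDs that appear in more than one header line."""
    ids = [
        line[1:].split()[0]  # the ID is the first token after '>'
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    ]
    return sorted(seq_id for seq_id, count in Counter(ids).items() if count > 1)


# Small in-memory example; a real file could be passed as open("all.faa").
example = [">ID_1 hypothetical protein", "MKT", ">ID_2", "MAV", ">ID_1", "MKK"]
print(duplicate_ids(example))
```

An empty list means every header carries a distinct ID.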
python runFaidx.py table -f file.fasta -i lists.tsv
The text file is tab-delimited and should have the following format:
group_1 ID_1 ID_2 ... ID_n
group_2 ID_A ID_B ... ID_n
Here, group is a unique identifier used to name the output file, which will contain the extracted sequences for the IDs listed on that line. Ensure that a tab follows the group name and separates each ID. For example, the output file group_1.fasta will contain the sequences for ID_1, ID_2, ... ID_n. There is no limit to the number of rows or columns in the text file.
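The table format above can be read into a mapping from group name to ID list, and each list then becomes the region arguments of one samtools faidx call. This is a sketch under assumptions rather than the tool's actual code: parse_groups and faidx_command are hypothetical helper names, and the command is only built here, not executed.

```python
import csv
import io


def parse_groups(handle):
    """Map each group name (first column) to the IDs on the same line."""
    groups = {}
    for row in csv.reader(handle, delimiter="\t"):
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue  # skip blank lines
        groups[cells[0]] = cells[1:]
    return groups


def faidx_command(fasta, ids):
    """Build the argument list for: samtools faidx <fasta> <id> [<id> ...]"""
    return ["samtools", "faidx", fasta, *ids]


table = "group_1\tID_1\tID_2\ngroup_2\tID_A\tID_B\n"
groups = parse_groups(io.StringIO(table))
for name, ids in groups.items():
    print(name, faidx_command("file.fasta", ids))
```

samtools faidx writes the extracted sequences to standard output, so a wrapper would redirect each call's output to the file for that group.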
As input, this mode requires a Prokka-generated .tsv file that describes the genes to parse.
python runFaidx.py gene -g genename -f file.fasta -i prokka.tsv
runFaidx.py gene offers three modes for finding the genes of interest, selected with the -m flag. exact requires an exact match. gene matches only the prefix of the gene name. close matches any gene whose name contains the given string.
Let a genome contain the genes esxA_1, esxA_2, and esxH. Running runFaidx.py gene with the parameters -g esxA_1 -m exact will extract the sequence for esxA_1. Using the parameters -g esxA -m gene will extract the sequences for esxA_1 and esxA_2. Running with the parameters -g esx -m close will extract the sequences for all three genes.
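The three matching modes can be illustrated with a short sketch. This is not the tool's implementation: the function name matches is hypothetical, and the gene mode is assumed to compare the query against the part of the gene name before its final underscore, which is one reading of "prefix" that agrees with the example above.

```python
def matches(name, query, mode="exact"):
    """Return True if a gene name matches the query under the given mode."""
    if mode == "exact":
        return name == query
    if mode == "gene":
        # Assumption: the prefix is everything before the last underscore.
        return name.rsplit("_", 1)[0] == query
    if mode == "close":
        return query in name
    raise ValueError(f"unknown mode: {mode}")


genes = ["esxA_1", "esxA_2", "esxH"]
for query, mode in [("esxA_1", "exact"), ("esxA", "gene"), ("esx", "close")]:
    print(mode, [g for g in genes if matches(g, query, mode)])
```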
As input, this mode takes a text file where each line contains a single gene ID. When using this mode, do not include a header line.
python runFaidx.py file -f file.fasta -i file.txt
Roary is a pan-genome analysis tool that outputs a file called gene_presence_absence.csv where the first column contains a unique identifier and the presence of that identified element in a set of genomes is indicated in subsequent columns.
runFaidx.py roary accepts the original Roary file or a copy containing a subset of rows, as long as the columns are intact and the file is comma-delimited.
python runFaidx.py roary -f file.fasta -i gene_presence_absence.csv
runFaidx.py pangenome will extract one instance of each core and accessory gene sequence present in a collection of genomes. As input, this mode requires the Roary output file gene_presence_absence.csv or a copy of this file containing a subset of the rows. A subsetted copy must have the same columns and be comma-delimited.
python runFaidx.py pangenome -f file.fasta -l gene_presence_absence.csv
runFaidx.py pangenome will output two files: access.fasta and core.fasta. core.fasta will contain a representative sequence for each gene that is found in all samples in the input file, while access.fasta will contain a representative sequence for each gene that is missing from at least one sample. The representative is the sequence ID from the first sample that contains a sequence for that gene. For example, consider columns 1 and 15-17 of an example Roary output:
Gene     sample_1    sample_2    sample_3
gene1    g1_ID1      g1_ID2      g1_ID3
gene2                g2_ID2      g2_ID3
In this case, core.fasta will contain the sequence for g1_ID1 and access.fasta will contain the sequence for g2_ID2.
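The core/accessory rule described above can be sketched as follows. This is an illustration under assumptions, not the tool's code: split_pangenome is a hypothetical name, rows are dictionaries such as those produced by csv.DictReader, the sample column names are supplied by the caller, and each cell is assumed to hold at most one sequence ID.

```python
def split_pangenome(rows, samples):
    """Return (core, accessory) lists of representative sequence IDs."""
    core, accessory = [], []
    for row in rows:
        present = [row.get(s, "") for s in samples if row.get(s, "")]
        if not present:
            continue  # gene has no sequence in any listed sample
        # The representative is the ID from the first sample that has one.
        target = core if len(present) == len(samples) else accessory
        target.append(present[0])
    return core, accessory


samples = ["sample_1", "sample_2", "sample_3"]
rows = [
    {"Gene": "gene1", "sample_1": "g1_ID1", "sample_2": "g1_ID2", "sample_3": "g1_ID3"},
    {"Gene": "gene2", "sample_1": "", "sample_2": "g2_ID2", "sample_3": "g2_ID3"},
]
core, accessory = split_pangenome(rows, samples)
print("core:", core)
print("accessory:", accessory)
```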
runFaidx.py will export the sequences to a directory called subset_faidx. To change the name of this directory, use the -o flag. By default, runFaidx.py will create this directory in the current working directory. To change the path, use the -p flag.
Each exported file will have the prefix subset, but this can be changed using the -s flag.