Sequence-manipulation-tools

A few Linux commands and Python scripts that I use a lot.

Mutating a DNA sequence at a known rate

You can use the script MutateDNAString.py to introduce n mutations into a given DNA sequence. For example, if you give the code the DNA string 'AACA' and ask for 1 mutation, you might get 'TACA' back. Each position has an equal chance of being picked for mutation, and the picked base is converted to one of the 3 other bases with equal probability. If you introduce multiple mutations (n > 1), the code is written in such a way that each position can be mutated only once, but this is easy to change in the code.

To write this script I modified code obtained from this fantastic resource https://hplgit.github.io/bioinf-py/doc/pub/html/main_bioinf.html to fit my needs.

To run the code, provide the string of your choice and the number of mutations you want to introduce:

python MutateDNAString.py --String AACA --nMut 2
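The sampling scheme described above (n distinct positions, each replaced by one of the three other bases) can be sketched roughly as follows. This is a minimal illustration, not the contents of MutateDNAString.py; the function name `mutate_dna` and the `rng` parameter are made up for the example.

```python
import random

def mutate_dna(seq, n_mut, rng=None):
    """Return a copy of seq with n_mut distinct positions substituted.

    Hypothetical helper for illustration; not the script's actual code.
    """
    rng = rng or random.Random()
    if n_mut > len(seq):
        raise ValueError("n_mut cannot exceed the sequence length")
    out = list(seq)
    # Sample positions without replacement so each one is mutated at most once.
    for pos in rng.sample(range(len(seq)), n_mut):
        # Pick uniformly among the three bases that differ from the current one.
        # (For a non-ACGT character such as N, all four bases are candidates.)
        choices = [b for b in "ACGT" if b != out[pos].upper()]
        out[pos] = rng.choice(choices)
    return "".join(out)

# Passing a seeded generator makes the result reproducible.
print(mutate_dna("AACA", 1, random.Random(0)))
```

To allow the same position to be hit more than once, the positions could be drawn with replacement instead, for example with `rng.choices(range(len(seq)), k=n_mut)`.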

To mutate all sequences in a fasta file at a given substitution rate, you will need BioPython to read the file. Here, provide the number of mutations you want as a percentage value and run the code:

python MutateDNAFasta.py --fasta test.fa --nMut 50 --output testOutput.fa
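As a rough sketch of what a percentage-based run might do per record: convert the percentage to a number of positions, then substitute that many distinct positions. The rounding rule, the helper names `read_fasta` and `mutate_percent`, and the plain-Python FASTA reader (used here instead of BioPython so the example has no dependencies) are all assumptions for illustration, not the behaviour of MutateDNAFasta.py.

```python
import io
import random

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def mutate_percent(seq, percent, rng):
    """Substitute round(len(seq) * percent / 100) distinct positions."""
    n_mut = round(len(seq) * percent / 100)  # assumed rounding rule
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        # Replace with one of the three bases that differ from the current one.
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos].upper()])
    return "".join(out)

rng = random.Random(0)
fasta = io.StringIO(">seq1\nAACAGGTT\n>seq2\nACGT\n")
for header, seq in read_fasta(fasta):
    print(f">{header}\n{mutate_percent(seq, 50, rng)}")
```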

Sorting bed file

sort -k1,1 -k2,2n input.bed > input.sorted.bed
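If the sort has to happen inside a Python pipeline, the same ordering (chromosome name as text, then start as a number) can be approximated as below. The function name `sort_bed` is hypothetical, and Python compares strings by code point, which may differ from `sort` under some locales.

```python
def sort_bed(lines):
    """Sort BED lines by chromosome name (as text), then by numeric start."""
    def key(line):
        fields = line.rstrip("\n").split("\t")
        return fields[0], int(fields[1])
    return sorted(lines, key=key)

records = ["chr2\t500\t600\n", "chr10\t5\t50\n", "chr2\t40\t90\n"]
print("".join(sort_bed(records)), end="")
```

Note that, as with `sort -k1,1`, chr10 sorts before chr2 because the chromosome column is compared as text.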

Extending or reducing coordinates in bed file

awk '{ print $1"\t"$2-50"\t"$3+50 }' Input.bed > InputExtended.bed

(Extends each interval by 50 on both sides. Swap the signs, $2+50 and $3-50, to reduce it instead.)
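One thing the awk one-liner does not guard against is a start coordinate dropping below 0, or an interval collapsing when it is narrowed too far. A small sketch of the same arithmetic with those checks, where a positive `pad` widens and a negative `pad` narrows (the function name `resize_interval` is made up for this example):

```python
def resize_interval(chrom, start, end, pad):
    """Widen (pad > 0) or narrow (pad < 0) a BED interval on both sides."""
    new_start = max(0, start - pad)  # BED starts cannot be negative
    new_end = end + pad              # not clamped: chromosome length is unknown here
    if new_start >= new_end:
        raise ValueError(f"interval {chrom}:{start}-{end} collapses with pad={pad}")
    return chrom, new_start, new_end

print(resize_interval("chr1", 100, 300, 50))   # ('chr1', 50, 350)
print(resize_interval("chr1", 100, 300, -50))  # ('chr1', 150, 250)
```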

Counting # of sequences in fasta file

grep -c "^>" seq.fa
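For reference, a Python equivalent that counts only lines beginning with ">" (so a ">" inside a header description is not counted twice) might look like this. The name `count_fasta_records` is hypothetical.

```python
import io

def count_fasta_records(handle):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))

example = io.StringIO(">seq1 note: a > b\nACGT\n>seq2\nGGCC\n")
print(count_fasta_records(example))  # 2
```

With a real file this would be called as `count_fasta_records(open("seq.fa"))`.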

Selecting only certain chromosomes from bed file for ML training/testing tasks

awk 'BEGIN{OFS="\t";} { if($1 == "chr4" || $1 == "chr2" || $1 == "chr18" || $1 == "chr20" || $1 == "chr9" || $1 == "chr13" ) { print }}' input.bed > inputTrain.bed
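The same filter is easy to express in Python when the chromosome list gets long. A minimal sketch, with `TRAIN_CHROMS` and `select_chromosomes` as illustrative names; the chromosome set mirrors the awk command above.

```python
TRAIN_CHROMS = {"chr2", "chr4", "chr9", "chr13", "chr18", "chr20"}

def select_chromosomes(lines, chroms):
    """Yield only the BED lines whose first column is in chroms."""
    for line in lines:
        fields = line.split()  # whitespace split, like awk's default
        if fields and fields[0] in chroms:
            yield line

bed = ["chr1\t10\t20\n", "chr4\t5\t50\n", "chr20\t7\t9\n", "chrX\t1\t2\n"]
print("".join(select_chromosomes(bed, TRAIN_CHROMS)), end="")
```

Keeping the training chromosomes in a set also makes it simple to build the held-out split by testing `fields[0] not in chroms` instead.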

Dinucleotide shuffle fasta

Check out the fasta-dinucleotide-shuffle-py3.py script in the MEME Suite.

Concatenate fasta files with similar ID

The seqkit concat function is really fast: https://bioinf.shenwei.me/seqkit/usage/#concat

Apply some function to multiple files with common extension in linux (multiple peak/bed files)

dir="/filelocation"
dir2="/Outputfilelocation"
for sample in "$dir"/*.peaks; do
    base=$(basename "$sample" ".peaks")
    # write a 100 bp window centred on the value in column 4
    awk '{print $1"\t"$4-50"\t"$4+50}' "$dir/$base.peaks" > "$dir2/$base.peaks"
done
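The same batch pattern in Python, for cases where the per-file step outgrows an awk one-liner. This is a sketch under a few assumptions: the function names are invented, column 4 is taken to hold an integer coordinate, and lines with fewer than four columns are skipped (awk would still print something for them).

```python
from pathlib import Path
import tempfile

def window_around_column4(in_path, out_path, half_width=50):
    """Write chrom, col4 - half_width, col4 + half_width for each input line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip short or empty lines
            center = int(fields[3])
            dst.write(f"{fields[0]}\t{center - half_width}\t{center + half_width}\n")

def process_directory(in_dir, out_dir, half_width=50):
    """Apply window_around_column4 to every *.peaks file in in_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for peak_file in sorted(Path(in_dir).glob("*.peaks")):
        target = out_dir / peak_file.name
        window_around_column4(peak_file, target, half_width)
        written.append(target)
    return written

# Small self-contained demo using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp, "in")
    src_dir.mkdir()
    (src_dir / "a.peaks").write_text("chr1\t100\t400\t250\n")
    outputs = process_directory(src_dir, Path(tmp, "out"))
    print(outputs[0].read_text(), end="")
```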
