SB Find CpG
Predict regions under strong purifying selection based on high CpG content.
C5-methylated cytosines in CpG pairs can spontaneously deaminate, resulting in C-to-T substitutions. This causes a lower than expected frequency of CpG pairs in regions of DNA not under selective pressure, so regions that retain a high CpG content can act as a rough proxy for coding and regulatory regions.
To predict CpG islands, the sequence is broken up into 200 base pair 'sliding windows', and the observed CpG frequency in each window is compared to the frequency that would be expected if the bases were randomly distributed within that window. To be considered an island, the total GC content of the region must be greater than 50%, and the CpG observed/expected ratio must be greater than 0.6 (60%).
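The window test described above can be sketched in a few lines of Python. This is an illustration only, not SeqBuddy's actual code: the function names are hypothetical, and the expected CpG count is assumed to follow the common (C count × G count) / window length definition, which the text above does not spell out. How passing windows are merged into the reported island coordinates is not covered here.

```python
def cpg_window_stats(window):
    """Return (gc_fraction, obs_exp_ratio) for one window of DNA."""
    seq = window.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    # Assumed definition: expected CpG count if C and G were placed at random
    expected = (c * g) / n
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def passes_island_thresholds(window, min_gc=0.5, min_ratio=0.6):
    """Apply the two thresholds from the text: GC > 50% and obs/exp > 0.6."""
    gc_fraction, obs_exp = cpg_window_stats(window)
    return gc_fraction > min_gc and obs_exp > min_ratio


def flag_windows(sequence, size=200, step=1):
    """Yield (start, passes) for each sliding window (0-based start)."""
    last_start = max(len(sequence) - size + 1, 1)
    for start in range(0, last_start, step):
        yield start, passes_island_thresholds(sequence[start:start + size])
```

For example, `passes_island_thresholds("CCCCGGGG")` is `False`: the window is 100% GC, but it contains only one CpG where two would be expected, giving a ratio of 0.5.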
Depending on the output format selected, the islands will either be represented directly in the sequences using UPPERCASE (non-island regions will be in lowercase), or as annotated features (GenBank and EMBL format).
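As a rough illustration of the UPPERCASE/lowercase representation, the sketch below marks a set of island intervals in a plain sequence string. The function name `mark_islands` is hypothetical, and the 0-based, end-exclusive coordinate convention is an assumption made for this example; it is not taken from the SeqBuddy source.

```python
def mark_islands(sequence, islands):
    """Lowercase the whole sequence, then uppercase each island interval.

    islands: iterable of (start, end) pairs, assumed 0-based and
    end-exclusive for this sketch.
    """
    chars = list(sequence.lower())
    for start, end in islands:
        chars[start:end] = [ch.upper() for ch in chars[start:end]]
    return "".join(chars)


print(mark_islands("acgtacgt", [(2, 5)]))  # acGTAcgt
```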
Caution: This technique is often applied to large sequences, but the underlying SeqBuddy code has not been optimized. For reference, a 4 Mbp sequence requires about 2 GB of memory and 5-10 minutes to run on a modern laptop.
>Mle-Panxα11 cDNA - ML25999a.
atgctgatctcgagcttagttcagttcagcaggttatctccttttaaggagataactata
gatgacgggtgggaccaacttaacaggagtttcatgttcgttctgatggttatctgtgga
actatcgtcactgtccgacaacatacaggtaacatcatctcgtgtaacggtttcacaaaa
tacgacggatccttctccgaggactactgctggacgcagggactctacacgatcagggag
gcgtaccacgtgagcgacgtcaacgtcccttatcccggagttatcccggaggagatccca
ctctgtctaggagacaattgtgataagctagcaaacagcaacaccactcgagtgtatcat
ctgtggtaccagtggatccccttctacttctggctcgcttccgccgccttcttcctccct
tatctgatctacaagagatacggatttggagatatcaagcctctgatccacatgctgtac
aatcctctcgacggggacgaaggagtgaaggcagattcggagaaggcctcaatctggctt
tatcacagattctctatctacatgaacgagcattccatgtacgccaactttatggagaga
cacggaatcggcattctcgttatcgctatcaaggtgatgtacctgatcatctccgtccta
ctcatggtcatgaccgccatgatgttcgagctggctgacttcaagcagtacggtattgtg
tgggcccaacagtggcctgaccctcctgccaatgtcacaggaatcaaggacctgctcttc
cccaagatggttgcttgcgagatcaagagatggggacctactggtctggaggacgagaac
ggaatgtgtgtcctggcccccaacgtcatcaaccagtacatattcctcatcctctggtgg
gcccttgttttcaccattgtctctaacgttttcaacgtactggctggagttataagaatc
gtcttcatctatggttcttaccgccggatgttggctagcgctttcctcagagatgatcct
cattacaagaaggtctactacaagatcggcacctccggtcgggttatcctgaacatgctg
gcagcctccatctctccgacctgcttccaggagatcatgaacaacgtctgtccgcgtctc
atccgggcccacgtctccaagaagggacgaaacctgggcgacgaccccctgttgtag
$: sb Mle-Panxα11.fa -fcpg
########### Islands identified ###########
Mle-Panxα11: 181-419, 640-700, 895-898, 987-1197
##########################################
>Mle-Panxα11 cDNA - ML25999a.
atgctgatctcgagcttagttcagttcagcaggttatctccttttaaggagataactata
gatgacgggtgggaccaacttaacaggagtttcatgttcgttctgatggttatctgtgga
actatcgtcactgtccgacaacatacaggtaacatcatctcgtgtaacggtttcacaaaa
tACGACGGATCCTTCTCCGAGGACTACTGCTGGACGCAGGGACTCTACACGATCAGGGAG
GCGTACCACGTGAGCGACGTCAACGTCCCTTATCCCGGAGTTATCCCGGAGGAGATCCCA
CTCTGTCTAGGAGACAATTGTGATAAGCTAGCAAACAGCAACACCACTCGAGTGTATCAT
CTGTGGTACCAGTGGATCCCCTTCTACTTCTGGCTCGCTTCCGCCGCCTTCTTCCTCCCT
tatctgatctacaagagatacggatttggagatatcaagcctctgatccacatgctgtac
aatcctctcgacggggacgaaggagtgaaggcagattcggagaaggcctcaatctggctt
tatcacagattctctatctacatgaacgagcattccatgtacgccaactttatggagaga
cacggaatcggcattctcgttatcgctatcaaggtgatgtACCTGATCATCTCCGTCCTA
CTCATGGTCATGACCGCCATGATGTTCGAGCTGGCTGACTTcaagcagtacggtattgtg
tgggcccaacagtggcctgaccctcctgccaatgtcacaggaatcaaggacctgctcttc
cccaagatggttgcttgcgagatcaagagatggggacctactggtctggaggacgagaac
ggaatgtgtgtcctggcccccaacgtcatcaaccagtacatattcctcatcctctGGTGg
gcccttgttttcaccattgtctctaacgttttcaacgtactggctggagttataagaatc
gtcttcatctatggttcttaccgccggATGTTGGCTAGCGCTTTCCTCAGAGATGATCCT
CATTACAAGAAGGTCTACTACAAGATCGGCACCTCCGGTCGGGTTATCCTGAACATGCTG
GCAGCCTCCATCTCTCCGACCTGCTTCCAGGAGATCATGAACAACGTCTGTCCGCGTCTC
ATCCGGGCCCACGTCTCCAAGAAGGGACGAAACCTGGGCGACGACCCCCTGTTGTAG
$: sb Mle-Panxα11.fa -fcpg -o genbank
########### Islands identified ###########
Mle-Panxα11: 181-419, 640-700, 895-898, 987-1197
##########################################
LOCUS Mle-Panxα11 1197 bp DNA UNK 01-JAN-1980
DEFINITION Mle-Panxα11 cDNA - ML25999a.
ACCESSION Mle-Panxα11
VERSION Mle-Panxα11
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
CpG_island 182..419
/created_by="SeqBuddy"
CpG_island 641..700
/created_by="SeqBuddy"
CpG_island 896..898
/created_by="SeqBuddy"
CpG_island 988..1197
/created_by="SeqBuddy"
ORIGIN
1 atgctgatct cgagcttagt tcagttcagc aggttatctc cttttaagga gataactata
61 gatgacgggt gggaccaact taacaggagt ttcatgttcg ttctgatggt tatctgtgga
121 actatcgtca ctgtccgaca acatacaggt aacatcatct cgtgtaacgg tttcacaaaa
181 tacgacggat ccttctccga ggactactgc tggacgcagg gactctacac gatcagggag
241 gcgtaccacg tgagcgacgt caacgtccct tatcccggag ttatcccgga ggagatccca
301 ctctgtctag gagacaattg tgataagcta gcaaacagca acaccactcg agtgtatcat
361 ctgtggtacc agtggatccc cttctacttc tggctcgctt ccgccgcctt cttcctccct
421 tatctgatct acaagagata cggatttgga gatatcaagc ctctgatcca catgctgtac
481 aatcctctcg acggggacga aggagtgaag gcagattcgg agaaggcctc aatctggctt
541 tatcacagat tctctatcta catgaacgag cattccatgt acgccaactt tatggagaga
601 cacggaatcg gcattctcgt tatcgctatc aaggtgatgt acctgatcat ctccgtccta
661 ctcatggtca tgaccgccat gatgttcgag ctggctgact tcaagcagta cggtattgtg
721 tgggcccaac agtggcctga ccctcctgcc aatgtcacag gaatcaagga cctgctcttc
781 cccaagatgg ttgcttgcga gatcaagaga tggggaccta ctggtctgga ggacgagaac
841 ggaatgtgtg tcctggcccc caacgtcatc aaccagtaca tattcctcat cctctggtgg
901 gcccttgttt tcaccattgt ctctaacgtt ttcaacgtac tggctggagt tataagaatc
961 gtcttcatct atggttctta ccgccggatg ttggctagcg ctttcctcag agatgatcct
1021 cattacaaga aggtctacta caagatcggc acctccggtc gggttatcct gaacatgctg
1081 gcagcctcca tctctccgac ctgcttccag gagatcatga acaacgtctg tccgcgtctc
1141 atccgggccc acgtctccaa gaagggacga aacctgggcg acgaccccct gttgtag
//