Steve Bond edited this page Oct 25, 2016 · 4 revisions

--find_CpG, -fcpg

Description

Predict regions under strong purifying selection based on high CpG content.

C5-methylated cytosines in CpG pairs can spontaneously deaminate, resulting in C-to-T substitutions. This causes a lower-than-expected frequency of CpG pairs in regions of DNA not under selective pressure, so regions that retain a high CpG frequency can act as a rough proxy for coding and regulatory regions.

To predict CpG islands, the sequence is scanned in 200 base pair 'sliding windows', and within each window the observed CpG frequency is compared to the frequency that would be expected if the bases were randomly distributed. To be considered an island, the total G+C content of the region must be greater than 50%, and the CpG observed/expected ratio must be greater than 0.6 (60%).
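The window test described above can be sketched in a few lines of Python. This is a minimal illustration and not SeqBuddy's actual code: the function names are hypothetical, the expected CpG count is assumed to be (number of C × number of G) / window length, and the window is assumed to advance one base at a time.

```python
def cpg_window_stats(window):
    """Return (gc_fraction, obs_exp_ratio) for one window of DNA."""
    seq = window.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    # Assumption: expected CpG count under random placement is (C * G) / N
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def is_island_window(window, min_gc=0.5, min_obs_exp=0.6):
    """Apply both thresholds from the description to a single window."""
    gc_fraction, obs_exp = cpg_window_stats(window)
    return gc_fraction > min_gc and obs_exp > min_obs_exp


def island_window_starts(sequence, window_size=200):
    """Yield 0-based start positions of windows that pass both thresholds.

    Assumption: the window slides by one base; a sequence shorter than
    window_size is tested once as a single window.
    """
    last_start = max(len(sequence) - window_size + 1, 1)
    for start in range(last_start):
        if is_island_window(sequence[start:start + window_size]):
            yield start
```

Overlapping or adjacent passing windows would still need to be merged into island ranges; that step is omitted here because the page does not describe how SeqBuddy does it.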

Depending on the output format selected, the islands will either be represented directly in the sequences using UPPERCASE (non-island regions will be in lowercase), or as annotated features (GenBank and EMBL format).
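As a rough illustration of the UPPERCASE representation, the sketch below lowercases a sequence and then uppercases each island range. The helper name `mask_islands` is hypothetical, and the coordinates are assumed to be 1-based and inclusive, as in GenBank feature locations; SeqBuddy's own internals may differ.

```python
def mask_islands(sequence, islands):
    """Lowercase the sequence, then uppercase each (start, end) island.

    Assumption: coordinates are 1-based and inclusive.
    """
    chars = list(sequence.lower())
    for start, end in islands:
        for i in range(start - 1, end):
            chars[i] = chars[i].upper()
    return "".join(chars)


print(mask_islands("acgtacgt", [(2, 4)]))  # aCGTacgt
```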

Caution: This technique is often applied to large sequences, but the underlying SeqBuddy code has not been optimized. For reference, a 4 Mbp sequence requires about 2 GB of memory and 5-10 minutes to run on a modern laptop.

Examples

Input file: Mle-Panxα11.fa

>Mle-Panxα11 cDNA - ML25999a.
atgctgatctcgagcttagttcagttcagcaggttatctccttttaaggagataactata
gatgacgggtgggaccaacttaacaggagtttcatgttcgttctgatggttatctgtgga
actatcgtcactgtccgacaacatacaggtaacatcatctcgtgtaacggtttcacaaaa
tacgacggatccttctccgaggactactgctggacgcagggactctacacgatcagggag
gcgtaccacgtgagcgacgtcaacgtcccttatcccggagttatcccggaggagatccca
ctctgtctaggagacaattgtgataagctagcaaacagcaacaccactcgagtgtatcat
ctgtggtaccagtggatccccttctacttctggctcgcttccgccgccttcttcctccct
tatctgatctacaagagatacggatttggagatatcaagcctctgatccacatgctgtac
aatcctctcgacggggacgaaggagtgaaggcagattcggagaaggcctcaatctggctt
tatcacagattctctatctacatgaacgagcattccatgtacgccaactttatggagaga
cacggaatcggcattctcgttatcgctatcaaggtgatgtacctgatcatctccgtccta
ctcatggtcatgaccgccatgatgttcgagctggctgacttcaagcagtacggtattgtg
tgggcccaacagtggcctgaccctcctgccaatgtcacaggaatcaaggacctgctcttc
cccaagatggttgcttgcgagatcaagagatggggacctactggtctggaggacgagaac
ggaatgtgtgtcctggcccccaacgtcatcaaccagtacatattcctcatcctctggtgg
gcccttgttttcaccattgtctctaacgttttcaacgtactggctggagttataagaatc
gtcttcatctatggttcttaccgccggatgttggctagcgctttcctcagagatgatcct
cattacaagaaggtctactacaagatcggcacctccggtcgggttatcctgaacatgctg
gcagcctccatctctccgacctgcttccaggagatcatgaacaacgtctgtccgcgtctc
atccgggcccacgtctccaagaagggacgaaacctgggcgacgaccccctgttgtag

Usage example 1

$: sb Mle-Panxα11.fa -fcpg

Output

########### Islands identified ###########
Mle-Panxα11: 181-419, 640-700, 895-898, 987-1197
##########################################

>Mle-Panxα11 cDNA - ML25999a.
atgctgatctcgagcttagttcagttcagcaggttatctccttttaaggagataactata
gatgacgggtgggaccaacttaacaggagtttcatgttcgttctgatggttatctgtgga
actatcgtcactgtccgacaacatacaggtaacatcatctcgtgtaacggtttcacaaaa
tACGACGGATCCTTCTCCGAGGACTACTGCTGGACGCAGGGACTCTACACGATCAGGGAG
GCGTACCACGTGAGCGACGTCAACGTCCCTTATCCCGGAGTTATCCCGGAGGAGATCCCA
CTCTGTCTAGGAGACAATTGTGATAAGCTAGCAAACAGCAACACCACTCGAGTGTATCAT
CTGTGGTACCAGTGGATCCCCTTCTACTTCTGGCTCGCTTCCGCCGCCTTCTTCCTCCCT
tatctgatctacaagagatacggatttggagatatcaagcctctgatccacatgctgtac
aatcctctcgacggggacgaaggagtgaaggcagattcggagaaggcctcaatctggctt
tatcacagattctctatctacatgaacgagcattccatgtacgccaactttatggagaga
cacggaatcggcattctcgttatcgctatcaaggtgatgtACCTGATCATCTCCGTCCTA
CTCATGGTCATGACCGCCATGATGTTCGAGCTGGCTGACTTcaagcagtacggtattgtg
tgggcccaacagtggcctgaccctcctgccaatgtcacaggaatcaaggacctgctcttc
cccaagatggttgcttgcgagatcaagagatggggacctactggtctggaggacgagaac
ggaatgtgtgtcctggcccccaacgtcatcaaccagtacatattcctcatcctctGGTGg
gcccttgttttcaccattgtctctaacgttttcaacgtactggctggagttataagaatc
gtcttcatctatggttcttaccgccggATGTTGGCTAGCGCTTTCCTCAGAGATGATCCT
CATTACAAGAAGGTCTACTACAAGATCGGCACCTCCGGTCGGGTTATCCTGAACATGCTG
GCAGCCTCCATCTCTCCGACCTGCTTCCAGGAGATCATGAACAACGTCTGTCCGCGTCTC
ATCCGGGCCCACGTCTCCAAGAAGGGACGAAACCTGGGCGACGACCCCCTGTTGTAG

Usage example 2

$: sb Mle-Panxα11.fa -fcpg -o genbank

Output

########### Islands identified ###########
Mle-Panxα11: 181-419, 640-700, 895-898, 987-1197
##########################################

LOCUS       Mle-Panxα11             1197 bp    DNA              UNK 01-JAN-1980
DEFINITION  Mle-Panxα11 cDNA - ML25999a.
ACCESSION   Mle-Panxα11
VERSION     Mle-Panxα11
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     CpG_island      182..419
                     /created_by="SeqBuddy"
     CpG_island      641..700
                     /created_by="SeqBuddy"
     CpG_island      896..898
                     /created_by="SeqBuddy"
     CpG_island      988..1197
                     /created_by="SeqBuddy"
ORIGIN
        1 atgctgatct cgagcttagt tcagttcagc aggttatctc cttttaagga gataactata
       61 gatgacgggt gggaccaact taacaggagt ttcatgttcg ttctgatggt tatctgtgga
      121 actatcgtca ctgtccgaca acatacaggt aacatcatct cgtgtaacgg tttcacaaaa
      181 tacgacggat ccttctccga ggactactgc tggacgcagg gactctacac gatcagggag
      241 gcgtaccacg tgagcgacgt caacgtccct tatcccggag ttatcccgga ggagatccca
      301 ctctgtctag gagacaattg tgataagcta gcaaacagca acaccactcg agtgtatcat
      361 ctgtggtacc agtggatccc cttctacttc tggctcgctt ccgccgcctt cttcctccct
      421 tatctgatct acaagagata cggatttgga gatatcaagc ctctgatcca catgctgtac
      481 aatcctctcg acggggacga aggagtgaag gcagattcgg agaaggcctc aatctggctt
      541 tatcacagat tctctatcta catgaacgag cattccatgt acgccaactt tatggagaga
      601 cacggaatcg gcattctcgt tatcgctatc aaggtgatgt acctgatcat ctccgtccta
      661 ctcatggtca tgaccgccat gatgttcgag ctggctgact tcaagcagta cggtattgtg
      721 tgggcccaac agtggcctga ccctcctgcc aatgtcacag gaatcaagga cctgctcttc
      781 cccaagatgg ttgcttgcga gatcaagaga tggggaccta ctggtctgga ggacgagaac
      841 ggaatgtgtg tcctggcccc caacgtcatc aaccagtaca tattcctcat cctctggtgg
      901 gcccttgttt tcaccattgt ctctaacgtt ttcaacgtac tggctggagt tataagaatc
      961 gtcttcatct atggttctta ccgccggatg ttggctagcg ctttcctcag agatgatcct
     1021 cattacaaga aggtctacta caagatcggc acctccggtc gggttatcct gaacatgctg
     1081 gcagcctcca tctctccgac ctgcttccag gagatcatga acaacgtctg tccgcgtctc
     1141 atccgggccc acgtctccaa gaagggacga aacctgggcg acgaccccct gttgtag
//
