SB Prepend organism

Steve Bond edited this page Oct 24, 2017 · 3 revisions

--prepend_organism, -ppo

Implemented in version 1.3

Description

Glean organism names from richly annotated formats, like GenBank or EMBL, and create a prefix that is attached to the gene id.
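As an illustration of where the organism name lives in a GenBank record, the following sketch pulls the ORGANISM line and the VERSION identifier out of each record using only the standard library. The function name `organisms_by_record` is hypothetical and this is not how the tool itself reads annotations; it only assumes the flat-file layout shown in the example input further down the page.

```python
import re


def organisms_by_record(genbank_text):
    """Return {version_id: organism_name} for a GenBank flat-file string.

    Hypothetical helper for illustration only; real parsers handle many
    more edge cases (multi-line organism names, missing fields, etc.).
    """
    result = {}
    # Records in a GenBank flat file are terminated by a line holding '//'
    for record in genbank_text.split("\n//"):
        version = re.search(r"^VERSION[ \t]+(\S+)", record, re.MULTILINE)
        organism = re.search(r"^[ \t]+ORGANISM[ \t]+(.+)$", record, re.MULTILINE)
        if version and organism:
            result[version.group(1)] = organism.group(1).strip()
    return result
```

Records that lack either field are skipped, which mirrors the idea that only richly annotated formats carry enough information to build a prefix.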

If there is a naming conflict because more than one species has the same prefix, then the prefix for each of those species will be further extended with an integer.

Argument

Prefix length ( int )

Optional. Specify the length of the new prefix prepended to each ID (default = 4). This is the maximum length: a prefix may be shorter if the species name is not long enough, or longer if an integer is appended to resolve a conflict.
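The prefix rules described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name `build_prefixes` is hypothetical, and the scheme (first letter of the genus plus the leading letters of the species, with integers assigned in order of first appearance) is an assumption inferred from the example output on this page.

```python
def build_prefixes(organisms, length=4):
    """Map each organism name to a short prefix (hypothetical helper).

    Assumed scheme: capitalized first letter of the genus followed by the
    first (length - 1) letters of the species. Shared prefixes are extended
    with 1, 2, ... in order of first appearance.
    """
    base = {}
    for name in organisms:
        parts = name.split()
        if not parts or name in base:
            continue
        genus = parts[0]
        species = parts[1] if len(parts) > 1 else ""
        # Slicing past the end of a short species name simply yields a
        # shorter prefix, so 'length' acts as a maximum.
        base[name] = genus[:1].upper() + species[:length - 1].lower()

    # Group organisms that ended up with the same prefix
    groups = {}
    for name, prefix in base.items():
        groups.setdefault(prefix, []).append(name)

    mapping = {}
    for prefix, names in groups.items():
        if len(names) == 1:
            mapping[names[0]] = prefix
        else:
            # Conflict: extend the prefix for every species that shares it
            for i, name in enumerate(names, start=1):
                mapping[name] = f"{prefix}{i}"
    return mapping
```

Under these assumptions, calling `build_prefixes(["Cuculus canorus", "Salmo salar", "Castor canadensis"])` gives `Ccan1`, `Ssal`, and `Ccan2`, while passing `length=5` gives `Ccano`, `Ssala`, and `Ccana` with no integers needed.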

Examples

Input file: transcobalamin.gb

LOCUS       XP_009563828             180 aa            linear   VRT 14-OCT-2014
DEFINITION  PREDICTED: gastric intrinsic factor [Cuculus canorus].
ACCESSION   XP_009563828
VERSION     XP_009563828.1
DBLINK      BioProject: PRJNA263299
KEYWORDS    RefSeq; includes ab initio.
SOURCE      Cuculus canorus (common cuckoo)
  ORGANISM  Cuculus canorus
            .
COMMENT
FEATURES             Location/Qualifiers
ORIGIN
        1 mtlsvafsig vllalmggta ghecvashhl vskllqrmee sinvdekpnp sillamnlag
       61 dtdkenhkll lhqmkeeavn taekymssge valyvlalls scenpqqvha lsqtvdlisi
      121 lqkktdeevt sldvdgvpkt slfsvsldil glclanvggy qeasvalakk mldpeerrrs
//
LOCUS       XP_014016718             180 aa            linear   VRT 21-SEP-2015
DEFINITION  PREDICTED: transcobalamin-2-like [Salmo salar].
ACCESSION   XP_014016718
VERSION     XP_014016718.1
DBLINK      BioProject: PRJNA287919
KEYWORDS    RefSeq.
SOURCE      Salmo salar (Atlantic salmon)
  ORGANISM  Salmo salar
            .
COMMENT
FEATURES             Location/Qualifiers
ORIGIN
        1 mytlyivsgl lalvaskpcd pvgsepgell lslnknllrs legegtspnp svhlalrlst
       61 hhnlgmesdh lnalktylhn diesslvnnq pvvgllalyt lalkascydl ntltftvnqr
      121 setllthlkr qmeleknhia fsqrpltnyy qyslgvlalc vsgvrvnahv snklirvveh
//
LOCUS       XP_020035586             180 aa            linear   ROD 15-FEB-2017
DEFINITION  transcobalamin-2 [Castor canadensis].
ACCESSION   XP_020035586
VERSION     XP_020035586.1
DBLINK      BioProject: PRJNA371604
KEYWORDS    RefSeq.
SOURCE      Castor canadensis (American beaver)
  ORGANISM  Castor canadensis
            .
COMMENT
FEATURES             Location/Qualifiers
ORIGIN
        1 mghlgaflfl lgtlgavadi cespqadsqv vkklgqrllp wldrvspehl npslylglrl
       61 sslqagaked lylhglkldy qqcllrsddd ndnsecqtrp smgqlavyll alrancefvg
      121 grkgdklvsq lkwfledekk aigndhsgqp hssyyqygls ilalcvhqkr vhdsvvgkll
//

Usage example 1

Prefix organism identifiers to the accessions in the following GenBank records. Note that the records are converted to FASTA with the -o flag for clarity. The conversion is not required, although the GenBank specification for accession length may be violated if the result is written back to GenBank.

$: sb transcobalamin.gb -ppo -o fasta

Output

# ######################## Prefix Mapping ######################## #
Ccan1: Cuculus canorus
Ccan2: Castor canadensis
Ssal: Salmo salar
# ################################################################ #

>Ccan1-XP_009563828.1 PREDICTED: gastric intrinsic factor [Cuculus canorus]
MTLSVAFSIGVLLALMGGTAGHECVASHHLVSKLLQRMEESINVDEKPNPSILLAMNLAG
DTDKENHKLLLHQMKEEAVNTAEKYMSSGEVALYVLALLSSCENPQQVHALSQTVDLISI
LQKKTDEEVTSLDVDGVPKTSLFSVSLDILGLCLANVGGYQEASVALAKKMLDPEERRRS

>Ssal-XP_014016718.1 PREDICTED: transcobalamin-2-like [Salmo salar]
MYTLYIVSGLLALVASKPCDPVGSEPGELLLSLNKNLLRSLEGEGTSPNPSVHLALRLST
HHNLGMESDHLNALKTYLHNDIESSLVNNQPVVGLLALYTLALKASCYDLNTLTFTVNQR
SETLLTHLKRQMELEKNHIAFSQRPLTNYYQYSLGVLALCVSGVRVNAHVSNKLIRVVEH

>Ccan2-XP_020035586.1 transcobalamin-2 [Castor canadensis]
MGHLGAFLFLLGTLGAVADICESPQADSQVVKKLGQRLLPWLDRVSPEHLNPSLYLGLRL
SSLQAGAKEDLYLHGLKLDYQQCLLRSDDDNDNSECQTRPSMGQLAVYLLALRANCEFVG
GRKGDKLVSQLKWFLEDEKKAIGNDHSGQPHSSYYQYGLSILALCVHQKRVHDSVVGKLL

Usage example 2

In the previous example, there was a conflict between Cuculus canorus and Castor canadensis. If you prefer to avoid the appended integers, specify a longer prefix.

$: sb transcobalamin.gb -ppo 5 -o fasta

Output

# ######################## Prefix Mapping ######################## #
Ccana: Castor canadensis
Ccano: Cuculus canorus
Ssala: Salmo salar
# ################################################################ #

>Ccano-XP_009563828.1 PREDICTED: gastric intrinsic factor [Cuculus canorus]
MTLSVAFSIGVLLALMGGTAGHECVASHHLVSKLLQRMEESINVDEKPNPSILLAMNLAG
DTDKENHKLLLHQMKEEAVNTAEKYMSSGEVALYVLALLSSCENPQQVHALSQTVDLISI
LQKKTDEEVTSLDVDGVPKTSLFSVSLDILGLCLANVGGYQEASVALAKKMLDPEERRRS

>Ssala-XP_014016718.1 PREDICTED: transcobalamin-2-like [Salmo salar]
MYTLYIVSGLLALVASKPCDPVGSEPGELLLSLNKNLLRSLEGEGTSPNPSVHLALRLST
HHNLGMESDHLNALKTYLHNDIESSLVNNQPVVGLLALYTLALKASCYDLNTLTFTVNQR
SETLLTHLKRQMELEKNHIAFSQRPLTNYYQYSLGVLALCVSGVRVNAHVSNKLIRVVEH

>Ccana-XP_020035586.1 transcobalamin-2 [Castor canadensis]
MGHLGAFLFLLGTLGAVADICESPQADSQVVKKLGQRLLPWLDRVSPEHLNPSLYLGLRL
SSLQAGAKEDLYLHGLKLDYQQCLLRSDDDNDNSECQTRPSMGQLAVYLLALRANCEFVG
GRKGDKLVSQLKWFLEDEKKAIGNDHSGQPHSSYYQYGLSILALCVHQKRVHDSVVGKLL
