SB Prepend organism
Glean organism names from richly annotated formats, like GenBank or EMBL, and create a prefix that is attached to the gene id.
If there is a naming conflict because more than one species has the same prefix, the prefix for each of those species is further extended with an integer.
Optional. Specify the length of the new prefix prepended to each ID (default = 4). This is the maximum length; a prefix may be shorter if the species name is not long enough, or longer if an integer is appended to resolve a conflict.
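The prefix rules described above can be sketched in a few lines of Python. This is only an illustration inferred from the example output on this page (first letter of the genus plus the leading letters of the species, with integers added on conflict); it is not SeqBuddy's actual implementation, and the function name `build_prefixes` is hypothetical.

```python
from collections import defaultdict


def build_prefixes(organisms, length=4):
    """Map each organism name to a short prefix (hypothetical helper).

    Assumed rule: genus initial + first (length - 1) letters of the species.
    Names that share a prefix get 1, 2, ... appended in first-seen order.
    """
    base = {}
    for name in dict.fromkeys(organisms):  # de-duplicate, keep first-seen order
        parts = name.split()
        if len(parts) >= 2:
            prefix = parts[0][:1].upper() + parts[1][:length - 1].lower()
        else:
            # Single-word names: fall back to the leading characters.
            prefix = name[:length].capitalize()
        base[name] = prefix

    # Group names by prefix so conflicts can be detected.
    groups = defaultdict(list)
    for name, prefix in base.items():
        groups[prefix].append(name)

    result = {}
    for prefix, names in groups.items():
        if len(names) == 1:
            result[names[0]] = prefix
        else:
            for number, name in enumerate(names, start=1):
                result[name] = f"{prefix}{number}"
    return result


names = ["Cuculus canorus", "Salmo salar", "Castor canadensis"]
print(build_prefixes(names))            # conflicting prefixes get integers
print(build_prefixes(names, length=5))  # longer prefixes avoid the conflict
```

Under these assumptions, two species that collide at one length can often be separated simply by asking for a longer prefix.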
LOCUS XP_009563828 180 aa linear VRT 14-OCT-2014
DEFINITION PREDICTED: gastric intrinsic factor [Cuculus canorus].
ACCESSION XP_009563828
VERSION XP_009563828.1
DBLINK BioProject: PRJNA263299
KEYWORDS RefSeq; includes ab initio.
SOURCE Cuculus canorus (common cuckoo)
ORGANISM Cuculus canorus
.
COMMENT
FEATURES Location/Qualifiers
ORIGIN
1 mtlsvafsig vllalmggta ghecvashhl vskllqrmee sinvdekpnp sillamnlag
61 dtdkenhkll lhqmkeeavn taekymssge valyvlalls scenpqqvha lsqtvdlisi
121 lqkktdeevt sldvdgvpkt slfsvsldil glclanvggy qeasvalakk mldpeerrrs
//
LOCUS XP_014016718 180 aa linear VRT 21-SEP-2015
DEFINITION PREDICTED: transcobalamin-2-like [Salmo salar].
ACCESSION XP_014016718
VERSION XP_014016718.1
DBLINK BioProject: PRJNA287919
KEYWORDS RefSeq.
SOURCE Salmo salar (Atlantic salmon)
ORGANISM Salmo salar
.
COMMENT
FEATURES Location/Qualifiers
ORIGIN
1 mytlyivsgl lalvaskpcd pvgsepgell lslnknllrs legegtspnp svhlalrlst
61 hhnlgmesdh lnalktylhn diesslvnnq pvvgllalyt lalkascydl ntltftvnqr
121 setllthlkr qmeleknhia fsqrpltnyy qyslgvlalc vsgvrvnahv snklirvveh
//
LOCUS XP_020035586 180 aa linear ROD 15-FEB-2017
DEFINITION transcobalamin-2 [Castor canadensis].
ACCESSION XP_020035586
VERSION XP_020035586.1
DBLINK BioProject: PRJNA371604
KEYWORDS RefSeq.
SOURCE Castor canadensis (American beaver)
ORGANISM Castor canadensis
.
COMMENT
FEATURES Location/Qualifiers
ORIGIN
1 mghlgaflfl lgtlgavadi cespqadsqv vkklgqrllp wldrvspehl npslylglrl
61 sslqagaked lylhglkldy qqcllrsddd ndnsecqtrp smgqlavyll alrancefvg
121 grkgdklvsq lkwfledekk aigndhsgqp hssyyqygls ilalcvhqkr vhdsvvgkll
//
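To make explicit where the organism names come from, here is a minimal Python sketch that pulls the VERSION and ORGANISM fields out of GenBank text like the records above. It assumes the name is taken from the ORGANISM line and uses plain regular expressions; SeqBuddy itself may read these annotations differently, and `organisms_by_accession` is a hypothetical name.

```python
import re


def organisms_by_accession(genbank_text):
    """Return (version, organism) pairs, one per GenBank record (sketch only)."""
    pairs = []
    # Records are terminated by a line containing only "//".
    for record in genbank_text.split("\n//"):
        version = re.search(r"^\s*VERSION\s+(\S+)", record, re.MULTILINE)
        organism = re.search(r"^\s*ORGANISM\s+(.+?)\s*$", record, re.MULTILINE)
        if version and organism:
            pairs.append((version.group(1), organism.group(1)))
    return pairs


sample = """LOCUS XP_009563828 180 aa linear VRT 14-OCT-2014
VERSION XP_009563828.1
SOURCE Cuculus canorus (common cuckoo)
  ORGANISM Cuculus canorus
ORIGIN
//
LOCUS XP_014016718 180 aa linear VRT 21-SEP-2015
VERSION XP_014016718.1
SOURCE Salmo salar (Atlantic salmon)
  ORGANISM Salmo salar
ORIGIN
//
"""

for version, organism in organisms_by_accession(sample):
    print(version, organism)
```

For real files, a dedicated GenBank parser is a safer choice than regular expressions, since it handles multi-line fields and feature tables.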
Prefix organism identifiers to the accessions in these GenBank records. Note that they are converted to FASTA with the -o flag for clarity. The conversion is not required, although the longer IDs may break the GenBank specification for accession length if the result is written back to GenBank.
$: sb transcobalamin.gb -ppo -o fasta
# ######################## Prefix Mapping ######################## #
Ccan1: Cuculus canorus
Ccan2: Castor canadensis
Ssal: Salmo salar
# ################################################################ #
>Ccan1-XP_009563828.1 PREDICTED: gastric intrinsic factor [Cuculus canorus]
MTLSVAFSIGVLLALMGGTAGHECVASHHLVSKLLQRMEESINVDEKPNPSILLAMNLAG
DTDKENHKLLLHQMKEEAVNTAEKYMSSGEVALYVLALLSSCENPQQVHALSQTVDLISI
LQKKTDEEVTSLDVDGVPKTSLFSVSLDILGLCLANVGGYQEASVALAKKMLDPEERRRS
>Ssal-XP_014016718.1 PREDICTED: transcobalamin-2-like [Salmo salar]
MYTLYIVSGLLALVASKPCDPVGSEPGELLLSLNKNLLRSLEGEGTSPNPSVHLALRLST
HHNLGMESDHLNALKTYLHNDIESSLVNNQPVVGLLALYTLALKASCYDLNTLTFTVNQR
SETLLTHLKRQMELEKNHIAFSQRPLTNYYQYSLGVLALCVSGVRVNAHVSNKLIRVVEH
>Ccan2-XP_020035586.1 transcobalamin-2 [Castor canadensis]
MGHLGAFLFLLGTLGAVADICESPQADSQVVKKLGQRLLPWLDRVSPEHLNPSLYLGLRL
SSLQAGAKEDLYLHGLKLDYQQCLLRSDDDNDNSECQTRPSMGQLAVYLLALRANCEFVG
GRKGDKLVSQLKWFLEDEKKAIGNDHSGQPHSSYYQYGLSILALCVHQKRVHDSVVGKLL
In the previous example, there was a conflict between Cuculus canorus and Castor canadensis, so both prefixes were extended with an integer. If you prefer, you can make the prefix longer to avoid the numeric suffixes.
$: sb transcobalamin.gb -ppo 5 -o fasta
# ######################## Prefix Mapping ######################## #
Ccana: Castor canadensis
Ccano: Cuculus canorus
Ssala: Salmo salar
# ################################################################ #
>Ccano-XP_009563828.1 PREDICTED: gastric intrinsic factor [Cuculus canorus]
MTLSVAFSIGVLLALMGGTAGHECVASHHLVSKLLQRMEESINVDEKPNPSILLAMNLAG
DTDKENHKLLLHQMKEEAVNTAEKYMSSGEVALYVLALLSSCENPQQVHALSQTVDLISI
LQKKTDEEVTSLDVDGVPKTSLFSVSLDILGLCLANVGGYQEASVALAKKMLDPEERRRS
>Ssala-XP_014016718.1 PREDICTED: transcobalamin-2-like [Salmo salar]
MYTLYIVSGLLALVASKPCDPVGSEPGELLLSLNKNLLRSLEGEGTSPNPSVHLALRLST
HHNLGMESDHLNALKTYLHNDIESSLVNNQPVVGLLALYTLALKASCYDLNTLTFTVNQR
SETLLTHLKRQMELEKNHIAFSQRPLTNYYQYSLGVLALCVSGVRVNAHVSNKLIRVVEH
>Ccana-XP_020035586.1 transcobalamin-2 [Castor canadensis]
MGHLGAFLFLLGTLGAVADICESPQADSQVVKKLGQRLLPWLDRVSPEHLNPSLYLGLRL
SSLQAGAKEDLYLHGLKLDYQQCLLRSDDDNDNSECQTRPSMGQLAVYLLALRANCEFVG
GRKGDKLVSQLKWFLEDEKKAIGNDHSGQPHSSYYQYGLSILALCVHQKRVHDSVVGKLL