Add command to expand xrefs section in GFF3 files #483

Open
cmungall opened this issue Jan 4, 2024 · 2 comments

cmungall commented Jan 4, 2024

I am 97% sure this is out of scope for sssom-py and that it should be either its own tool or part of a general GFF package. But this seems like a good place to seed the idea.

GFF allows various kinds of annotations in column 9, many of which are CURIEs. It's often useful to expand these, e.g. a gene annotated with an EC number by prokka could be expanded to a GO annotation using a GO SSSOM file.
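
As a rough illustration of the expansion described above, here is a minimal sketch in plain Python. It does not use the sssom-py API; it only assumes an SSSOM TSV with the standard `subject_id`, `predicate_id` and `object_id` columns and `#`-prefixed metadata lines. The function names `load_sssom_index` and `expand_xrefs` are hypothetical, and the GO identifier in the sample mapping is a placeholder, not a real EC-to-GO mapping.

```python
import csv
import io
from collections import defaultdict


def load_sssom_index(handle):
    """Index an SSSOM TSV as subject_id -> list of object_id.

    Lines starting with '#' (the embedded metadata block) are skipped.
    """
    rows = (line for line in handle if not line.startswith("#"))
    index = defaultdict(list)
    for row in csv.DictReader(rows, delimiter="\t"):
        index[row["subject_id"]].append(row["object_id"])
    return index


def expand_xrefs(curies, index):
    """Return the mapped CURIEs for the given CURIEs, without duplicates."""
    expanded = []
    for curie in curies:
        for target in index.get(curie, []):
            if target not in expanded:
                expanded.append(target)
    return expanded


# Illustrative SSSOM fragment. GO:0000000 is a made-up placeholder object.
EXAMPLE_SSSOM = (
    "# illustrative metadata line\n"
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "EC:3.2.2.21\tskos:exactMatch\tGO:0000000\tsemapv:ManualMappingCuration\n"
)

index = load_sssom_index(io.StringIO(EXAMPLE_SSSOM))
go_terms = expand_xrefs(["EC:3.2.2.21", "EC:6.1.1.-"], index)
print(go_terms)
```

A real command would also need to decide which predicates to follow and how to write the expanded terms back into column 9; both are left open here.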

cthoyt commented Jan 5, 2024

Can you link to an example of a GFF file please?

cmungall commented Jan 5, 2024

Here are the first few lines of the output of prokka run on a metagenomic sample (downloaded from here in NMDC).

Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     18      563     72.59   -       0       ID=Ga0495479_0000001_18_563;translation_table=11;start_type=ATG;product=5-methylcytosine-specific restriction endonuclease McrA;product_source=COG1403;cog=COG1403;pfam=PF14279
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     692     1357    88.17   -       0       ID=Ga0495479_0000001_692_1357;translation_table=11;start_type=ATG;product=phospholipase/carboxylesterase;product_source=KO:K06999;cath_funfam=3.40.50.1820;cog=COG0400;ko=KO:K06999;pfam=PF02230;superfamily=53474
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     1415    2068    95.20   -       0       ID=Ga0495479_0000001_1415_2068;translation_table=11;start_type=ATG;product=DNA-3-methyladenine glycosylase II;product_source=KO:K01247;cath_funfam=1.10.1670.10,1.10.340.30;cog=COG0122;ko=KO:K01247;ec_number=EC:3.2.2.21;pfam=PF00730;smart=SM00478;superfamily=48150
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     2223    3116    110.08  +       0       ID=Ga0495479_0000001_2223_3116;translation_table=11;start_type=ATG;product=glutamyl-Q tRNA(Asp) synthetase;product_source=KO:K01894;cath_funfam=3.40.50.620;cog=COG0008;ko=KO:K01894;ec_number=EC:6.1.1.-;pfam=PF00749;superfamily=52374
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     3293    4492    183.16  +       0       ID=Ga0495479_0000001_3293_4492;translation_table=11;start_type=ATG;product=CheY-like chemotaxis protein;product_source=COG0784;cath_funfam=1.10.287.130,3.30.565.10,3.40.50.2300;cog=COG0784;pfam=PF00072,PF00512,PF02518;smart=SM00387,SM00388,SM00448;superfamily=47384,55874
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     4632    6602    342.80  -       0       ID=Ga0495479_0000001_4632_6602;translation_table=11;start_type=ATG;product=(2R)-ethylmalonyl-CoA mutase;product_source=KO:K14447;cath_funfam=3.20.20.240,3.40.50.280;cog=COG1884,COG2185;ko=KO:K14447;pfam=PF01642,PF02310;superfamily=51703,52242;tigrfam=TIGR00640,TIGR00641
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     6630    6881    34.32   -       0       ID=Ga0495479_0000001_6630_6881;translation_table=11;start_type=ATG;product=uncharacterized membrane protein YeaQ/YmgE (transglycosylase-associated protein family);product_source=COG2261;cog=COG2261;pfam=PF04226
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     7044    7304    41.09   -       0       ID=Ga0495479_0000001_7044_7304;translation_table=11;start_type=GTG;product=uncharacterized membrane protein YeaQ/YmgE (transglycosylase-associated protein family);product_source=COG2261;cog=COG2261;pfam=PF04226

GFF doesn't have a particularly formal way of ensuring identifiers are unambiguous. In some flavours of GFF you will see bona fide CURIEs; in others the prefix is only implicit from the key (e.g. cog, pfam, ec_number, ...). See this preprint for recommendations on improving this situation.
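
To make the "implicit from the key" case concrete, here is a hedged sketch of how column 9 of the records above might be normalised into CURIEs. The `KEY_TO_PREFIX` table and both function names are assumptions made for illustration; the right prefix for each key would need to be agreed on (for example against a prefix registry), and percent-encoded values as allowed by GFF3 are not handled.

```python
# Hypothetical key -> prefix table; these prefixes are assumptions.
KEY_TO_PREFIX = {
    "cog": "COG",
    "ko": "KO",
    "ec_number": "EC",
    "pfam": "PFAM",
}


def parse_attributes(column9):
    """Split a GFF column 9 string into a dict of key -> raw value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = value
    return attrs


def attribute_curies(column9):
    """Collect CURIEs from the keys listed in KEY_TO_PREFIX.

    Values that already carry the prefix (e.g. KO:K01247) are kept as-is;
    bare values (e.g. PF00730) get the prefix implied by their key.
    """
    curies = []
    for key, raw in parse_attributes(column9).items():
        prefix = KEY_TO_PREFIX.get(key)
        if prefix is None:
            continue
        for value in raw.split(","):
            if value.startswith(prefix + ":"):
                curies.append(value)
            else:
                curies.append(f"{prefix}:{value}")
    return curies


# Column 9 of the third record above, abridged.
COL9 = (
    "ID=Ga0495479_0000001_1415_2068;product=DNA-3-methyladenine glycosylase II;"
    "cog=COG0122;ko=KO:K01247;ec_number=EC:3.2.2.21;pfam=PF00730"
)
curies = attribute_curies(COL9)
print(curies)
```

Only keys in the table are split on commas, so free-text attributes such as `product` are left untouched.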

Now that I look at the prokka file again, I see that it's not even using the recommended Ontology_term attribute, so this is looking more like a bespoke GFF tool that takes multiple idiosyncrasies into account, which is definitely outside the scope of sssom-py.
