Fixing gene and transcript missing features from tsebra.gtf for AGAT #288

Closed
romseg opened this issue Oct 12, 2022 · 8 comments · Fixed by #291

Comments

@romseg

romseg commented Oct 12, 2022

Dear author @Juke34,

I have used agat_sp_flag_short_introns.pl on my .gtf file, and I noticed that the output .gff3 file generated by the script by default contains additional gene and mRNA lines that are not present in the original gtf file. Is this expected behavior? I would appreciate your comments. Thank you.

The command I used:

agat_sp_flag_short_introns.pl --gff tsebra.gtf --out tsebra_flag.gff3

The input gtf file:

scaffold_1      AUGUSTUS        gene    3211    5134    .       +       .       g_19721
scaffold_1      AUGUSTUS        transcript      3211    5134    0.7     +       .       anno1.g20064.t1
scaffold_1      AUGUSTUS        start_codon     3211    3213    .       +       0       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        CDS     3211    3288    0.79    +       0       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        exon    3211    3288    .       +       .       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        intron  3289    4205    1       +       .       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        CDS     4206    4292    1       +       0       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        exon    4206    4292    .       +       .       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        intron  4293    5095    1       +       .       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        CDS     5096    5134    0.87    +       0       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        exon    5096    5134    .       +       .       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        stop_codon      5132    5134    .       +       0       transcript_id "anno1.g20064.t1"; gene_id "g_19721";
scaffold_1      AUGUSTUS        gene    11256   13613   .       +       .       g_19722
scaffold_1      AUGUSTUS        transcript      11256   13613   0.72    +       .       anno1.g20065.t1
scaffold_1      AUGUSTUS        CDS     11256   11258   0.75    +       0       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        exon    11256   11258   .       +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        start_codon     11256   11258   .       +       0       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        intron  11259   12280   1       +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        CDS     12281   12320   0.95    +       0       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        exon    12281   12320   .       +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        intron  12321   12382   0.95    +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        CDS     12383   12429   0.96    +       2       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        exon    12383   12429   .       +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        intron  12430   12615   1       +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        CDS     12616   12744   0.96    +       0       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        exon    12616   12744   .       +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        intron  12745   13595   1       +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        CDS     13596   13613   1       +       0       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        exon    13596   13613   .       +       .       transcript_id "anno1.g20065.t1"; gene_id "g_19722";
scaffold_1      AUGUSTUS        stop_codon      13611   13613   .       +       0       transcript_id "anno1.g20065.t1"; gene_id "g_19722";

The resulting output gff3 file:

##gff-version 3
scaffold_1      AUGUSTUS        gene    3211    5134    0.79    +       .       ID=g_19721;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        mRNA    3211    5134    0.79    +       .       ID=anno1.g20064.t1;Parent=g_19721;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        exon    3211    3288    .       +       .       ID=exon-1;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        exon    4206    4292    .       +       .       ID=exon-2;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        exon    5096    5134    .       +       .       ID=exon-3;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        CDS     3211    3288    0.79    +       0       ID=cds-1;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        CDS     4206    4292    1       +       0       ID=cds-2;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        CDS     5096    5134    0.87    +       0       ID=cds-3;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        intron  3289    4205    1       +       .       ID=intron-1;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        intron  4293    5095    1       +       .       ID=intron-2;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        start_codon     3211    3213    .       +       0       ID=start_codon-1;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        stop_codon      5132    5134    .       +       0       ID=stop_codon-1;Parent=anno1.g20064.t1;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        gene    3211    5134    .       +       .       ID=gene-1
scaffold_1      AUGUSTUS        transcript      3211    5134    0.7     +       .       ID=transcript-1;Parent=gene-1
scaffold_1      AUGUSTUS        gene    11256   13613   0.75    +       .       ID=g_19722;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        mRNA    11256   13613   0.75    +       .       ID=anno1.g20065.t1;Parent=g_19722;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        exon    11256   11258   .       +       .       ID=exon-4;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        exon    12281   12320   .       +       .       ID=exon-5;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        exon    12383   12429   .       +       .       ID=exon-6;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        exon    12616   12744   .       +       .       ID=exon-7;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        exon    13596   13613   .       +       .       ID=exon-8;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        CDS     11256   11258   0.75    +       0       ID=cds-4;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        CDS     12281   12320   0.95    +       0       ID=cds-5;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        CDS     12383   12429   0.96    +       2       ID=cds-6;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        CDS     12616   12744   0.96    +       0       ID=cds-7;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        CDS     13596   13613   1       +       0       ID=cds-8;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        intron  11259   12280   1       +       .       ID=intron-3;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        intron  12321   12382   0.95    +       .       ID=intron-4;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        intron  12430   12615   1       +       .       ID=intron-5;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        intron  12745   13595   1       +       .       ID=intron-6;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        start_codon     11256   11258   .       +       0       ID=start_codon-2;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        stop_codon      13611   13613   .       +       0       ID=stop_codon-2;Parent=anno1.g20065.t1;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        gene    11256   13613   .       +       .       ID=gene-2
scaffold_1      AUGUSTUS        transcript      11256   13613   0.72    +       .       ID=transcript-2;Parent=gene-2

Lines present in the output gff3 but missing in original gtf:

scaffold_1      AUGUSTUS        gene    3211    5134    0.79    +       .       ID=g_19721;gene_id=g_19721;transcript_id=anno1.g20064.t1
scaffold_1      AUGUSTUS        mRNA    3211    5134    0.79    +       .       ID=anno1.g20064.t1;Parent=g_19721;gene_id=g_19721;transcript_id=anno1.g20064.t1

scaffold_1      AUGUSTUS        gene    11256   13613   0.75    +       .       ID=g_19722;gene_id=g_19722;transcript_id=anno1.g20065.t1
scaffold_1      AUGUSTUS        mRNA    11256   13613   0.75    +       .       ID=anno1.g20065.t1;Parent=g_19722;gene_id=g_19722;transcript_id=anno1.g20065.t1

Best,
Rom

@Juke34
Collaborator

Juke34 commented Oct 12, 2022

These lines are supposed to replace the badly formatted ones:

scaffold_1      AUGUSTUS        gene    3211    5134    .       +       .       g_19721
scaffold_1      AUGUSTUS        transcript      3211    5134    0.7     +       .       anno1.g20064.t1
scaffold_1      AUGUSTUS        gene    11256   13613   .       +       .       g_19722
scaffold_1      AUGUSTUS        transcript      11256   13613   0.72    +       .       anno1.g20065.t1

Those are supposed to be dropped. I don't get why they are still present...

What version of AGAT are you using?

@romseg
Author

romseg commented Oct 12, 2022

Hi Jacques,

The version of AGAT that I am using is v0.8.0. I installed it with singularity.

The error report is the following:

10/11/2022 at 20h21m35s

usage: /usr/local/bin/agat_sp_flag_short_introns.pl --gff tsebra.gtf --out tsebra_flag.gff3

Reading tsebra.gtf
********************************************************************************
*                              - Start parsing -                               *
********************************************************************************
-------------------------- parse options and metadata --------------------------
=> Accessing the feature level json files
        Using standard /usr/local/lib/site_perl/5.26.2/auto/share/dist/AGAT/features_level1.json file
        Using standard /usr/local/lib/site_perl/5.26.2/auto/share/dist/AGAT/features_level2.json file
        Using standard /usr/local/lib/site_perl/5.26.2/auto/share/dist/AGAT/features_level3.json file
        Using standard /usr/local/lib/site_perl/5.26.2/auto/share/dist/AGAT/features_spread.json file
=> Attribute used to group features when no Parent/ID relationship exists:
        * locus_tag
        * gene_id
=> merge_loci option deactivated
=> Accessing Ontology
        No ontology accessible from the gff file header!
        We use the SOFA ontology distributed with AGAT:
                /usr/local/lib/site_perl/5.26.2/auto/share/dist/AGAT/so.obo
        Read ontology /usr/local/lib/site_perl/5.26.2/auto/share/dist/AGAT/so.obo:
                4 root terms, and 2472 total terms, and 1436 leaf terms
        Filtering ontology:
                We found 1757 terms that are sequence_feature or is_a child of it.
-------------------------------- parse features --------------------------------
There is a problem we found several formats in this file:
1,2
Let's see what we can do...
=> GFF version parser used: 2
[Tue Oct 11 20:21:41 2022] Parsing:   0%
^M                                                                                ^Mgff3 reader error level1: No ID attribute found @ for the feature: scaffold_1       AUGUSTUS
        gene    3211    5134    .       +       .       
[Tue Oct 11 20:21:41 2022] Parsing:   0%
^M                                                                                ^Mgff3 reader error level2: No ID attribute found @ for the feature: scaffold_1       AUGUSTUS
        transcript      3211    5134    0.7     +       .       
^M                                                                                ^MWARNING level2: No Parent attribute found @ for the feature: scaffold_1     AUGUSTUS        transcript      3211    5134    0.7     +       .       ID "transcript-1" 
^M                                                                                ^MWARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the feature is:
scaffold_1      AUGUSTUS        transcript      3211    5134    0.7     +       .       ID "transcript-1" 
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
^M                                                                                ^Mgff3 reader error level1: No ID attribute found @ for the feature: scaffold_1       AUGUSTUS
        gene    11256   13613   .       +       .       
[Tue Oct 11 20:21:41 2022] Parsing:   0%
^M                                                                                ^Mgff3 reader error level2: No ID attribute found @ for the feature: scaffold_1       AUGUSTUS
        transcript      11256   13613   0.72    +       .       
^M                                                                                ^MWARNING level2: No Parent attribute found @ for the feature: scaffold_1     AUGUSTUS        transcript      11256   13613   0.72    +       .       ID "transcript-2" 
^M                                                                                ^MWARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the feature is:
scaffold_1      AUGUSTUS        transcript      11256   13613   0.72    +       .       ID "transcript-2" 
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%
[Tue Oct 11 20:21:41 2022] Parsing:   0%

I would be grateful to know your suggestions. Thank you.

Best regards,
Rom

@Juke34
Collaborator

Juke34 commented Oct 12, 2022

OK, I think this is an issue fixed in the most recent AGAT version. Please use the latest version.

@romseg
Author

romseg commented Oct 12, 2022

I used the latest version of AGAT v0.9.2 and the problem remains in exactly the same way.

It looks like the tsebra.gtf input files only have the IDs for gene and transcript features in column 9, but they are missing the gene_id and transcript_id tags. Do you think this is the problem? All the other features have the proper tags in column 9.

Same issue with the augustus.hint.gtf and braker.gtf files that are the default outputs of Braker. I am wondering how to fix this formatting issue, so I can use these files with AGAT. Thank you.
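
A quick way to check which lines are affected is to list the gene and transcript lines whose column 9 carries no tag at all. The sketch below assumes a tab-separated GTF; `find_bare_ids` is a hypothetical helper name and is not part of AGAT.

```shell
# Hypothetical helper: print line number, feature type and column 9 for every
# gene/transcript line whose attribute column is a bare ID (no gene_id,
# transcript_id or ID= tag). Assumes tab-separated input.
find_bare_ids() {
  awk -F'\t' '
    ($3 == "gene" || $3 == "transcript") && $9 !~ /(gene_id|transcript_id|ID=)/ {
      print FNR "\t" $3 "\t" $9
    }
  ' "$1"
}
```

Running `find_bare_ids tsebra.gtf | head` would show the first few offending lines; an empty result suggests the file already uses tagged attributes on those features.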

Rom

@Juke34
Collaborator

Juke34 commented Oct 13, 2022

Good catch, indeed AGAT has trouble reading tsebra files.
This is a bug. I will try to fix it.

For myself:
AGAT works with

4	AUGUSTUS	gene	16086	38972	0.01	+	.	ID=g2
4	AUGUSTUS	transcript	16086	38972	0.01	+	.	ID=g2.t1;Parent=g2

or

4	AUGUSTUS	gene	16086	38972	0.01	+	.	g2
4	AUGUSTUS	transcript	16086	38972	0.01	+	.	ID=g2.t1;Parent=g2

or

4	AUGUSTUS	gene	16086	38972	0.01	+	.	g2
4	AUGUSTUS	transcript	16086	38972	0.01	+	.	ID=g2.t1

But not with

4	AUGUSTUS	gene	16086	38972	0.01	+	.	g2
4	AUGUSTUS	transcript	16086	38972	0.01	+	.	g2.t1

or

4	AUGUSTUS	gene	16086	38972	0.01	+	.	g2
4	AUGUSTUS	transcript	16086	38972	0.01	+	.	g2.t1;Parent=g2

@Juke34
Collaborator

Juke34 commented Oct 13, 2022

For now, to quickly fix such a tsebra output file for AGAT, first run this awk command:

awk 'BEGIN{OFS="\t"}{if($3=="gene"){$9="gene_id "$9}; if($3=="transcript"){$9="transcript_id "$9}; print $0}' tsebra.gtf > tsebra_clean.gtf
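
Before rewriting a full annotation, the same logic can be tried on a couple of sample lines. This is only a sketch: the function name `prefix_bare_ids` is hypothetical, and unlike the one-liner it also sets tab as the input separator, which assumes the GTF is strictly tab-separated.

```shell
# Same logic as the one-liner above, wrapped in a function (hypothetical name).
# FS is set to tab as well, so column 9 is always the whole attribute field.
prefix_bare_ids() {
  awk 'BEGIN { FS = OFS = "\t" }
    $3 == "gene"       { $9 = "gene_id " $9 }
    $3 == "transcript" { $9 = "transcript_id " $9 }
    { print }
  ' "$1"
}

# Two sample lines in the tsebra style, written to a temporary file.
sample=$(mktemp)
printf 'scaffold_1\tAUGUSTUS\tgene\t3211\t5134\t.\t+\t.\tg_19721\n' > "$sample"
printf 'scaffold_1\tAUGUSTUS\ttranscript\t3211\t5134\t0.7\t+\t.\tanno1.g20064.t1\n' >> "$sample"

# Show only the rewritten attribute column.
prefix_bare_ids "$sample" | cut -f9
# prints:
#   gene_id g_19721
#   transcript_id anno1.g20064.t1
```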

@romseg
Author

romseg commented Oct 13, 2022

Hi Jacques,

That's great! I used the awk command you provided and fixed the problem! Now the output gff3 file from AGAT looks clean.

After running, the report file still throws the warning below for every gene, but the output file looks clean to me, so it seems AGAT handled it well.

[Thu Oct 13 12:05:11 2022] Parsing:   0%
^M                                                                                ^MWARNING level2: No Parent attribute found @ for the feature: scaffold_1     AUGUSTUS        transcript      3211    5134    0.7     +       .       ID "anno1.g20064.t1"  ; transcript_id "anno1.g20064.t1" 
^M                                                                                ^MWARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the feature is:
scaffold_1      AUGUSTUS        transcript      3211    5134    0.7     +       .       ID "anno1.g20064.t1"  ; transcript_id "anno1.g20064.t1" 
[Thu Oct 13 12:05:11 2022] Parsing:   0%

...

33676 warning messages: WARNING level2: No Parent attribute found 
33676 warning messages: WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus.
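
The warning says the transcript lines carry neither a Parent nor a locus tag, and the parser log lists gene_id as one of the attributes used for grouping. One possible variant of the fix is therefore to copy the ID of the preceding gene line onto each transcript line. This is an untested sketch: `add_gene_id_to_transcripts` is a hypothetical name, it assumes the original tab-separated file with bare IDs where every transcript line follows its own gene line, and whether it actually silences the warning in AGAT has not been verified here.

```shell
# Hypothetical helper: rewrite bare-ID gene/transcript lines as quoted GTF
# attributes, copying the most recent gene ID onto each transcript line.
# Run it on the ORIGINAL file (bare IDs), not on an already cleaned one.
add_gene_id_to_transcripts() {
  awk 'BEGIN { FS = OFS = "\t" }
    $3 == "gene"       { gene = $9; $9 = "gene_id \"" gene "\";" }
    $3 == "transcript" { $9 = "transcript_id \"" $9 "\"; gene_id \"" gene "\";" }
    { print }
  ' "$1"
}
```

Usage would mirror the one-liner, for example `add_gene_id_to_transcripts tsebra.gtf > tsebra_clean.gtf`.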

I also tested the fix with agat_sp_statistics.pl and it works very well. Before the fix, agat_sp_statistics.pl was producing double counts.
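
Such double counts can also be spotted without AGAT by looking for gene features that share the same coordinates in the output, like the g_19721 and gene-1 lines at 3211-5134 in the first output file. A rough sketch, assuming a tab-separated GFF3; `find_duplicate_genes` is a hypothetical name.

```shell
# Hypothetical helper: report gene features that share seqid, start, end and
# strand. Prints the number of copies followed by the shared coordinates.
find_duplicate_genes() {
  awk -F'\t' '
    $3 == "gene" { n[$1 FS $4 FS $5 FS $7]++ }
    END { for (k in n) if (n[k] > 1) print n[k] "\t" k }
  ' "$1"
}
```

For example, `find_duplicate_genes tsebra_flag.gff3` printing nothing would suggest each gene locus appears only once. Genuinely distinct genes with identical coordinates would also be reported, so hits need a manual look.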

By the way, I am changing the title of this issue to reflect more accurately what is being fixed here. Thanks a lot for your help!

Best regards,
Rom

@romseg romseg changed the title Features of agat_sp_flag_short_introns.pl output file Fixing gene and transcript missing features from tsebra.gtf for AGAT Oct 13, 2022
@Juke34
Collaborator

Juke34 commented Oct 22, 2022

I made some modifications in AGAT version 1. Tsebra files should be handled correctly in that version.

@Juke34 Juke34 closed this as completed Oct 22, 2022
Juke34 added a commit that referenced this issue Oct 22, 2022
* Use AppEaser module from https://github.com/polettix/App-Easer to create a multi layer help
* Add a "agat" script => can be used to modify/expose config; to expose levels.yaml; to list the tools; get agat version, etc.
* design a configuration file (config.yaml) to apply config on all scripts
* add config module (with function to check configuration file e.g.) 
* Merge feature_levels json files into one single yaml file
* Make gtf output possible to every sp script via config.yaml and create a dedicated module (OmniscientToGTF) based on code from the gff2gtf script.
* fix the script compare_annotations by rewriting it from scratch. + add tests
* Create a BioperlGFF module based on the Bioperl code to correct parse GFF/GTF files when they contain a mix of GFF1 and GFF2/GFF3 like seen in Augustus and Tsebra output files (fix #288).
* Modify name of the Module Omniscient by AGAT