agat_sp_extract_sequences.pl WARNINGs & 13 exons created that were missing #273

Closed
spoonbender76 opened this issue Jul 26, 2022 · 1 comment

spoonbender76 commented Jul 26, 2022

Describe the bug
I was trying to extract exon sequences in FASTA format from a GFF3 annotation and a reference genome with agat_sp_extract_sequences.pl -f GCF_014356525.1_ASM1435652v1_genomic.fna -g GCF_014356525.1_ASM1435652v1_genomic.gff -t exon --split -o nl_exons.fasta.

AGAT printed warnings like this:

WARNING level1: This feature level1 is not a duplicate but has an ID already used (original id: 9a9815af-ec13-4e6f-abd4-965eb5fad7f5).
AGAT does not deal with that currently. @ the feature is:
NC_052504.1	RefSeq	cDNA_match	235206	235334	.	-	.	ID "nbis-cdna_match-1"  ; Target "XM_039426992.1 222 350 +"  ; for_remapping 2 ; gap_count 1 ; num_ident 1938 ; num_mismatch 0 ; pct_coverage 100 ; pct_coverage_hiqual 100 ; pct_identity_gap "99.9484"  ; pct_identity_ungap 100 ; rank 1
Indeed we changed the ID for this feature l1 to be uniq but we do not change
the parent attribute for the subfeatures because we do not know to which L1
they are really linked to. As it is now we will probably end up with chimeric records.
8 warning messages: WARNING level1: This feature level1 is not a duplicate but has an ID already used (original id: 9a9815af-ec13-4e6f-abd4-965eb5fad7f5).
AGAT does not deal with that currently. 

and in the end I noticed this:

----------------------------- Check9: check exons ------------------------------
13 exons created that were missing
No exons locations modified
No supernumerary exons removed
No level2 locations modified

I wonder if it's an AGAT bug that causes these warning messages and some exons to be missing. Any advice would be appreciated.

General (please complete the following information):

  • AGAT version v0.9.2
  • AGAT installation/use Conda
  • OS: Ubuntu 22.04 LTS

To Reproduce
I downloaded genomic fna and gff files from ncbi and unzipped them.
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Nilaparvata_lugens/latest_assembly_versions/GCF_014356525.1_ASM1435652v1/GCF_014356525.1_ASM1435652v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Nilaparvata_lugens/latest_assembly_versions/GCF_014356525.1_ASM1435652v1/GCF_014356525.1_ASM1435652v1_genomic.gff.gz

AGAT v0.9.2 was installed with Conda.

I ran AGAT with this command:
agat_sp_extract_sequences.pl -f GCF_014356525.1_ASM1435652v1_genomic.fna -g GCF_014356525.1_ASM1435652v1_genomic.gff -t exon --split -o nl_exons.fasta > agat.log
agat.log

Collaborator

Juke34 commented Jul 26, 2022

Hi,
Thank you for using AGAT.

Yes, this happens because the "cDNA_match" features are level1 features. A level1 feature is expected to have child features linked to it, so it needs to be identifiable by a unique ID.
Let's see with this example why AGAT tells you to be careful:

NC_052504.1	RefSeq	cDNA_match	35	35	.	-	.	ID=common_name
NC_052504.1	RefSeq	match_part	35	35	.	-	.	ID=aaa;Parent=common_name
NC_052504.1	RefSeq	cDNA_match	353	353	.	-	.	ID=common_name
NC_052504.1	RefSeq	match_part	353	353	.	-	.	ID=bbb;Parent=common_name
NC_052504.1	RefSeq	cDNA_match	3515	3517	.	-	.	ID=common_name
NC_052504.1	RefSeq	match_part	3515	3517	.	-	.	ID=ccc;Parent=common_name

This will give:

NC_052504.1	RefSeq	cDNA_match	35	3517	.	-	.	ID=39f6b45b-7145-4824-9802-df0a3cb90b6e
NC_052504.1	RefSeq	match_part	35	35	.	-	.	ID=baba;Parent=39f6b45b-7145-4824-9802-df0a3cb90b6e
NC_052504.1	RefSeq	match_part	353	353	.	-	.	ID=bobo;Parent=39f6b45b-7145-4824-9802-df0a3cb90b6e
NC_052504.1	RefSeq	match_part	3515	3517	.	-	.	ID=bibi;Parent=39f6b45b-7145-4824-9802-df0a3cb90b6e

So you end up with one chimeric group of cDNA_match features, instead of three cDNA_match feature groups that each have their own child.
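To make the merge concrete, here is a toy Python sketch that groups the six example lines by the level1 ID. It is only an illustration of what happens when records are keyed on a shared ID, not AGAT's actual code; the helper names `parse_attributes` and `group_by_level1_id` are made up for this example.

```python
from collections import defaultdict

# The six example lines from above (tab-separated GFF3 columns).
EXAMPLE = (
    "NC_052504.1\tRefSeq\tcDNA_match\t35\t35\t.\t-\t.\tID=common_name\n"
    "NC_052504.1\tRefSeq\tmatch_part\t35\t35\t.\t-\t.\tID=aaa;Parent=common_name\n"
    "NC_052504.1\tRefSeq\tcDNA_match\t353\t353\t.\t-\t.\tID=common_name\n"
    "NC_052504.1\tRefSeq\tmatch_part\t353\t353\t.\t-\t.\tID=bbb;Parent=common_name\n"
    "NC_052504.1\tRefSeq\tcDNA_match\t3515\t3517\t.\t-\t.\tID=common_name\n"
    "NC_052504.1\tRefSeq\tmatch_part\t3515\t3517\t.\t-\t.\tID=ccc;Parent=common_name\n"
)


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    return dict(pair.split("=", 1) for pair in column9.strip().split(";") if pair)


def group_by_level1_id(gff_text):
    """Group features by level1 ID (hypothetical model, not AGAT internals)."""
    groups = defaultdict(lambda: {"start": None, "end": None, "children": []})
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "cDNA_match":
            # Same ID seen again: the record is extended, not duplicated.
            group = groups[attrs["ID"]]
            group["start"] = start if group["start"] is None else min(group["start"], start)
            group["end"] = end if group["end"] is None else max(group["end"], end)
        elif ftype == "match_part":
            # Children can only be attached through the Parent value.
            groups[attrs["Parent"]]["children"].append((start, end))
    return dict(groups)


groups = group_by_level1_id(EXAMPLE)
for level1_id, group in groups.items():
    print(level1_id, group["start"], group["end"], group["children"])
```

Because all three cDNA_match lines share one ID, the sketch ends with a single record that spans every location and owns all three match_part children, which is the chimeric situation the warning describes.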

A cDNA_match feature expects a child feature, which does not exist in your file. Consequently, AGAT gets rid of these features. As you are interested in exons, and no exon is attached to the cDNA_match features, you can ignore the message.

If you want to parse those features properly, you will have to handle them as level2 features. Either change the feature-level JSON files by moving cDNA_match from the level1 file to the level2 file, or rename cDNA_match to match_part in the input file with a sed command. In both cases you will run into another problem: there is no Parent attribute to group the features together, only the ID. You can either put those features in a separate file, parse them with agat_convert_sp_gxf2gxf.pl -gff infile.gff -c ID -o output.gff and then merge the result back with the other features; or (maybe easier) use an awk or sed command to change the ID attribute to Parent, but only for the cDNA_match features.
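For readers who prefer a script over awk or sed, the last option (changing ID to Parent only on cDNA_match lines) could be sketched in Python as below. This is a minimal sketch under assumptions: the function names and the file names in the usage note are hypothetical, it assumes standard tab-separated GFF3 with `key=value` attributes, and whether AGAT then groups the features as wanted has not been verified here.

```python
import re


def id_to_parent_for_cdna_match(line):
    """Rename the ID attribute to Parent on cDNA_match lines only.

    All other lines (comments, other feature types, malformed rows)
    are returned unchanged.
    """
    if line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "cDNA_match":
        return line
    # Only touch an attribute key that is exactly 'ID', at the start of
    # column 9 or right after a ';' separator.
    cols[8] = re.sub(r"(^|;)ID=", r"\1Parent=", cols[8], count=1)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline


def rewrite_gff(src_path, dst_path):
    """Apply the rename to every line of src_path and write dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(id_to_parent_for_cdna_match(line))
```

It could be called as `rewrite_gff("input.gff", "input.parent.gff")`, where both file names are placeholders. Note that this only covers the attribute rename; the features still have to be treated as level2, as described above.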

Juke34 pushed a commit that referenced this issue Jul 26, 2022
@Juke34 Juke34 closed this as completed Aug 22, 2022