Error GFF file #24

Closed
ireneortega opened this issue Apr 17, 2020 · 11 comments

Comments

@ireneortega

ireneortega commented Apr 17, 2020

I am having trouble with my GFF file. First, AMRFinder says "line 1924: 9 fields are expected in each line". I think the problem is that the file also contains the contig sequences, so I deleted them and kept only the annotation information. Could that have been the problem?

But then it says "Protein id xxxxxx is not in the .gff file".
F1021_gff_file.txt
(this is a fragment of the GFF file, since this format is not supported as an attachment)

F1021_protein.txt

I read the information in "Known issues", but I still don't know what the .gff file should look like. Could you please tell me what the 9th field should look like in the attached example .gff file?

Thanks!

@vbrover
Contributor

vbrover commented Apr 17, 2020

I removed the last line, which contains the word "(continue)", from F1021_gff_file.txt and then ran:

cat F1021_gff_file.txt | sed 's/;locus_tag=/;Name=/1' > aa.gff

Then AMRFinder worked:

amrfinder -p F1021_protein.txt -g aa.gff 

For the format of the GFF file see https://github.com/ncbi/amr/wiki/Running-AMRFinderPlus#input-file-formats.
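A minimal Python sketch of what the sed command above does to each line: it renames the first `;locus_tag=` attribute in the 9th field to `;Name=`. According to the wiki page linked above, the `Name=` value is what gets matched against the identifiers in the protein FASTA file. The function name and the sample line are hypothetical and made up for illustration; they are not taken from F1021_gff_file.txt.

```python
def locus_tag_to_name(gff_line: str) -> str:
    """Rename the first ';locus_tag=' on the line to ';Name=' (same effect as the sed call)."""
    return gff_line.replace(";locus_tag=", ";Name=", 1)


# Hypothetical annotation line: 9 tab-separated fields, attributes in the last one.
sample = "contig_1\tProdigal\tCDS\t1\t1350\t.\t+\t0\tID=ABC_00001;locus_tag=ABC_00001;product=DnaA"

# Print only the 9th field after the rename.
print(locus_tag_to_name(sample).split("\t")[8])
# ID=ABC_00001;Name=ABC_00001;product=DnaA
```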

How was this GFF file created?

@evolarjun
Contributor

Hi Irene,

We've seen this issue before: Prokka appends the assembly to the GFF, so I'm guessing you're using Prokka. As you discerned, you have to chop the assembly off before passing the GFF to amrfinder. If I've guessed right about Prokka, or if the annotation output you're dealing with is in the same format, here are a couple of Perl one-liners that have worked for Prokka output before.

This should get you a GFF file that will work for AMRFinderPlus:

perl -pe '/^##FASTA/ && exit; s/(\W)Name=/$1OldName=/i; s/ID=([^;]+)/ID=$1;Name=$1/' <prokka_output.gff>  > <for_amrfinder.gff>

If you don't have the nucleotide FASTA, you can use this to extract it from the Prokka GFF:

perl -ne 'print if ($p); /^##FASTA/ && $p++' <prokka_output.gff> > <for_amrfinder.fna>

Then you can run AMRFinderPlus in full combined mode like:

amrfinder -p <protein.fa> -n <for_amrfinder.fna> -g <for_amrfinder.gff> > amrfinder_output.tsv
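For anyone without Perl at hand, the two one-liners above can be sketched in Python roughly as follows. This is an illustrative equivalent, not part of AMRFinderPlus; the function name and the sample lines are hypothetical. It assumes the GFF uses a `##FASTA` line to separate annotations from sequences, as Prokka output does.

```python
import re


def split_prokka_gff(lines):
    """Split a Prokka-style GFF into (annotation_lines, fasta_lines).

    Annotation lines get the same two rewrites as the first one-liner:
    an existing Name= becomes OldName=, and Name=<ID> is added after ID=<ID>.
    Everything after the ##FASTA marker goes into fasta_lines.
    """
    annotation, fasta = [], []
    in_fasta = False
    for raw in lines:
        line = raw.rstrip("\n")
        if in_fasta:
            fasta.append(line)
        elif line.startswith("##FASTA"):
            in_fasta = True  # the marker line itself is dropped
        else:
            line = re.sub(r"(\W)Name=", r"\1OldName=", line, count=1, flags=re.IGNORECASE)
            line = re.sub(r"ID=([^;]+)", r"ID=\1;Name=\1", line, count=1)
            annotation.append(line)
    return annotation, fasta


# Hypothetical input, for illustration only.
example = [
    "##gff-version 3",
    "contig_1\tProkka\tCDS\t1\t1350\t.\t+\t0\tID=ABC_00001;Name=dnaA;locus_tag=ABC_00001",
    "##FASTA",
    ">contig_1",
    "ATGTCACTTTCG",
]
annotation, fasta = split_prokka_gff(example)
print(annotation[1].split("\t")[8])  # ID=ABC_00001;Name=ABC_00001;OldName=dnaA;locus_tag=ABC_00001
print(fasta)                         # ['>contig_1', 'ATGTCACTTTCG']
```

Note that if the input no longer contains a `##FASTA` section, the second list comes back empty.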

Thanks for posting the issue, and please let us know if the above one-liners work for you.

If they do, I'll add them to the documentation so other people won't have the same issue. There's an example GFF file distributed with the software, if that will help (https://github.com/ncbi/amr/blob/master/test_prot.gff). I will also see if I can improve the documentation of the GFF file format.

Arjun

@ireneortega
Author

ireneortega commented Apr 18, 2020

Hi Arjun,

Yes, you were right: the genome was annotated with Prokka. The first Perl command worked for me, and so did AMRFinderPlus in full combined mode. The second Perl command, however, just created an empty .fna file. I encourage you to add these commands to the documentation of the GFF file format to help other users.

Now I want to identify both known and unknown mutations. Does AMRFinder find both, or just known mutations? The report generated with --mutation_all shows many mutations that do not appear in the output file. I don't understand how it works: CmeR mutations do not appear in the output file, even though the point mutation report for the specific organism shows a coverage of the reference sequence of 100% and an identity to the reference sequence of 99.05%. Could they be unknown mutations?

Thanks for your help and for keeping this tool updated!

Irene

@evolarjun
Contributor

Hi Irene,

I'm not sure why the second Perl one-liner didn't work, but possibly it's because you had already stripped the assembly out of the GFF file before passing it to the one-liner.

AMRFinderPlus does not report mutations that are not in its database. Because it is specifically designed to probe for sets of curated genes and curated known resistance mutations, it will only probe for genes and mutations in the PD Reference Gene Database. The --mutation_all option is designed to differentiate between four outcomes: identifying a known resistance-associated mutation, identifying the database variant, identifying an alternate residue at that site, and not finding the site at all. It does not call mutations at other sites in the gene or in other genes in the genome.

I will discuss with the team the possibility of adding an option to identify all differences from the reference protein, but until now AMRFinderPlus has been very focused on identifying only resistance-associated elements known in wild bacteria. At this point we are not including laboratory-induced mutations in the database.

If there is a published account of a resistance-associated mutation that we do not include, we may have missed it; please let us know if that's the case.

You can see what genes/mutations are probed by AMRFinderPlus by looking in the PD Reference Gene Catalog at https://www.ncbi.nlm.nih.gov/pathogens/isolates#/refgene/

In addition to what's in the Reference Gene Catalog, the AMRFinderPlus database includes HMMs and a tree structure for the genes, but those aren't relevant for point mutation identification. We only have one mutation in cmeR for Campylobacter: cmeR_G86A, so if your assembly has that mutation and AMRFinderPlus is not detecting it, then we may have a bug. Other than that, AMRFinderPlus shouldn't be reporting novel sites.

Thanks for your interest and let us know if you have more questions.
Arjun

@ireneortega
Author

Hi Arjun,

So far, AMRFinder meets my needs, and I will use it in combination with other tools to find unknown mutations. Thanks!!

Irene

@neelam19051

Hi, I changed my Prokka GFF file using this command: perl -pe '/^##FASTA/ && exit; s/(\W)Name=/$1OldName=/i; s/ID=([^;]+)/ID=$1;Name=$1/' <prokka_output.gff> > <for_amrfinder.gff> but I am still getting the same error. What should I do?

amrfinder -p P_aeruginosa_ZPPH33.faa -g P_aeruginosa_ZPPH33_a.gff -n P_aeruginosa_ZPPH33.fna -O Pseudomonas_aeruginosa --plus

Running: amrfinder -p P_aeruginosa_ZPPH33.faa -g P_aeruginosa_ZPPH33_a.gff -n P_aeruginosa_ZPPH33.fna -O Pseudomonas_aeruginosa --plus
Software directory: '/home/bvs/anaconda3/envs/myenv/bin/'
Software version: 3.10.30
Database directory: '/home/bvs/anaconda3/envs/myenv/share/amrfinderplus/data/2022-05-26.1'
Database version: 2022-05-26.1
AMRFinder combined translated and protein and mutation search

GFF file mismatch.
*** ERROR ***
gff_check.cpp: Protein id "JMCBFLMO_00001_Chromosomal_replication_initiator_protein_DnaA" is not in the .gff-file

HOSTNAME: ?
SHELL: /bin/bash
PWD: /home/bvs/neelam/annotated/annotated/amrfinder_all
PATH: /home/bvs/anaconda3/envs/myenv/bin:/home/bvs/.local/bin:/home/bvs/bin:/home/bvs/anaconda3/condabin:/home/bvs/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin
Progam name: gff_check
Command line: /home/bvs/anaconda3/envs/myenv/bin/gff_check P_aeruginosa_ZPPH33_a.gff -prot P_aeruginosa_ZPPH33.faa -dna P_aeruginosa_ZPPH33.fna -log /tmp/amrfinder.UxeqP3.log

Thank you!

@vbrover
Contributor

vbrover commented Jun 24, 2022

Could you attach the files?

P_aeruginosa_ZPPH33.faa 
P_aeruginosa_ZPPH33_a.gff
P_aeruginosa_ZPPH33.fna

@neelam19051

Hi, I am attaching the files here; please have a look.

P_aeruginosa_ZPPH33.zip

@vbrover
Contributor

vbrover commented Jun 27, 2022

Thank you!

The goal of a .gff-file is to link the .faa- and .fna-files.
The software creating the .gff-files must use the sequence identifiers from the .faa- and .fna-files.
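The linkage described above can be checked before running AMRFinderPlus. The sketch below is a rough pre-flight check and not the actual gff_check implementation: it assumes that a protein identifier is the text between `>` and the first whitespace, and that the GFF refers to it through a `Name=` attribute in the 9th field. All function names are hypothetical, and the GFF line in the example is invented; only the protein identifier comes from the error message earlier in this thread.

```python
def fasta_ids(fasta_lines):
    """Return the identifier of each FASTA record (text between '>' and the first whitespace)."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            ids.append(parts[0] if parts else "")
    return ids


def gff_names(gff_lines):
    """Collect the Name= values from the 9th field of every annotation line."""
    names = set()
    for number, raw in enumerate(gff_lines, start=1):
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"line {number}: 9 fields are expected, found {len(fields)}")
        for attribute in fields[8].split(";"):
            key, _, value = attribute.partition("=")
            if key == "Name":
                names.add(value)
    return names


def missing_protein_ids(faa_lines, gff_lines):
    """Return the protein identifiers that no Name= attribute in the GFF refers to."""
    names = gff_names(gff_lines)
    return [pid for pid in fasta_ids(faa_lines) if pid not in names]


# Invented GFF line; the protein identifier is the one from the error message above.
gff = ["##gff-version 3",
       "contig_1\tProkka\tCDS\t1\t1350\t.\t+\t0\tID=JMCBFLMO_00001;Name=JMCBFLMO_00001"]
faa = [">JMCBFLMO_00001_Chromosomal_replication_initiator_protein_DnaA", "MSLSLWQQ"]
print(missing_protein_ids(faa, gff))
# ['JMCBFLMO_00001_Chromosomal_replication_initiator_protein_DnaA']
```

An empty list means every protein identifier was found in the GFF under these assumptions.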

I have done this:

sed 's/^>\([^_]\+_[^_]\+\)_/>\1 /1'  P_aeruginosa_ZPPH33.faa > aa

Then this worked:

amrfinder  -p aa  -g P_aeruginosa_ZPPH33.gff  -n P_aeruginosa_ZPPH33.fna 
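The sed command above turns the second underscore of each FASTA header into a space, so that the identifier ends after the locus tag. A minimal Python sketch of the same rewrite follows; the function name is hypothetical, and like the sed expression it assumes the locus tag has the form PREFIX_NUMBER with no underscore inside the prefix.

```python
import re


def split_id_from_description(fasta_line: str) -> str:
    """Replace the second underscore of a FASTA header with a space.

    Lines that do not start with '>' (sequence lines) are returned unchanged.
    """
    return re.sub(r"^>([^_]+_[^_]+)_", r">\1 ", fasta_line, count=1)


header = ">JMCBFLMO_00001_Chromosomal_replication_initiator_protein_DnaA"
print(split_id_from_description(header))
# >JMCBFLMO_00001 Chromosomal_replication_initiator_protein_DnaA
```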

@neelam19051

neelam19051 commented Jul 2, 2022

Hi, first of all, thank you for your time. It works when I run a single genome with the command above, but I get errors when I run it in a loop over multiple files, and each file gives a different error.

*** ERROR ***
Protein sequence looks like a nucleotide sequence

HOSTNAME: ?
SHELL: /bin/bash
PWD: /home/bvs/neelam/annotated/annotated/amrfinder_all
PATH: /home/bvs/anaconda3/envs/myenv/bin:/home/bvs/.local/bin:/home/bvs/bin:/home/bvs/anaconda3/condabin:/home/bvs/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin
Progam name: fasta_check
Command line: /home/bvs/anaconda3/envs/myenv/bin/fasta_check P_aeruginosa_KBP_PA_F19.fna -aa -log /tmp/amrfinder.fH4Vy1.log
P_aeruginosa_KBP_PA_F19.gff
Running: amrfinder -p P_aeruginosa_KBP_PA_F19.gff -g P_aeruginosa_KBP_PA_F19.gff -n P_aeruginosa_KBP_PA_F19.gff
Software directory: '/home/bvs/anaconda3/envs/myenv/bin/'
Software version: 3.10.30
Database directory: '/home/bvs/anaconda3/envs/myenv/share/amrfinderplus/data/2022-05-26.1'
Database version: 2022-05-26.1
AMRFinder combined translated and protein search

  • include -O ORGANISM, --organism ORGANISM option to add mutation searches and suppress common proteins

*** ERROR ***
File P_aeruginosa_KBP_PA_F19.gff, line 1: FASTA should start with '>'

HOSTNAME: ?
SHELL: /bin/bash
PWD: /home/bvs/neelam/annotated/annotated/amrfinder_all
PATH: /home/bvs/anaconda3/envs/myenv/bin:/home/bvs/.local/bin:/home/bvs/bin:/home/bvs/anaconda3/condabin:/home/bvs/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin
Progam name: fasta_check
Command line: /home/bvs/anaconda3/envs/myenv/bin/fasta_check P_aeruginosa_KBP_PA_F19.gff -aa -log /tmp/amrfinder.2EXvS5.log
P_aeruginosa_KCP_1.faa
Running: amrfinder -p P_aeruginosa_KCP_1.faa -g P_aeruginosa_KCP_1.faa -n P_aeruginosa_KCP_1.faa
Software directory: '/home/bvs/anaconda3/envs/myenv/bin/'
Software version: 3.10.30
Database directory: '/home/bvs/anaconda3/envs/myenv/share/amrfinderplus/data/2022-05-26.1'
Database version: 2022-05-26.1
AMRFinder combined translated and protein search

  • include -O ORGANISM, --organism ORGANISM option to add mutation searches and suppress common proteins

GFF file mismatch.
*** ERROR ***
File P_aeruginosa_KCP_1.faa, line 1: 9 fields are expected in each line

Thank you!

@vbrover
Contributor

vbrover commented Jul 2, 2022

Running: amrfinder -p P_aeruginosa_KBP_PA_F19.gff -g P_aeruginosa_KBP_PA_F19.gff -n P_aeruginosa_KBP_PA_F19.gff

Here the same file is passed to -p, -g and -n. It should be:

amrfinder  -p P_aeruginosa_KBP_PA_F19.faa  -g P_aeruginosa_KBP_PA_F19.gff  -n P_aeruginosa_KBP_PA_F19.fna
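One way to avoid passing the same file to all three options in a loop is to iterate over shared filename prefixes rather than over individual files. The sketch below assumes that each genome has three files with the same prefix and the extensions .faa, .gff and .fna, as in the corrected command above; the function names are hypothetical.

```python
from pathlib import Path

REQUIRED_SUFFIXES = (".faa", ".gff", ".fna")


def complete_prefixes(filenames):
    """Return the sorted prefixes for which .faa, .gff and .fna files are all present."""
    seen = {}
    for name in filenames:
        path = Path(name)
        if path.suffix in REQUIRED_SUFFIXES:
            seen.setdefault(str(path.with_suffix("")), set()).add(path.suffix)
    return sorted(p for p, found in seen.items() if found == set(REQUIRED_SUFFIXES))


def amrfinder_command(prefix):
    """Build the argument list for one genome: protein, GFF and nucleotide files."""
    return ["amrfinder", "-p", f"{prefix}.faa", "-g", f"{prefix}.gff", "-n", f"{prefix}.fna"]


files = [
    "P_aeruginosa_KBP_PA_F19.faa",
    "P_aeruginosa_KBP_PA_F19.gff",
    "P_aeruginosa_KBP_PA_F19.fna",
    "P_aeruginosa_KCP_1.faa",  # incomplete set, so it is skipped
]
for prefix in complete_prefixes(files):
    print(" ".join(amrfinder_command(prefix)))
# amrfinder -p P_aeruginosa_KBP_PA_F19.faa -g P_aeruginosa_KBP_PA_F19.gff -n P_aeruginosa_KBP_PA_F19.fna
```

Each argument list could then be handed to `subprocess.run(cmd, check=True)`, and options used earlier in this thread, such as `-O Pseudomonas_aeruginosa --plus`, can be appended to the list.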
