Add import feature for user-provided regions and/or features #250

oschwengers · 2023-10-24T07:42:26Z

As this gets asked more & more often (#216 #245 #247 ), I'm thinking of adding this as a new larger feature to Bakta.

At first, this is a mere reservoir for ideas and requirements of this new feature - active early feedback is highly welcome!

Note: So far, I cannot make any promises if and when this will be available.

Based on the feedback so far, a first sketch looks like this:

- 2 new mutually exclusive options accepting user-provided regions/features in either GFF3 or GenBank format:
  - --import-regions to import feature regions without annotations
  - --import-features to import entire features with annotations
- As a starter (and maybe permanently), this will import CDS features only.
- User-provided ...
  - regions will supersede de novo predicted regions
  - features will supersede de novo predicted regions and annotations

Any thoughts, ideas, comments? Please, let us know what you think.

marade · 2023-10-24T12:44:47Z

For my purposes (#216), the proposed design using GFF3 CDS features would appear to work well. While ORF callers are quite good these days, sometimes they miss features where we have strong biological evidence for their existence.

oschwengers · 2023-11-17T15:34:49Z

After some further consideration, I decided to keep this simpler and go for a mere import of CDS regions w/o any functional annotations.

So, now there's a new parameter --regions <file> accepting a priori CDS regions in GFF3 or GenBank file format:

- Currently, only CDS features are supported. This might be expanded as required/requested.
- User-provided a priori CDS regions supersede de novo-predicted CDS.
- User-provided a priori CDS regions are subject to the regular internal functional annotation process. Complementary functional information can be provided as user-proteins via --proteins <file>.
- A maximum overlap with de novo-predicted CDS of 30 bp is allowed. This might be subject to future changes.
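
For illustration, here is a minimal sketch of how a regions file with a priori CDS could be assembled in Python. The contig name, coordinates, feature IDs and the "user" source column are made-up placeholders, and the helper names are hypothetical; only the GFF3 column layout (1-based, inclusive coordinates, start <= stop on both strands) is taken from the GFF3 format itself.

```python
def cds_to_gff3_line(seq_id, start, stop, strand, feature_id):
    """Format one CDS as a tab-separated GFF3 line (1-based, inclusive coordinates)."""
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    if start > stop:
        raise ValueError("GFF3 requires start <= stop, also on the reverse strand")
    # Columns: seqid, source, type, start, end, score, strand, phase, attributes
    columns = [seq_id, "user", "CDS", str(start), str(stop), ".", strand, "0", f"ID={feature_id}"]
    return "\t".join(columns)


def build_regions_gff3(cdss):
    """Build the text of a GFF3 file from (seq_id, start, stop, strand, feature_id) tuples."""
    lines = ["##gff-version 3"]
    lines += [cds_to_gff3_line(*cds) for cds in cdss]
    return "\n".join(lines) + "\n"


# Placeholder CDS regions; the sequence ID must match an ID in the genome FASTA.
gff3_text = build_regions_gff3([
    ("contig_1", 1201, 2100, "+", "user_cds_1"),
    ("contig_1", 5400, 6002, "-", "user_cds_2"),
])
print(gff3_text, end="")
```

Written to a file, this text is the kind of input that --regions <file> is meant to accept.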

marade · 2023-11-20T22:39:23Z

Attempting to use this, I find I'm wanting Bakta to search for my sequences instead of having to do it myself. Say I have several hundred genomes to annotate, and I know geneX exists in many of them, but geneX tends to not get annotated. If I understand correctly, under the current scheme I have to go find the coordinates for geneX in each of those genomes and then make a supplemental GFF for each genome, and then supply that GFF for --regions when I run Bakta. What I'd rather do is feed Bakta the sequence for geneX, and if a sufficiently homologous match is found it gets added as a user CDS in the way described above.

oschwengers · 2023-11-21T08:42:30Z

Thanks @marade for the clarification. Now, I see your point. However, I've read and understood your use-case above and in #216 in the way that importing a priori-annotated CDS regions is important to allow for amended regional annotations in single genomes. This new feature now allows for such manual annotations. However, as you already mentioned, of course these coordinates must be provided for each single genome. Even in clonal genomes, gene positions can (and often will) slightly differ.

So, if I understand your post correctly, you're interested in inferring CDS simply by homology without de novo-prediction. This could also be done, but in general, this should be handled with care since you cannot know if this is a proper functional gene. De novo gene prediction tools take into account further information such as genetic neighborhood, ribosomal binding sites, etc.

So, in principle, it's possible and not too complicated to implement and add such a feature, too. But there are several non-trivial questions arising from that:

- How much of a gene must be present to annotate it (query/subject coverage, identity)?
- Shall only valid CDS with start and stop codons be taken into account?
- What about InDels and frameshifts?
- What about overlaps with de novo-predicted genes? Yes/no? If yes, how much overlap in which frame is allowed?

Therefore, I'm reluctant to implement this simply b/c there are so many different parameters to either anticipate as a default or ask from the user.

But, what about an external script that can be executed before Bakta? This could use tblastx or diamond blastx on a given set of protein sequences, go through the decision process described above and finally provide detected CDS in a GFF3 file that can be fed into Bakta via --regions.

One huge advantage would be that the various parameters required to adapt this to different use cases can be added w/o overcrowding Bakta's UI.
I'd be happy to accept a PR for such an accompanying script.
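
As a rough sketch of the core of such a pre-processing script: the function below turns already-parsed homology hits into GFF3 CDS lines, applying identity and coverage thresholds. Everything here is an assumption for illustration (the Hit fields, the default thresholds, the "homology" source column, the IDs); it is not the script from #260. It deliberately sidesteps the harder questions above: hits are not extended to start/stop codons, frameshifts are simply dropped, and overlaps are not checked.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One hit of a user-provided protein against a genome (hypothetical structure)."""
    query_id: str     # protein the user is looking for
    contig: str       # genome sequence the hit lies on
    start: int        # 1-based hit coordinates on the contig;
    end: int          # start > end signals a reverse-strand hit
    identity: float   # percent identity of the alignment
    coverage: float   # fraction of the query covered by the alignment (0..1)


def hits_to_gff3(hits, min_identity=90.0, min_coverage=0.9):
    """Keep sufficiently similar hits and format them as GFF3 CDS regions."""
    lines = ["##gff-version 3"]
    for i, hit in enumerate(hits, start=1):
        if hit.identity < min_identity or hit.coverage < min_coverage:
            continue  # not homologous enough
        strand = "+" if hit.start <= hit.end else "-"
        start, stop = sorted((hit.start, hit.end))  # GFF3 wants start <= stop
        if (stop - start + 1) % 3 != 0:
            continue  # not a whole number of codons, e.g. a frameshifted hit
        attributes = f"ID=user_cds_{i};Name={hit.query_id}"
        lines.append("\t".join(
            [hit.contig, "homology", "CDS", str(start), str(stop), ".", strand, "0", attributes]
        ))
    return "\n".join(lines) + "\n"


# Made-up hits for a placeholder protein "geneX".
example_hits = [
    Hit("geneX", "contig_1", 100, 999, 98.5, 1.0),    # kept, forward strand
    Hit("geneX", "contig_2", 2000, 1101, 95.0, 0.95),  # kept, reverse strand
    Hit("geneX", "contig_3", 10, 400, 60.0, 0.5),      # dropped: too dissimilar
]
regions_gff3 = hits_to_gff3(example_hits)
print(regions_gff3, end="")
```

Keeping the thresholds as function arguments is the point of the external-script idea: they can be exposed as script options without touching Bakta's own command line.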

marade · 2023-11-21T22:56:18Z

I don't love this solution, but here's a script to try. Please have a look when you get a chance.

thorellk · 2023-11-24T14:54:25Z

My use case is that I have a lot of NCBI-annotated genomes where, for consistency, I want to continue using the same locus tags and CDS coordinates as in the NCBI GFF files but improve the hopelessly bad annotation using my own curated reference protein FASTA file. I will try the --regions option, which sounds great, but ideally I would also like an option to disable the de novo CDS prediction by pyrodigal. I can see when this could be useful, but in my case it is redundant.

thorellk · 2023-11-26T20:02:28Z

I tried the --regions option but got an error message.

Traceback (most recent call last):
  File "/proj/uppstore2017270/conda_envs/bakta_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/proj/uppstore2017270/conda_envs/bakta_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/crex/proj/uppstore2017270/common_tools/bakta/bakta/main.py", line 619, in <module>
    main()
  File "/crex/proj/uppstore2017270/common_tools/bakta/bakta/main.py", line 245, in main
    imported_cdss = feat_cds.import_user_cdss(genome, cfg.regions)
  File "/crex/proj/uppstore2017270/common_tools/bakta/bakta/features/cds.py", line 193, in import_user_cdss
    contigs_by_original_id = {c['orig_id']: c for c in genome['contigs']}
  File "/crex/proj/uppstore2017270/common_tools/bakta/bakta/features/cds.py", line 193, in <dictcomp>
    contigs_by_original_id = {c['orig_id']: c for c in genome['contigs']}
KeyError: 'orig_id'

I have tried to figure out what it means but haven't managed to solve it. I attach my runfile including the log file, and input fasta and gff file.

231124_bakta_reannot_HpGP_ncbi.sh.txt
HpGP-26695-ATCC.fsa.txt
HpGP-26695-ATCC.gff.txt

oschwengers · 2023-11-27T07:29:28Z

Hi @thorellk,
thanks for reporting. I'll add a more verbose error message. In this case, your Fasta sequence has a wrong ID. Your GFF describes features for a sequence with ID CP079087, but in your Fasta, there's only a sequence with ID Helicobacter.
I guess your Fasta header >Helicobacter pylori is wrong.

Just change it to >CP079087 Helicobacter pylori and it should run as expected.
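
This kind of mismatch can be caught before a run with a small check. The sketch below is not part of Bakta; the function names are made up. It takes the first whitespace-delimited token of each FASTA header as the sequence ID and compares it against column 1 of the GFF3 file.

```python
def fasta_ids(fasta_text):
    """Return sequence IDs: the first whitespace-delimited token of each '>' header."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff3_seq_ids(gff3_text):
    """Return the sequence IDs (column 1) used by feature lines of a GFF3 file."""
    ids = set()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequences follow, no more features
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def missing_seq_ids(fasta_text, gff3_text):
    """Sequence IDs referenced in the GFF3 but absent from the FASTA."""
    return sorted(gff3_seq_ids(gff3_text) - fasta_ids(fasta_text))


# Shortened stand-ins for the files discussed above.
gff3 = "##gff-version 3\nCP079087\tGenBank\tCDS\t1\t9\t.\t+\t0\tID=x\n"
before = missing_seq_ids(">Helicobacter pylori\nATGAAATAA\n", gff3)
after = missing_seq_ids(">CP079087 Helicobacter pylori\nATGAAATAA\n", gff3)
print(before)  # ['CP079087']
print(after)   # []
```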

oschwengers · 2023-11-27T16:01:19Z

Though the homology-based automated lookup of user-provided features is still open, I'd see the initial use-case addressed and covered. Therefore, I'd like to close this for now.

To follow up on the homology-based lookups, please either use #260 or #247.
@thorellk If the described bug remains, please do not hesitate to re-open this.

Thanks a lot for all these contributions!

thorellk · 2023-11-27T20:31:15Z

> Just change it to >CP079087 Helicobacter pylori and it should run as expected.

Hi @oschwengers
Actually, I expected this inconsistency between fasta and gff contig id to lead to problems, and if you look into the bash script that I ran above, I actually already renamed the fasta header similarly to what you suggested. I cloned the repository now and still get the same error message, even if the fasta file has header >CP079087...


HpGP-26695-ATCC has chromosome id CP079087

parse genome sequences...
        imported: 1
        filtered & revised: 1
        contigs: 1

start annotation...
predict tRNAs...
        found: 37
predict tmRNAs...
        found: 1
predict rRNAs...
        found: 6
predict ncRNAs...
        found: 10
predict ncRNA regions...
        found: 1
predict CRISPR arrays...
        found: 0
predict & annotate CDSs...
        predicted: 1573 
        discarded spurious: 0
        revised translational exceptions: 0
Traceback (most recent call last):
  File "/proj/uppstore2017270/conda_envs/bakta_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/proj/uppstore2017270/conda_envs/bakta_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/crex/proj/uppstore2017270/common_tools/bakta/bakta/main.py", line 619, in <module>
    main()
  File "/crex/proj/uppstore2017270/common_tools/bakta/bakta/main.py", line 245, in main
    imported_cdss = feat_cds.import_user_cdss(genome, cfg.regions)
  File "/crex/proj/uppstore2017270/common_tools/bakta/bakta/features/cds.py", line 193, in import_user_cdss
    contigs_by_original_id = {c['orig_id']: c for c in genome['contigs']}
  File "/crex/proj/uppstore2017270/common_tools/bakta/bakta/features/cds.py", line 193, in <dictcomp>
    contigs_by_original_id = {c['orig_id']: c for c in genome['contigs']}
KeyError: 'orig_id'

oschwengers · 2023-11-28T12:32:31Z

@thorellk, mea culpa, I've overlooked the --keep-contig-headers situation, in which there is no original contig id (orig_id). This should be fixed by d3d7a98.
Could you please confirm this? I also added another CI test to cover these cases.

thorellk · 2023-12-01T09:35:36Z

Hi @oschwengers, now I don't get that error anymore, but instead it's complaining about my GFF files. The GFF files are from the NCBI PGAP pipeline and should be fairly standard. Unfortunately, the error message is very general, so I don't know how to troubleshoot. The files are still the ones that I attached above.

HpGP-26695-ATCC has chromosome id CP079087
parse genome sequences...
        imported: 1
        filtered & revised: 1
        contigs: 1

start annotation...
predict tRNAs...
        found: 37
predict tmRNAs...
        found: 1
predict rRNAs...
        found: 6
predict ncRNAs...
        found: 10
predict ncRNA regions...
        found: 1
predict CRISPR arrays...
        found: 0
predict & annotate CDSs...
        predicted: 1573 
        discarded spurious: 0
        revised translational exceptions: 0
ERROR: User-provided regions/features file GFF3 format not valid!

oschwengers · 2023-12-01T10:13:40Z

OK, after renaming the fsa header to CP079087 I can reproduce this error. The logs provide further information:

11:08:20.753 - ERROR - CDS - user-provided CDS could not be translated into a valid amino acid sequence! contig=contig_1, start=50878, stop=52099, cds=ATTGTTGCTTGTTTCTTGCTTTTTAACGCTATTGACCCTTTTAATTTAGGGGTGTTGTTGAGCCGTTTCCAAATTAAAAATGGTTGTATTTATGGGGTGTGTTCTTATAAGGCTTCAAAATCTGTCTATGGCTATGAAGAAAGCAAAGCACAGGTTTTAAACGCTCTCAATACTTTAAGCGTGCATCCAATTTGGCAATCCAATCAAGAAAGCGTTACAAAAATCAAAGGAACTTTTGTTTTCATTTTAGAAAACGACTTGCATTTAGACGAAAACTCTTTTTACAAGAAACTTTTAAACTCGCTCATAGACAACGATTTTTTTAACCGCTCCCATTCAATGACCCCCAATCAAAAACGCTTTTTGAGCGGCTTTTTTGAAAGCAGGGGCAGCATTGATACGCAACGAAATTTTTTGACTTTAGATTACTTCTTTCATAGCCCTTTAGAGTTTAAAAAGTTCCATTATTTAATTGATTTTTTCAATATCCCTAGCGAAGCGCTGAATTTCAATTTCAGGGAATTACAGCCTGAATACGCGCAAGGCATTAACCAACGAAACGCTCAATTCAGGATTTATTTAGATTGGTATTTACACCATATCGGTCTGTTTAACCCTTATAAAGCGCGAATCGCTGAACATGTTTTTAAAACCACTCTTGCTCATGATGGCATTTATTATAAATTAAACTACCCGCCAACAACAAAGTATCATGGTAATAGCTTTACAGAATGCGCTCATTTTTATTTGAAAAACATTTATCAACAGGATTTAGATGATAAAAGCATTGAAAAATTAAGGGAGCAGTTAGGCTTTATTCAAAAGAGCGAGGAGTTTAGACGAGATAGCAAAATCATCAATCTTTATCGCCTTTCAACGCCTAATGTTTGCAGTGCATGCTGCGATGATTACGACATTAAAGAAAGAAGTTTTCTTTCTTTACCTTTATATCAAATCACTCAAAATCCCGATTCCTACTACACTGAAATACATGATTTCTTTAGGCAAAATCAGAGAATTAGATGTTTTAGCAAATCTTGCTAAACTTTGCCCTACTTGTCATAGGGCTTTAAAAAAAGGATCTAGCGAAGAGGAGTTTCAAAAACGCTTGATTAGAAACATTCTCAATCGCAATAAAGACAATTTAGAGTTTGCGCAATTGCGTTTTGAAACCGATGATTTTTCAACGCTTATTGATCGTATTTGTGAAAGCTTGAAATGA
11:08:20.753 - ERROR - CDS - user-provided regions/features file GFF3 format not valid!
Traceback (most recent call last):
  File "/home/oliver/miniconda3/lib/python3.10/site-packages/bakta/features/cds.py", line 224, in import_user_cdss
    aa = str(Seq(nt).translate(table=cfg.translation_table, cds=True))
  File "/home/oliver/miniconda3/lib/python3.10/site-packages/Bio/Seq.py", line 1448, in translate
    _translate_str(str(self), table, stop_symbol, to_stop, cds, gap=gap)
  File "/home/oliver/miniconda3/lib/python3.10/site-packages/Bio/Seq.py", line 2792, in _translate_str
    raise CodonTable.TranslationError(
Bio.Data.CodonTable.TranslationError: Sequence length 1222 is not a multiple of three

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/oliver/miniconda3/lib/python3.10/site-packages/bakta/features/cds.py", line 229, in import_user_cdss
    raise ValueError(f"User-provided CDS could not be translated into a valid amino acid sequence! contig={user_cds['contig']}, start={user_cds['start']}, stop={user_cds['stop']}, cds={nt}")
ValueError: User-provided CDS could not be translated into a valid amino acid sequence! contig=contig_1, start=50878, stop=52099, cds=ATTGTTGCTTGTTTCTTGCTTTTTAACGCTATTGACCCTTTTAATTTAGGGGTGTTGTTGAGCCGTTTCCAAATTAAAAATGGTTGTATTTATGGGGTGTGTTCTTATAAGGCTTCAAAATCTGTCTATGGCTATGAAGAAAGCAAAGCACAGGTTTTAAACGCTCTCAATACTTTAAGCGTGCATCCAATTTGGCAATCCAATCAAGAAAGCGTTACAAAAATCAAAGGAACTTTTGTTTTCATTTTAGAAAACGACTTGCATTTAGACGAAAACTCTTTTTACAAGAAACTTTTAAACTCGCTCATAGACAACGATTTTTTTAACCGCTCCCATTCAATGACCCCCAATCAAAAACGCTTTTTGAGCGGCTTTTTTGAAAGCAGGGGCAGCATTGATACGCAACGAAATTTTTTGACTTTAGATTACTTCTTTCATAGCCCTTTAGAGTTTAAAAAGTTCCATTATTTAATTGATTTTTTCAATATCCCTAGCGAAGCGCTGAATTTCAATTTCAGGGAATTACAGCCTGAATACGCGCAAGGCATTAACCAACGAAACGCTCAATTCAGGATTTATTTAGATTGGTATTTACACCATATCGGTCTGTTTAACCCTTATAAAGCGCGAATCGCTGAACATGTTTTTAAAACCACTCTTGCTCATGATGGCATTTATTATAAATTAAACTACCCGCCAACAACAAAGTATCATGGTAATAGCTTTACAGAATGCGCTCATTTTTATTTGAAAAACATTTATCAACAGGATTTAGATGATAAAAGCATTGAAAAATTAAGGGAGCAGTTAGGCTTTATTCAAAAGAGCGAGGAGTTTAGACGAGATAGCAAAATCATCAATCTTTATCGCCTTTCAACGCCTAATGTTTGCAGTGCATGCTGCGATGATTACGACATTAAAGAAAGAAGTTTTCTTTCTTTACCTTTATATCAAATCACTCAAAATCCCGATTCCTACTACACTGAAATACATGATTTCTTTAGGCAAAATCAGAGAATTAGATGTTTTAGCAAATCTTGCTAAACTTTGCCCTACTTGTCATAGGGCTTTAAAAAAAGGATCTAGCGAAGAGGAGTTTCAAAAACGCTTGATTAGAAACATTCTCAATCGCAATAAAGACAATTTAGAGTTTGCGCAATTGCGTTTTGAAACCGATGATTTTTCAACGCTTATTGATCGTATTTGTGAAAGCTTGAAATGA

So there's a gene on the reverse strand at [50878, 52099] whose coding sequence length (1222 bp) is not a multiple of 3 and thus causes this error:

CP079087        GenBank gene    50878   52099   .       -       .       ID=KVE98_00260;locus_tag=KVE98_00260
CP079087        GenBank mRNA    50878   52099   .       -       .       ID=KVE98_00260.mRNA.0;Parent=KVE98_00260
CP079087        GenBank CDS     50878   52099   .       -       0       ID=KVE98_00260.mRNA.0.CDS.1;Parent=KVE98_00260.mRNA.0
CP079087        GenBank exon    50878   52099   .       -       .       ID=KVE98_00260.mRNA.0.exon.1;Parent=KVE98_00260.mRNA.0
CP079087        GenBank polypeptide     50878   52099   .       -       .       ID=KVE98_00260.polypeptide.0;Parent=KVE98_00260.mRNA.0;product_name=HNH endonuclease

After removing this CDS, there are more of these. As far as I know, a CDS should always consist of triplets.
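
To list all such entries at once instead of hitting them one by one, a check along these lines could be run on the GFF3 file beforehand. This is a sketch, not Bakta code: the function name is made up, and it treats every CDS line on its own, so a CDS split over several lines would need extra handling.

```python
def non_triplet_cds(gff3_text):
    """Return (seq_id, start, stop, strand, length) for CDS lines whose length is not a multiple of 3."""
    offending = []
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequences follow, no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, stop = int(cols[3]), int(cols[4])
        length = stop - start + 1  # GFF3 coordinates are 1-based and inclusive
        if length % 3 != 0:
            offending.append((cols[0], start, stop, cols[6], length))
    return offending


# The CDS line from above plus a made-up intact CDS for comparison.
example = (
    "##gff-version 3\n"
    "CP079087\tGenBank\tCDS\t50878\t52099\t.\t-\t0\tID=KVE98_00260.mRNA.0.CDS.1;Parent=KVE98_00260.mRNA.0\n"
    "CP079087\tGenBank\tCDS\t100\t399\t.\t+\t0\tID=intact_cds\n"
)
problems = non_triplet_cds(example)
print(problems)  # [('CP079087', 50878, 52099, '-', 1222)]
```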

thorellk · 2023-12-03T13:39:17Z

Yes, one would definitely expect CDS to contain even triplets. This is official NCBI PGAP annotation and I checked the accompanying protein fasta file. The fasta header for that entry is HpGP-26695-ATCC|KVE98_00260|KVE98_00260|HNH endonuclease|50878:52099 Reverse|frameshifted, which would explain the non-triplet number of bases. Is it possible to get the pipeline to just skip those entries rather than terminating?

thorellk · 2023-12-29T19:23:03Z

Hi again @oschwengers. I am sorry to push for this but do you think there is any way to work around this issue? We have several projects where we work with NCBI annotated genomes and we want to keep the gene coordinates and locus tags but improve the functional annotation. If you don't think it will be possible with bakta, do you have any other suggestion? I have tried for example liftoff but it is not at all as versatile.

thorellk · 2024-01-09T17:02:11Z

I guess this issue may have a similar solution as #262?

oschwengers · 2024-07-17T13:44:26Z

Hey @thorellk, I'm very sorry for not having responded earlier - this just somehow slipped through. Just in case this is still of interest, I think we could skip the strict triplet checks for pseudogenes, as indicated in this case by the Reverse|frameshifted tag. However, in these cases, there should be a pseudo=true attribute added in GFF3 column 9.
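
Preparing the input along those lines could look roughly like the sketch below, which appends pseudo=true to column 9 of every CDS line whose length is not a multiple of three. This only illustrates the attribute convention proposed here; whether Bakta then skips the triplet check for such entries is not confirmed in this thread, and using a non-triplet length as the criterion for "pseudogene" is an assumption of the sketch. The function name is made up.

```python
def tag_non_triplet_cds_as_pseudo(gff3_text):
    """Append pseudo=true to CDS lines whose length is not a multiple of three."""
    tagged = []
    in_fasta = False
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            in_fasta = True  # embedded sequences follow; copy them unchanged
        cols = line.split("\t")
        if not in_fasta and not line.startswith("#") and len(cols) == 9 and cols[2] == "CDS":
            length = int(cols[4]) - int(cols[3]) + 1
            # "." means "no attributes" in GFF3
            attributes = [] if cols[8] == "." else [a for a in cols[8].split(";") if a]
            if length % 3 != 0 and "pseudo=true" not in attributes:
                cols[8] = ";".join(attributes + ["pseudo=true"])
                line = "\t".join(cols)
        tagged.append(line)
    return "\n".join(tagged) + "\n"


# The frameshifted CDS from this thread plus a made-up intact CDS.
example_gff3 = (
    "##gff-version 3\n"
    "CP079087\tGenBank\tCDS\t50878\t52099\t.\t-\t0\tID=KVE98_00260.mRNA.0.CDS.1;Parent=KVE98_00260.mRNA.0\n"
    "CP079087\tGenBank\tCDS\t100\t399\t.\t+\t0\tID=intact_cds\n"
)
tagged_gff3 = tag_non_triplet_cds_as_pseudo(example_gff3)
print(tagged_gff3, end="")
```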

oschwengers added the feature label Oct 24, 2023

oschwengers added this to the Backlog milestone Oct 24, 2023

This was referenced Oct 24, 2023

When using --proteins, are the sequences only used for annotation? #216

Closed

Is it possible to import my own cds fasta file for annotation? #245

Closed

Transfer annotations from similar genome #247

Open

oschwengers mentioned this issue Nov 10, 2023

Increasing genome annotation: integrating StORF-Reporter functionality into bakta #254

Closed

oschwengers modified the milestones: Backlog, v1.9.0 Nov 17, 2023

oschwengers self-assigned this Nov 17, 2023

oschwengers mentioned this issue Nov 17, 2023

Support a priori user-provided feature regions #259

Merged

marade mentioned this issue Nov 21, 2023

add scripts/make-user-region-GFFs.py #260

Open

oschwengers added a commit that referenced this issue Nov 27, 2023

catch non-existing seq ID in provided regions #250

8822b9f

oschwengers added a commit that referenced this issue Nov 27, 2023

add region test on wrong seq ID #250

0b96422

oschwengers closed this as completed Nov 27, 2023

oschwengers added a commit that referenced this issue Nov 28, 2023

fix missing original contig id in region logic when using --keep-contig-headers #250

d3d7a98
