Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improper rotation when dnaA hits cross the zero coordinate #90

Closed
marade opened this issue Dec 10, 2024 · 10 comments
Closed

improper rotation when dnaA hits cross the zero coordinate #90

marade opened this issue Dec 10, 2024 · 10 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@marade
Copy link

marade commented Dec 10, 2024

Pseudomonas aeruginosa chromosomes are often improperly rotated. For example many chromosome annotations hosted at NCBI have dnaA spanning the 0 coordinate, with a telltale join annotation like 'join(6478288..6478686,1..1146)'. I could cite dozens or maybe hundreds of these, but here are just a few:

(https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP147557.1?report=gbwithparts&log$=seqview)

https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP140615.1?report=gbwithparts&log$=seqview

https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP100653.1?report=gbwithparts&log$=seqview

I think this happens because P. aeruginosa has multiple dnaA boxes, which can fool annotation software into guessing that the dnaA gene starts later in the sequence than it actually does. Some details here:

https://pmc.ncbi.nlm.nih.gov/articles/PMC4119464/#S2

Most of these improperly rotated chromosomes can be corrected by rotating clockwise by 399bp, and dnaA then starts after the zero coordinate, as is the convention for bacterial annotations.

Looking at the dnaapler MMseqs output for the first example (GN04821), unfortunately the first 300 hits are wrong, and we finally start getting good hits with sp~B7V0N6~DNAA_PSEA8 matching to GN04821 with start position 6961242. If I rotate the contig by 2k clockwise and run dnaapler again, sp~B7V0N6~DNAA_PSEA8 is the top hit. This highlights a problem with linear references / alignment being used for circular problems like rotation, where the rotation of the contig before MMseqs is run leads dnaapler to the wrong conclusion.

To prevent problems like this, I wonder if you might chain hits crossing the zero coordinate? Or perhaps do one MMseqs run, then rotate all contigs halfway, then another MMseqs run? Or is there a better solution?

@oschwengers
Copy link

I think this is an important and obviously also often issue. $0.02 that I could add here is that also very often, on rotated chromosomes it is very hard to get the dnaA detection/annotation right. The issue is, that P(y)rodigal often fails to correctly predict dnaA at position 1 because the corresponding RBS is not properly detected and thus, the gene (if detected at all) is often detected as a partial hit, though 100% of its aa seq is present and does not cross the seq border. A simple solution might be to rotate the chrom by additional x basepairs. However, I'm not sure what a good value for x could be, maybe something between 50 bp to 100 bp?

@gbouras13 gbouras13 added enhancement New feature or request question Further information is requested labels Dec 10, 2024
@gbouras13
Copy link
Owner

Hi @marade @oschwengers ,

Thanks for this insightful discussion! I am partial to the suggestion by @marade of x being half the length of the contig - making it a fixed number will be tricky and inconsistent just because the relevant gene isn't just dnaA, and the smallest replicons (e.g. plasmids) might only be a couple thousand bases long.

Luckily with the change to MMSeqs2, simply running it twice isn't really a compute limitation question now given it is almost instant (whereas with BLAST it did take a while typically), so twice is fine.

Something like:

  1. rotate all replicons by half the length of the contig
  2. Run MMSeqs2 on both of the original and the half-rotated contigs
  3. Combine both outputs, and select the hit with the largest bitscore/lowest evalue per replicon

Seem reasonable?

In terms of actually doing this - maybe next week, I will need to find some time :)

George

@oschwengers
Copy link

oschwengers commented Dec 10, 2024

Hey @gbouras13 , sorry, I guess I wasn't specific enough. My suggestion for x did not relate to @marade 's suggestion for a 2nd round of "blasting" rep genes on a 1/2-rotated chromosome - this is absolutely straight forward and I would also vote for this. My suggestion addressed the question to which position a replicon should finally be re-oriented, once we have detected a proper dnaA/rep/etc gene. If we re-orientate a chrom so that dnaA is located at position 1, then P(y)rodigal often fails to properly detect this important gene, because its ribosomal binding site (RBS) is then located at the other end (linear perspective). This often results in wrong dnaA gene predictions (wrong coordinates) or correct coordinates but this gene being predicted as partial as P(y)rodigal can't detect its RBS.

To get some data into this dark matter, I extracted the intergenic regions for all 661k ATB/BakRep genomes around dnaA genes cropped at +/- 100 bp, and plotted their length:

$ awk '$1 > -100' dnaa-intergenic.txt | awk '$1 < 100' | hist -x -r -b 100

 92329|                                                                                       o              
 87470|                                                                                       o              
 82610|                                                                                       o              
 77751|                                                                                       o              
 72892|                                                                                       o              
 68032|                                            o                               o          o              
 63173|                                            o                               o          o              
 58313|                                            o                               o          o              
 53454|                                            o                               o          o              
 48595|                                            o                               o          o              
 43735|                                            o                               o          o              
 38876|                                            o                               o          o              
 34016|                                            o                               o          o              
 29157|                                            o                               o          o              
 24298|                                            o                 o             o          o              
 19438|                                            o                 o             o          o              
 14579|                                            o                 o             o          o              
  9719|                                            o                 o             o          o              
  4860|                                            o   o             o             o          o              
     1| o o              o  oo o oooooooooo oooooooo ooo oooooooooooooooooooooooooooooooooooooooooooooooooooo
       -----------------------------------------------------------------------------------------------------
       - - - - - - - - - - - - - - - - - - - - - - - - - 3 7 1 1 1 2 2 3 3 3 4 4 4 5 5 6 6 6 7 7 7 8 8 9 9 9 
       9 8 8 8 7 7 6 6 6 5 5 4 4 4 3 3 3 2 2 1 1 1 7 4 0 . . 1 4 8 2 6 0 4 7 1 5 9 3 6 0 4 8 2 6 9 3 7 1 5 9 
       2 8 4 0 6 2 9 5 1 7 3 9 6 2 8 4 0 7 3 9 5 1 . . . 5 3 . . . . . . . . . . . . . . . . . . . . . . .   
         . . . . . . . . . . . . . . . . . . . . . 9 1 3   2 1 9 7 6 4 2 0 8 7 5 3 1 9 8 6 4 2 0 9 7 5 3 1   
         1 3 5 7 9 0 2 4 6 8 9 1 3 5 7 8 0 2 4 6 7 6 4 2     4 6 8   2 4 6 8   2 4 6 8   2 4 6 8   2 4 6 8   
         8 6 4 2   8 6 4 2   8 6 4 2   8 6 4 2   8                                                           
                                                                                                             
                                                                                                             

-----------------------------------
|             Summary             |
-----------------------------------
|       observations: 277874      |
|      min value: -92.000000      |
|         mean : 37.981862        |
|       max value: 99.000000      |
-----------------------------------

I'm just posting this here to add my thoughts on this. However, I am aware that changing the position by a certain offset would break with the "dnaA at pos 1" convention and that ultimately this might be a bug that should be addressed best within P(y)rodigal itself.

@marade
Copy link
Author

marade commented Dec 10, 2024

@gbouras13 sounds like a good plan. I appreciate your work on this excellent tool.

Regarding @oschwengers comments about the final gene position, I think it's more important to not have features crossing the zero coordinate and have dnaA / rep genes / etc be the first gene than to have the first gene start at position 1. If it starts at say, position 100 or 400, that would still be reasonable in my opinion, so long as the other conditions are met. But as mentioned, perhaps this isn't best addressed with your tool.

@gbouras13
Copy link
Owner

@oschwengers - I see your point now. It will propagate the bakta annotations as partial. I actually think this might be a good issue to raise with Martin in the pyrodigal repository - let me raise an issue.

George

@oschwengers
Copy link

Yeah I think that if Pyrodigal would be able to detect RBS related to pos 1 dnaA /repA genes, leaving the pos 1 convention unchanged might be best. I totally agree that changing such a convention might be tricky and thus I totally understand your reluctance in this point. Sorry for kind of hijacking this issue with this related topic. I'll jump over with my discussion on this to the Pyrodigal issue you've opened.

Coming back to the initial issue, thanks a lot for addressing this. This will be a very good and nice improvement to an already super cool tool! Thanks.

@gbouras13
Copy link
Owner

Hi @oschwengers @marade,

I got around to implementing a fix to allow detection of the desired genes across contig ends. It will be available in v1.1.0 shortly. Please let me know if it gives the desired output - I have added some tests in CI and tried it out on a bunch locally, but of course that is no guarantee of anything :)

George

@marade
Copy link
Author

marade commented Jan 21, 2025

Testing this change, so far the results are indeed better. I imagine over time we might encounter cases where the new scheme doesn't work, but I haven't found any yet. Thanks for this improvement!

@marade
Copy link
Author

marade commented Feb 13, 2025

I found a few genomes that are not rotating correctly, despite the 1.1.0 changes. Perhaps there needs to be some accounting for mismatch and fident?

PA492
https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP178225.1

dnaA 21571-23115
hit 2 is correct; hit 1 has a higher alnlen, nident, and bits, but also a much higher mismatch and a much lower fident

===================================================

SM45
https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP177242.1

dnaA 6538114-6538512,1-1146
hit 2 is correct; hit 1 has a higher alnlen, nident, evalue, and bits, but also a much higher mismatch and a much lower fident

===================================================

YB1
https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP172962.1
https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP172963.1

dnaA 4773898-4775442
hit 2 is correct; hit 1 has a higher alnlen, nident, and bits, but also a much higher mismatch and a much lower fident

trfA 122872-123963
no correct hits

@gbouras13
Copy link
Owner

Hi @marade ,

Can you explain this in more detail what the issue is and can you confirm you are using v1.1.0? I think you might be using the old version. If not, can you upload your dnaapler outputs?

For dnaapler v1.1.0 it seems to be working fine for me.

e.g. PA492

Contig	Gene_Reoriented	Start	Strand	Top_Hit	Top_Hit_Length	Covered_Length	Coverage	Identical_AAs	Identity_Percentage	Overlapping_Contig_End
NZ_CP178225.1 Pseudomonas aeruginosa strain PA492 chromosome, complete genome	dnaA	21571	forward	sp~B7V0N6~DNAA_PSEA8	514	514	100.0	513	99.81	False

The top hit (which Dnaapler reorientated the chromosome with ) is:

NZ_CP178225.1	7021937	sp~B7V0N6~DNAA_PSEA8	514	514	21571	23112	1	514	0.998	513	0	1	0.000E+00	1008	VSVELWQQCVDLLRDELPSQQFNTWIRPLQVEAEGDELRVYAPNRFVLDWVNEKYLGRLLELLGERGEGQLPALSLLIGSKRSRTPRAAIVPSQTHVAPPPPVAPPPAPVQPVSAAPVVVPREELPPVTTAPSVSSDPYEPEEPSIDPLAAAMPAGAAPAVRTERNVQVEGALKHTSYLNRTFTFENFVEGKSNQLARAAAWQVADNLKHGYNPLFLYGGVGLGKTHLMHAVGNHLLKKNPNAKVVYLHSERFVADMVKALQLNAINEFKRFYRSVDALLIDDIQFFARKERSQEEFFHTFNALLEGGQQVILTSDRYPKEIEGLEERLKSRFGWGLTVAVEPPELETRVAILMKKAEQAKIELPHDAAFFIAQRIRSNVRELEGALKRVIAHSHFMGRPITIELIRESLKDLLALQDKLVSIDNIQRTVAEYYKIKISDLLSKRRSRSVARPRQVAMALSKELTNHSLPEIGVAFGGRDHTTVLHACRKIAQLRESDADIREDYKNLLRTLTT	MSVELWQQCVDLLRDELPSQQFNTWIRPLQVEAEGDELRVYAPNRFVLDWVNEKYLGRLLELLGERGEGQLPALSLLIGSKRSRTPRAAIVPSQTHVAPPPPVAPPPAPVQPVSAAPVVVPREELPPVTTAPSVSSDPYEPEEPSIDPLAAAMPAGAAPAVRTERNVQVEGALKHTSYLNRTFTFENFVEGKSNQLARAAAWQVADNLKHGYNPLFLYGGVGLGKTHLMHAVGNHLLKKNPNAKVVYLHSERFVADMVKALQLNAINEFKRFYRSVDALLIDDIQFFARKERSQEEFFHTFNALLEGGQQVILTSDRYPKEIEGLEERLKSRFGWGLTVAVEPPELETRVAILMKKAEQAKIELPHDAAFFIAQRIRSNVRELEGALKRVIAHSHFMGRPITIELIRESLKDLLALQDKLVSIDNIQRTVAEYYKIKISDLLSKRRSRSVARPRQVAMALSKELTNHSLPEIGVAFGGRDHTTVLHACRKIAQLRESDADIREDYKNLLRTLTT

and the second hit is

NZ_CP178225.1	7021937	sp~Q5FUU1~DNAA_GLUOX	479	252	2030203	2030916	127	375	0.257	65	3	170	2.618E-17	81	LSAQPGMKPIQLPLSVRLRDDATFANYYPG-ANAAALGYVERLCEAEAGWAESLIYLWGHDGVGRSHLLQA-------------ACLRFEQFEERTIYLPMADLVQYGPEIFDDLEQCELVCIDDLDVLVGKREWEEGLFHLFNRLRDTGRRLLLAASKSPRELQVKLPDLKSRLTMALIFQLHGLSDEDKLRALQLRASRRGLHLTDEVGRFILNRGSRSMNSLFDLLEQLDRASLQAQRKLTIPFLKETL	LATQAAAPESRSDLAVPLDPRFTFDTFIVGKPNEFAYACARRVAEKPSSPGFNPLFLYGGVGLGKTHLMHAIGTELTRTGKVSVAYMSAEKFMYRFI---AAIRSQSTMEFKEQLRSVDVLMIDDLQFLIGKDNTQEEFFHTFNALVDAGRQIIVSADKSPSDLSGLEDRLRTRLGCGMVADIHATTFELRISILEAKAKASGTHVPSKVLEYLAHKITTNVRELEGALNRLIAHADLVGRPVTLDTTQDVL

Looking at the genbank record, this is correct - the CDS from 21571..23115

Same story for SM45 - looks like it recovered the correct hit

Contig  Gene_Reoriented Start   Strand  Top_Hit Top_Hit_Length  Covered_Length  Coverage        Identical_AAs   Identity_Percentage        Overlapping_Contig_End
NZ_CP177242.1 Pseudomonas aeruginosa strain SM45 chromosome, complete genome    dnaA    6538114 forward sp~B7V0N6~DNAA_PSEA8       514     514     100.0   512     99.61   True

which matches Genbank.

With the plasmid, it found a gene with decent homology repA (no integrase trfA) annotated as hypothetical in the Genbank. For this there simply may be no good homologs I guess in the dnaapler database if you were looking for trfA, as Dnaapler uses terL.

Contig	Gene_Reoriented	Start	Strand	Top_Hit	Top_Hit_Length	Covered_Length	Coverage	Identical_AAs	Identity_Percentage	Overlapping_Contig_End
NZ_CP172963.1 Pseudomonas aeruginosa strain YB1 plasmid pYB1, complete sequence	repA	342702	reverse	UniRef90_C4NUQ3	366	340	92.9	109	32.06	False

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants