issues parsing genbank files produced by Pharokka #339
Hi @vmkhot, thanks for this. The 2021-2 me who wrote Pharokka was pretty average at coding. A little bit improved now, I hope! To be honest, Pharokka needs a complete refactor at some point when I get some time. The issue stems from this part of Pharokka: Line 760 in 40c78f6
My question here is: is the wrapping of the ID and locus tag lines an issue, or only the other qualifiers (VFDB, CARD, etc.)? I also think that 'function' might be an issue with DNA, RNA metabolism. If you come across more, please let me know. If the ID and locus tags are too long, I am not sure what fix I can put in other than truncating them (which will almost certainly cause issues inside Pharokka), and truncating may make them no longer unique between CDS. For these, I would say the user really needs to rename their contig IDs and/or use shorter locus tags. If it is only the VFDB/CARD qualifiers, then I agree that I should remove all spaces/quotes/brackets, which should be an easy fix. In hindsight it seems pretty stupid of me to have them like this. George
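For illustration only, the kind of clean-up described above (dropping quotes and brackets, getting rid of spaces) could look roughly like the sketch below. This is not Pharokka's actual code: the function name `clean_metadata` is hypothetical, and replacing whitespace with underscores rather than deleting it outright is an assumption made here to keep words readable.

```python
import re

# Characters that tend to break GenBank qualifier formatting: quotes and brackets.
_STRIP_CHARS = re.compile(r"[\"'()\[\]{}]")
# Any run of whitespace (spaces, tabs, newlines).
_WHITESPACE = re.compile(r"\s+")


def clean_metadata(text):
    """Hypothetical helper: make a CARD/VFDB metadata string format-safe.

    Quotes and brackets are removed, and each run of whitespace is
    replaced with a single underscore (an assumption, not Pharokka's rule).
    """
    without_punctuation = _STRIP_CHARS.sub("", text)
    return _WHITESPACE.sub("_", without_punctuation.strip())


if __name__ == "__main__":
    print(clean_metadata('Type III secretion system (T3SS) "effector"'))
```

Hyphens and other characters are left alone in this sketch, so the result would still need checking against whatever the downstream parser accepts.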
Hi George, Thanks for your reply!
There might be more instances that are problematic, but our workaround skipped over these, so this is the only one I know of. In terms of renaming or truncating IDs and locus tags: I strongly prefer it when programs don't auto-rename data, as I often use that information downstream to map results back and forth. I agree that the contig names are way too long in my dataset. Typically, my workflows include renaming my bins and contigs to meaningful headers before using programs like Pharokka, but the gbk files were not generated by me, so I'm just trying to make the most of what's available :)
Thanks @vmkhot, I'm glad the locus tags and IDs are OK. I should be able to fix this in the next update of Pharokka by cleaning up the format-breaking metadata from CARD and VFDB (not that it will necessarily be helpful for you, but it will still be helpful downstream!). Thanks for alerting me. George
Hi @vmkhot, I've put in a fix to solve this issue (I hope) and it will be available in v1.7.3 soon. Regarding your data, I see you're in Jena with Bas. I think I was probably involved in generating it :) and I have some improvements to make with https://github.com/gbouras13/phold coming soon. Best to move the chat over to email if you'd like, [email protected] George
Hello,
While trying to parse gbk files produced by Pharokka using Biopython, I ran into this error: 4694
Essentially, some of the qualifier keys in the GenBank records are too long and wrap onto the next line, and Biopython has no way to handle this.
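In case it is useful for spotting affected records before parsing, here is a rough, standard-library-only sketch that scans the feature table for suspicious qualifier lines. It is not part of Biopython or Pharokka. The function name `find_suspect_qualifiers` is hypothetical, and it assumes the usual flat-file layout: qualifiers start with `/` in column 22, lines are at most 80 columns wide, and keys contain only letters, digits and underscores.

```python
import re

QUALIFIER_INDENT = " " * 21  # qualifiers conventionally start in column 22
MAX_LINE_WIDTH = 80          # conventional GenBank flat-file line width
SAFE_KEY = re.compile(r"^[A-Za-z0-9_]+$")


def find_suspect_qualifiers(lines):
    """Return (line_number, key, reason) tuples for risky qualifier lines."""
    suspects = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.startswith(QUALIFIER_INDENT + "/"):
            continue  # not the first line of a qualifier
        # The key is everything between "/" and the first "=" (if any).
        key = line[len(QUALIFIER_INDENT) + 1:].split("=", 1)[0]
        if not SAFE_KEY.match(key):
            suspects.append((number, key, "unusual characters in key"))
        elif len(line) > MAX_LINE_WIDTH:
            suspects.append((number, key, "line exceeds 80 columns"))
    return suspects


if __name__ == "__main__":
    import sys

    # Usage: python check_gbk.py some_file.gbk
    with open(sys.argv[1]) as handle:
        for number, key, reason in find_suspect_qualifiers(handle):
            print(f"line {number}: /{key} ({reason})")
```

A continuation line of a qualifier value that happens to begin with `/` would be flagged as well, so any hits are worth checking by eye.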
I raised it with the Biopython developers and ended up editing the file parser in the Bio package itself (scanner.py) as a workaround.
Their suggestion was that I also reach out to you with the error, so that perhaps you can fix these wrap-around keys.
They also added a warning when writing GenBank files with sketchy qualifier keys: 4703
Thanks! and thanks for your tool too :)
Varada