-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for long reference sequences #900
Comments
as far as I know the max length of a chromosome is max(int32)= 2,147,483,647 so there shouldn't have any problem on this side. There may be a problem when you're using a binning index (like bam.bai or tabix.tbi) : where the max size is only 512Mb (http://genomewiki.ucsc.edu/index.php/Bin_indexing_system ). But you can always use a csi index: https://www.biostars.org/p/111984/ |
Yes, the problem is the index. Does htsjdk support csi indexes? My understanding is it does not. I'm looking at the source code but don't see anything. |
It doesn't support CSI as far as I know. |
How much of a burden would it be to have support for csi indices?
…On Fri, Jun 16, 2017 at 10:02 AM, Louis Bergelson ***@***.***> wrote:
It doesn't support CSI as far as I know.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#900 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACnk0pfnDdfi3komGZOeL1zaw8ATPB2lks5sEor2gaJpZM4N7i9G>
.
|
I'm not really sure, but we haven't always done a great job of handling index related things in a clean way, and I suspect we have client code that relies on the details of the bai index since it's been the only one we support for bam. I'd guess that it would be a fair amount of work to implement. |
from what I saw online, bai is a specialization of csi....so it might not be so bad...it would be good for someone knowledgable about the format to step in though.... |
Thanks all for looking into this: hoping someone capable (I am not capable) is interested in taking this on at some point. Janet |
There is another issue open regarding CSI index support in htsjdk: #447 |
Hi, Picard has a PR that allows to work with CSI index (from #1040) But I need htsjdk to be able to write CSI as well (example of Wheat genome):
How can I help? |
I think GATK SHOULD work with bams with a CSI index already as well. There's no VCF support at the moment which will probably be problematic though. I don't think we have any plan to implement it on our own in the near future. We just don't have the bandwidth to look into it, especially with the current pandemic craziness which is significantly cutting into people's coding time. However, if you're interested in contributing and have a fairly significant chunk of time on your hands we'd be able to work with you on getting a CSI writer into htsjdk. In theory it shouldn't be too hard, it's basically just a tiny modification to the bai writer. I suspect there will be a lot of pain points though with how the code is structured and (poorly) tested. @cmnbroad IS possibly going to be trying to modernize some of the index code it in the near future. He might be able to offer some guidance as to where to start if you're interested. We'd love the help, but it will probably take a while. |
Hey @lbergelson @cmnbroad hang in there, must be crazy where you are. RE index modernization, ping me if you do that, I'm interested in modernizing the igv.js code as well, might be some synergy. Currently both bam and tabix indexes are handled by the same code, with a few switches, this in turn was adapted from even older code. At least its relatively small https://github.com/igvteam/igv.js/blob/master/js/bam/bamIndex.js |
Perhaps the code developed for use by JBrowse could help? BAM Index
Tabix Index |
Thank you for all these comments. |
Hi Guys, We start seeing more and more genomes where the contigs exceed the limits of the .bai index. Not being able to write a .csi index is a significant limitation when working with large, well assembled, plant genomes. Adding this functionality would be very much appreciated. |
there was a project samtools/htsjdk-next-beta which aim was to implement long REFs (eg.: samtools/htsjdk-next-beta#6 ) . But the project looks inactive (dead ?) now. |
Yes, sadly no activity there. |
My group (Buckler Lab at Cornell University) also has a need for VCF CSI support for software we write/maintain. Our open source code (Buckler Lab PHG) uses gvcf files to store variant information for plant genomes and has run into problems working with the larger genomes .e.g wheat. Has anything changed in terms of scheduling CSI support for VCF? We would be interested in working on this in the htsjdk code base if there were someone with whom we could consult. @lbergelson @cmnbroad (or anyone else) will you comment/ update on where this stands? Thanks |
Hi there,
I'd like to add another voice for providing support for longer reference sequences: it already seems useful for some genomes, and as better assembled and larger genomes come out it seems like the need will get more frequent.
I've been working a little with the opossum monDom5 assembly where two chromosomes too long to be handled by IGV (chr1 is 748 Mb and chr2 is 541 Mb). I've been told it's an underlying limitation of htsjdk that creates this issue for IGV with longer reference sequences, so I'm posting the request here too.
Thanks for considering it,
Janet Young
Dr. Janet Young
Malik lab
http://research.fhcrc.org/malik/en.html
Division of Basic Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.
tel: (206) 667 4512
email: jayoung ...at... fredhutch.org
The text was updated successfully, but these errors were encountered: