-
-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tabix: change Chunks signature to resemble csi and bam #46
Conversation
API-breaking change.
Did you mean instead to have sig |
No. There is no way to have them match that doesn't significantly degrade the API. CSI and BAM are binary protocols that do not retain a string representation of the reference name in the index while tabix is a textish protocol that does. Removing the text representation from tabix and adding an int <-> string mapping API seems like to lose to me. You would need to have a Fundamentally, while they use the same broad indexing approach, they are not the same. |
I don't feel strongly either way, but I don't see the benefit of an API change if it doesn't unify with others signatures. |
We would still have bam indexing being different. I guess I'm still not sure what you really want here (not in terms of the change, but rather the broader context of what you think you will be saved - all it takes for your use AFAIUI is a type switch to handle the situation as it exists). Using IDs as an accessory for getting Chunks makes every other use more onerous for tabix. The change here makes tabix use easier because the user does not need to make a shim if their type does not implement tabix.Record. |
I wanted the tabix.Chunks and csi.Chunks to have the same sig. Since that's not going to happen, and |
Understood, but I would like to understand the broader context. Without know what you are doing it not really possible to make an informed decision. |
my specific thought was in bix: https://github.com/brentp/bix/blob/master/bix.go#L46 instead of having:
I could have:
Then, where I call Chunks (https://github.com/brentp/bix/blob/master/bix.go#L199) could remain mostly unchanged between csi and tabix. |
That gets me closer to understanding, and from here I can see that you could wrap the tabix.Index in a struct to translate the ref name to a RID. However, I'm still not sure of the broader context how the Bix type would be used. I don't understand why CSI and tabix are being considered as part that fit into the same use; CSI is intended to be a component of a broader system (like BAI in BAM where BAM provides the ref names), while tabix is a more complete solution. |
Hmm, well htslib can use csi for VCF files. I was thinking that would make a nice way to allow the user to control the block size of the index since a big cost is decompressing a full tabix block to grab a single record. I'm not sure what extra magic it's doing to use csi with text files--I assume it's grabbing the |
Please take a look at how htslib handles VCF with CSI (I don't know). That may lead the way here. |
Looks like, in htslib, they are storing the seqnames in the meta/auxilary of CSI: https://github.com/samtools/htslib/blob/develop/tbx.c#L306
then it gets the length of the remaining bytes in the next 4. And then parsing the seq names add adding them to the dict at line 306. I verified that code gets executed when loading a CSI index for a VCF. Just like the spec says, ahem. |
I have words. They are not nice words. Interaction with those "specs" is a constant source of pain. |
In the absence of any clarity in the form of a specification for the behaviour here, I'm going to make the change proposed here. For your problem, I would suggest that you then make a type to wrap |
API-breaking change.
@brentp Please take a look.
Closes #44.