Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tabix: change Chunks signature to resemble csi and bam #46

Merged
merged 1 commit into from
Nov 7, 2016

Conversation

kortschak
Copy link
Member

API-breaking change.

@brentp Please take a look.

Closes #44.

@brentp
Copy link
Contributor

brentp commented Nov 5, 2016

Did you mean instead to have sig Chunks(rid int, beg, end int) to match CSI and Bam.

@kortschak
Copy link
Member Author

No. There is no way to have them match that doesn't significantly degrade the API.

CSI and BAM are binary protocols that do not retain a string representation of the reference name in the index while tabix is a textish protocol that does. Removing the text representation from tabix and adding an int <-> string mapping API seems like to lose to me. You would need to have a RID(ref string) int method on the index or expose the slice of reference names to the user.

Fundamentally, while they use the same broad indexing approach, they are not the same.

@brentp
Copy link
Contributor

brentp commented Nov 6, 2016

I don't feel strongly either way, but I don't see the benefit of an API change if it doesn't unify with others signatures.
func (i *Index) IDs() map[string]int is already available in the public tabix api so it's trivial to go from name => rid.

@kortschak
Copy link
Member Author

We would still have bam indexing being different. I guess I'm still not sure what you really want here (not in terms of the change, but rather the broader context of what you think you will be saved - all it takes for your use AFAIUI is a type switch to handle the situation as it exists).

Using IDs as an accessory for getting Chunks makes every other use more onerous for tabix. The change here makes tabix use easier because the user does not need to make a shim if their type does not implement tabix.Record.

@brentp
Copy link
Contributor

brentp commented Nov 6, 2016

I wanted the tabix.Chunks and csi.Chunks to have the same sig. Since that's not going to happen, and
yes, I can type-switch and yes this is an improvement to tabix.Chunks().

@kortschak
Copy link
Member Author

I wanted the tabix.Chunks and csi.Chunks to have the same sig.

Understood, but I would like to understand the broader context. Without know what you are doing it not really possible to make an informed decision.

@brentp
Copy link
Contributor

brentp commented Nov 6, 2016

my specific thought was in bix: https://github.com/brentp/bix/blob/master/bix.go#L46

instead of having:

// Bix provides read access to tabix files.
type Bix struct {
    *tabix.Index
    bgzf *bgzf.Reader
       ...
}

I could have:

type Indexed interface {
    Chunks(???) []bgzf.Chunk
}

type Bix struct {
    Indexed
    bgzf *bgzf.Reader
    ...
}

Then, where I call Chunks (https://github.com/brentp/bix/blob/master/bix.go#L199) could remain mostly unchanged between csi and tabix.

@kortschak
Copy link
Member Author

That gets me closer to understanding, and from here I can see that you could wrap the tabix.Index in a struct to translate the ref name to a RID. However, I'm still not sure of the broader context how the Bix type would be used. I don't understand why CSI and tabix are being considered as part that fit into the same use; CSI is intended to be a component of a broader system (like BAI in BAM where BAM provides the ref names), while tabix is a more complete solution.

@brentp
Copy link
Contributor

brentp commented Nov 6, 2016

Hmm, well htslib can use csi for VCF files. I was thinking that would make a nice way to allow the user to control the block size of the index since a big cost is decompressing a full tabix block to grab a single record.

I'm not sure what extra magic it's doing to use csi with text files--I assume it's grabbing the ##contig out of the VCF header.

@kortschak
Copy link
Member Author

Please take a look at how htslib handles VCF with CSI (I don't know). That may lead the way here.

@brentp
Copy link
Contributor

brentp commented Nov 7, 2016

Looks like, in htslib, they are storing the seqnames in the meta/auxilary of CSI: https://github.com/samtools/htslib/blob/develop/tbx.c#L306
I think it's reading tbx_conf_t as the first 24 bytes which contains

typedef struct {
    int32_t preset;
    int32_t sc, bc, ec; // seq col., beg col. and end col.
    int32_t meta_char, line_skip;
} tbx_conf_t;

then it gets the length of the remaining bytes in the next 4. And then parsing the seq names add adding them to the dict at line 306. I verified that code gets executed when loading a CSI index for a VCF.

Just like the spec says, ahem.

@kortschak
Copy link
Member Author

Just like the spec says, ahem.

I have words. They are not nice words. Interaction with those "specs" is a constant source of pain.

@kortschak
Copy link
Member Author

In the absence of any clarity in the form of a specification for the behaviour here, I'm going to make the change proposed here.

For your problem, I would suggest that you then make a type to wrap csi.Index which implements Chunks(string, int, int) []bgzf.Chunk by mapping the strings in the aux to an int, as defined in the specification "in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.'" This way you will have a CSI and tabix with the same behaviour that you can embed.

@kortschak kortschak merged commit 1ab154f into master Nov 7, 2016
@kortschak kortschak deleted the tabix_chunks branch November 7, 2016 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants