Support Tabix #125

romanzenka · 2019-01-22T16:47:39Z

This is similar to issue #107.

We would like an interface where one can specify chromosome + region, and obtains only the parts of the VCF that correspond to this query. This would most likely mean supporting tabix indexed vcf files. Reading the whole VCF is simply not going to happen.

The current skip and nrows parameters are not useful to us. To know those values, we would first have to index the vcf somehow - exactly the job for Tabix.

The text was updated successfully, but these errors were encountered:

knausb · 2019-01-22T22:07:40Z

Hello, I'd say this is relevant to #124 as well. Here is the documentation for tabix. I believe that bgzip -i creates the same index that tabix does. Could you validate that for me?

romanzenka · 2019-01-22T22:27:55Z

@knausb I do not think bgzip -i creates the same index. This is what source code of bgzip says:

fprintf(fp, "   -i, --index                compress and create BGZF index\n");
fprintf(fp, "   -I, --index-name FILE      name of BGZF index file [file.gz.gzi]\n");

The thing is, I do not see any input parameters for bgzip that would let you specify chromosome and position columns to base the index on. So I believe the BGZF index file must be something else - probably a map of all block offsets and their starts in the original file. In other words - this would let you seek if you know byte offsets to read from, but it won't let you make a query "chromosome 2, positions 10,000-20,000" like Tabix does. Would need to dig further to confirm.

knausb · 2019-01-22T22:29:21Z

The index created by bgzip is *.gzi while tabix creates *.tbi and the tabix index is a larger file. So they are not the same.

romanzenka · 2019-01-22T22:29:55Z

Yeah. Well, we need the tabix index - how to go from genomic coordinates to a chunk of data.

knausb · 2019-01-23T00:06:39Z

Thanks for the clarification in the indexes. I think you're correct. I found Rhtslib which I do not know how to use. But it looks like it may be a solution.

romanzenka · 2019-01-23T15:11:30Z

Ah, I just contributed a pull request to samtools/htsjdk#1248 which is a Java version of htslib. I do not have experience with the C version, but if it is anything like the java one, that's definitely a way to go.

Except I looked at Rhtslib and it is literally nothing but a wrapper around htslib. It has no interface, I believe you have to write C code to link to it. So it's not super useful for a casual R developer.

knausb · 2019-01-23T23:16:46Z

Thanks for the validation, that helps! I see Rhtslib at 1.7.0 while htslib is at 1.9.0. Does that matter? According to the documentation it sounds like they would be amenable to to incrementing Rhtslib.

"not super useful for a casual R developer" yeah, it looks like there's going to be a learning curve. But I think its the the way to go. File IO is a bottle neck and one of the things that inspired me to create vcfR. vcfR currently uses Rcpp to get that performance. And I want to believe all of these libraries will play nicely together. But it might take me a bit to figure it all out.

romanzenka · 2019-01-24T17:31:04Z

@knausb So my understanding is that you wrote vcfR using C code you made from scratch. I do not see anything wrong with your approach - htslib is however more of a building block for many kinds of genomic I/O beyond VCF. It might be wise to join forces with them, vcfR becoming a comfy wrapper around htslib, inheriting all the new fetures - that said, I do not know enough about their C code quality at the moment. It is quite possible your code is better, because it is more specialized and more oriented towards the R integration. That's something that would have to be benchmarked and figured out.

RichardJActon · 2020-06-30T18:52:46Z

This package might be useful for interfacing with tabix: https://github.com/zhanxw/seqminer

knausb mentioned this issue Aug 15, 2019

Support for bcf files #143

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Tabix #125

Support Tabix #125

romanzenka commented Jan 22, 2019

knausb commented Jan 22, 2019

romanzenka commented Jan 22, 2019 •

edited

Loading

knausb commented Jan 22, 2019

romanzenka commented Jan 22, 2019

knausb commented Jan 23, 2019

romanzenka commented Jan 23, 2019 •

edited

Loading

knausb commented Jan 23, 2019

romanzenka commented Jan 24, 2019 •

edited

Loading

RichardJActon commented Jun 30, 2020

Support Tabix #125

Support Tabix #125

Comments

romanzenka commented Jan 22, 2019

knausb commented Jan 22, 2019

romanzenka commented Jan 22, 2019 • edited Loading

knausb commented Jan 22, 2019

romanzenka commented Jan 22, 2019

knausb commented Jan 23, 2019

romanzenka commented Jan 23, 2019 • edited Loading

knausb commented Jan 23, 2019

romanzenka commented Jan 24, 2019 • edited Loading

RichardJActon commented Jun 30, 2020

romanzenka commented Jan 22, 2019 •

edited

Loading

romanzenka commented Jan 23, 2019 •

edited

Loading

romanzenka commented Jan 24, 2019 •

edited

Loading