tabix query missing bytes. #10
That is an unfortunately large file. |
Shorter version via
It looks like the first chunk finishes and then the second continues. However, the VCF line has not been completed. |
@brentp If you could construct a smaller, non-vcf case, that would be very helpful. |
@kortschak I don't have a grasp on how to do that. What would I try? |
It's difficult. Can you tell me though what created the tabix index?
|
Recent htslib cli tool
|
Can you explain the reason for including two read throughs? Is that just to demonstrate two different failure modes? |
I just wanted to make sure it wasn't something about the buffered reader. |
Some improvement. With the following reproducer (on the complete file) I can see that the first record is being returned, but the second is corrupted:
An additional point to note: if only the first chunk is given to NewChunkReader, the 229275 record is correctly returned, and if only the second chunk is given, the 229379 and 229380 records are correctly returned. So it looks like the jump from the first chunk to the second chunk is not properly clearing the data block. |
I have an answer, but if you are able to take a look at the logic in index.Read that would be helpful. The issue AFAICS is that the block end is being ignored, so the bytes PR coming. |
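To make the "block end is being ignored" idea concrete, here is a toy model in plain Go. It is not the biogo/hts code and does not use real BGZF data: the `offset`, `chunk`, and `readChunks` names are hypothetical, blocks are plain byte slices, and an offset is just a (block, position) pair. It only sketches how a reader that ignores a chunk's end offset returns different bytes from one that honors it; the exact corruption in the report above may take a different form.

```go
package main

import (
	"bytes"
	"fmt"
)

// offset is a simplified stand-in for a BGZF virtual offset: the index of a
// decompressed block plus a byte position inside that block.
type offset struct {
	block int
	pos   int
}

// chunk is a half-open interval [begin, end) of offsets, as an index lookup
// would return for a query.
type chunk struct {
	begin, end offset
}

// readChunks returns the bytes covered by chunks. With honorEnd set to false
// it models the suspected bug: every block is read to its end instead of
// stopping at the chunk's end offset.
func readChunks(blocks [][]byte, chunks []chunk, honorEnd bool) []byte {
	var out bytes.Buffer
	for _, c := range chunks {
		for b := c.begin.block; b <= c.end.block && b < len(blocks); b++ {
			lo, hi := 0, len(blocks[b])
			if b == c.begin.block {
				lo = c.begin.pos
			}
			if honorEnd && b == c.end.block {
				hi = c.end.pos
			}
			out.Write(blocks[b][lo:hi])
		}
	}
	return out.Bytes()
}

func main() {
	// One decompressed block holding three records; the query wants the
	// first and the third.
	blocks := [][]byte{[]byte("rec1\nrec2\nrec3\n")}
	chunks := []chunk{
		{offset{0, 0}, offset{0, 5}},
		{offset{0, 10}, offset{0, 15}},
	}
	fmt.Printf("%q\n", readChunks(blocks, chunks, true))  // chunk ends honored
	fmt.Printf("%q\n", readChunks(blocks, chunks, false)) // chunk ends ignored
}
```

With the end offsets honored, only the two requested records come back. With them ignored, the first chunk runs on to the end of the block, so the output no longer matches the query.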
This is what is happening (notes with a view to engineering a test case):
This is what should happen:
|
Here's another oddity that might help in debugging. If I use the ExAC file posted above and do a huge query for the chromosome, I get only a small amount of data returned:

package main

import (
	"compress/gzip"
	"io/ioutil"
	"log"
	"os"

	"github.com/biogo/hts/bgzf"
	"github.com/biogo/hts/bgzf/index"
	"github.com/biogo/hts/tabix"
)

func check(err error) {
	if err != nil {
		panic(err)
	}
}

// location implements the record interface expected by idx.Chunks.
type location struct {
	chrom string
	start int
	end   int
}

func (s location) RefName() string {
	return s.chrom
}

func (s location) Start() int {
	return s.start
}

func (s location) End() int {
	return s.end
}

func main() {
	path := os.Args[1]

	// Read the tabix index, which is itself gzip-compressed.
	fh, err := os.Open(path + ".tbi")
	check(err)
	defer fh.Close()
	gz, err := gzip.NewReader(fh)
	check(err)
	defer gz.Close()
	idx, err := tabix.ReadFrom(gz)
	check(err)

	// Open the bgzipped data file.
	b, err := os.Open(path)
	check(err)
	defer b.Close()
	bgz, err := bgzf.NewReader(b, 2)
	check(err)

	// Query (almost) all of chromosome 1 and count the bytes returned.
	chunks, err := idx.Chunks(location{"1", 1, 99999999999})
	check(err)
	cr, err := index.NewChunkReader(bgz, chunks)
	check(err)
	buf, err := ioutil.ReadAll(cr)
	check(err)
	log.Println(len(buf))
}
|
Apologies for the case, but it is reproducible. We are using a normalized version of ExAC for annotation, and I ran through and intersected every variant with itself to make sure each one gets a hit. There is a single failure.
The file is here: http://s3.amazonaws.com/gemini-annotations/ExAC.r0.3.sites.vep.tidy.vcf.gz (and .tbi)
and the code is below. Run with:
It is querying for the location:
location{"X", 229379 - 2, 229382}
The output includes the variant of interest (X:229379), but it is appended to the end of another line (X:229380).
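One way to spot this kind of fused record mechanically is to count tab-separated fields per line of the query output. The sketch below is only an illustration: `badLines` is a hypothetical helper, the sample records are made up rather than taken from the ExAC file, and it assumes a sites-only VCF with 8 columns per record.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// badLines returns the 1-based numbers of the lines in text whose
// tab-separated field count differs from want. Empty lines and header
// lines starting with '#' are skipped.
func badLines(text string, want int) []int {
	var bad []int
	sc := bufio.NewScanner(strings.NewReader(text))
	// Allow long lines; VCF INFO fields can be large.
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	n := 0
	for sc.Scan() {
		n++
		line := sc.Text()
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.Count(line, "\t")+1 != want {
			bad = append(bad, n)
		}
	}
	return bad
}

func main() {
	// Made-up records: the second line has another record fused onto its end.
	out := "X\t100\t.\tA\tG\t.\tPASS\tAC=1\n" +
		"X\t101\t.\tC\tT\t.\tPASS\tAC=2X\t102\t.\tG\tA\t.\tPASS\tAC=3\n"
	fmt.Println(badLines(out, 8))
}
```

A line that has swallowed a second record ends up with roughly twice the expected number of fields, so it is flagged even when both records are individually valid.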
Here is the output from the htslib tabix: