Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tabix query missing bytes. #10

Closed
brentp opened this issue Sep 11, 2015 · 12 comments
Closed

tabix query missing bytes. #10

brentp opened this issue Sep 11, 2015 · 12 comments

Comments

@brentp
Copy link
Contributor

brentp commented Sep 11, 2015

apologies for the case, but it is reproducible. We are using a normalized version of ExAC for annotation and I ran trhough and intersected every variant with itself to make sure to get a hit. There is a single failure.

The file is here: http://s3.amazonaws.com/gemini-annotations/ExAC.r0.3.sites.vep.tidy.vcf.gz (and .tbi)

and the code is below. Run with:

go run main.go ExAC.r0.3.sites.vep.tidy.vcf.gz

It is querying for the location: location{"X", 229379 - 2, 229382}
And the output includes the variant of interest (X:229379), but it is appended to the end another line (X:229380)

Here is the output from the htslib tabx:

$ tabix  /usr/local/src/gemini_install/data/gemini/data/ExAC.r0.3.sites.vep.tidy.vcf.gz X:229378-229380 | cut -f 1-5
X   229275  .   CACACGGCGACCATGGGAACCCCCTCTCCTGGGCACGTGCTCACCGCAGCTGTCGTACGGCACCACTGAGACGACAGGGACCCCCTGCCCTCCCCCGGGCGAGTCCTCACCGGTG C
X   229379  .   CCTCACCGGTGACACGGAGACCGCGGAAGGCCCCTCCCCTGGGCGCGTG   C
X   229380  .   C   G
package main

import (
    "bufio"
    "compress/gzip"
    "fmt"
    "io"
    "io/ioutil"
    "os"
    "strings"

    "github.com/biogo/hts/bgzf"
    "github.com/biogo/hts/bgzf/index"
    "github.com/biogo/hts/tabix"
)

func check(err error) {
    if err != nil {
        panic(err)
    }
}

type location struct {
    chrom string
    start int
    end   int
}

func (s location) RefName() string {
    return s.chrom
}
func (s location) Start() int {
    return s.start
}
func (s location) End() int {
    return s.end
}

func main() {

    path := os.Args[1]

    fh, err := os.Open(path + ".tbi")
    check(err)

    gz, err := gzip.NewReader(fh)
    check(err)
    defer gz.Close()

    idx, err := tabix.ReadTabix(gz)
    check(err)

    b, err := os.Open(path)
    check(err)
    bgz, err := bgzf.NewReader(b, 1)
    //  bgz.Blocked = false

    check(err)

    chunks, err := idx.Chunks(location{"X", 229379 - 1, 229380})
    check(err)

    cr, err := index.NewChunkReader(bgz, chunks)
    check(err)
    br := bufio.NewReaderSize(cr, 32768/8)
    var j int
    for {
        line, err := br.ReadString('\n')
        if err == io.EOF {
            break
        }
        if strings.Contains(line, "\t229379\t") {
            fmt.Println(line + "\n")
        }

        j += 1
        _ = line
    }

    cr, err = index.NewChunkReader(bgz, chunks)
    buf, _ := ioutil.ReadAll(cr)

    for _, line := range strings.Split(string(buf), "\n") {
        if strings.Contains(line, "\t229379\t") {
            fmt.Println("read:" + line + "\n")
        }
    }
}
@kortschak
Copy link
Member

That is an unfortunately large file.

@kortschak
Copy link
Member

Shorter version via tail --bytes=+3161759710 ExAC.r0.3.sites.vep.tidy.vcf.gz > ExAC.r0.3.sites.vep.tidy-short.vcf.gz and use the following code:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"

    "github.com/biogo/hts/bgzf"
    "github.com/biogo/hts/bgzf/index"
)

func check(err error) {
    if err != nil {
        panic(err)
    }
}

type location struct {
    chrom string
    start int
    end   int
}

func (s location) RefName() string {
    return s.chrom
}
func (s location) Start() int {
    return s.start
}
func (s location) End() int {
    return s.end
}

func main() {

    path := os.Args[1]

    b, err := os.Open(path)
    check(err)
    bgz, err := bgzf.NewReader(b, 1)
    check(err)

    chunks := []bgzf.Chunk{
        {Begin: bgzf.Offset{File: 0, Block: 64257}, End: bgzf.Offset{File: 9385, Block: 505}},
        {Begin: bgzf.Offset{File: 19801, Block: 6378}, End: bgzf.Offset{File: 68325, Block: 64953}},
    }

    cr, err := index.NewChunkReader(bgz, chunks)
    check(err)
    sc := bufio.NewScanner(cr)
    for sc.Scan() {
        line := sc.Text()
        if strings.Contains(line, "\t229379\t") {
            fmt.Println(line)
            break
        }
    }
}

It looks like the first chunk is finished and then the second continues. However, the vcf line has not completed.

@kortschak
Copy link
Member

@brentp If you could construct a smaller, non-vcf case, that would be very helpful.

@brentp
Copy link
Contributor Author

brentp commented Sep 14, 2015

@kortschak I don't have a grasp on how to do that. What would I try?

@kortschak
Copy link
Member

kortschak commented Sep 14, 2015 via email

@brentp
Copy link
Contributor Author

brentp commented Sep 14, 2015

Recent htslib cli tool
On Sep 14, 2015 3:18 PM, "Dan Kortschak" [email protected] wrote:

It's difficult. Can you tell me though what created the tabix index?


Reply to this email directly or view it on GitHub
#10 (comment).

@kortschak
Copy link
Member

Can you explain the reason for including two read throughs? Is that just to demonstrate two different failure modes?

@brentp
Copy link
Contributor Author

brentp commented Sep 14, 2015

I just wanted to make sure it wasn't something about the buffered reader.

@kortschak
Copy link
Member

Some improvement. With the following reproducer (on the complete file) I can see that the first record is being returned, but the second is corrupted:

package main

import (
    "bufio"
    "compress/gzip"
    "fmt"
    "os"
    "strconv"
    "strings"

    "github.com/biogo/hts/bgzf"
    "github.com/biogo/hts/bgzf/index"
    "github.com/biogo/hts/tabix"
)

func check(err error) {
    if err != nil {
        panic(err)
    }
}

type location struct {
    chrom string
    start int
    end   int
}

func (s location) RefName() string {
    return s.chrom
}
func (s location) Start() int {
    return s.start
}
func (s location) End() int {
    return s.end
}

func main() {

    path := os.Args[1]

    fh, err := os.Open(path + ".tbi")
    check(err)

    gz, err := gzip.NewReader(fh)
    check(err)
    defer gz.Close()

    idx, err := tabix.ReadFrom(gz)
    check(err)

    b, err := os.Open(path)
    check(err)
    bgz, err := bgzf.NewReader(b, 1)
    check(err)

    loc := location{"X", 229379 - 1, 229380}
    chunks, err := idx.Chunks(loc)
    check(err)

    cr, err := index.NewChunkReader(bgz, chunks)
    check(err)
    sc := bufio.NewScanner(cr)
    for sc.Scan() {
        line := sc.Text()
        fields := strings.Fields(line)
        start, err := strconv.Atoi(fields[1])
        check(err)
        start--

        end := start + len(fields[3])
        if start > loc.End() || end < loc.Start() {
            if strings.Contains(line, "\t229379\t") {
                fmt.Printf("corrupted: %q\n", line)
            }
            continue
        }

        fmt.Println(line)
    }
}

An additional point to note is that if the first chunk is given to NewChunkReader, then the 229275 is correctly returned and if the second chunk is returned, then the 229379 and 229380 records are correctly returned. So it looks like the jump from the first chunk to the second chunk is not properly clearing the data block.

@kortschak
Copy link
Member

I have an answer, but if you are able to take a look at the logic in index.Read that would be helpful.

The issue AFAICS is that the block end is being ignored, so the bytes X 229280 rs1867... are being included in the read for the first chunk.

PR coming.

@kortschak
Copy link
Member

This is what is happening (notes with a view to engineering a test case):

read r.r.LastChunk()={Begin:{File:3161759709 Block:64257} End:{File:3161759709 Block:64257}} r.chunks[0]={Begin:{File:3161759709 Block:64257} End:{File:3161769094 Block:505}} n=1279: """
X   229275  .   CACACGGCGACCATGGGAACCCCCTCTCCTGGGCACGTGCTCACCGCAGCTGTCGTACGGCACCACTGAGACGACAGGGACCCCCTGCCCTCCCCCGGGCGAGTCCTCACCGGTG C   650.91  PASS    AC=8;AC_AFR=0;AC_AMR=1;AC_Adj=4;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=4;AC_Hom=0;AC_NFE=2;AC_OTH=0;AC_SAS=1;AF=7.127e-05;AN=112254;AN_AFR=920;AN_AMR=734;AN_Adj=17662;AN_EAS=732;AN_FIN=122;AN_NFE=6506;AN_OTH=164;AN_SAS=8484;BaseQRankSum=0.804;ClippingRankSum=-0.278;DP=466234;FS=0;GQ_MEAN=18.12;GQ_STDDEV=17.94;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=0;Het_AMR=1;Het_EAS=0;Het_FIN=0;Het_NFE=2;Het_OTH=0;Het_SAS=1;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.0494;MQ=50.65;MQ0=0;MQRankSum=-1.426;NCC=9984;QD=0.1;ReadPosRankSum=0.358;VQSLOD=2.17;culprit=InbreedingCoeff;DP_HIST=27012|20162|2782|296|774|1019|880|755|794|606|420|268|171|84|49|22|13|8|3|9,2|2|0|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST=1480|23106|4214|4892|12929|1465|1391|522|105|141|42|11|3289|1003|411|325|218|145|108|330,0|0|0|1|1|1|0|0|0|0|0|0|1|0|0|0|1|0|0|3;CSQ=-|ENSG00000178605|ENST00000400701|Transcript|intron_variant&feature_truncation|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+44_227+157delCACCG
"""
read r.r.LastChunk()={Begin:{File:3161759709 Block:64257} End:{File:3161759709 Block:0}} r.chunks[0]={Begin:{File:3161759709 Block:64257} End:{File:3161769094 Block:505}} n=2817: """
GTGAGGACTCGCCCGGGGGAGGGCAGGGGGTCCCTGTCGTCTCAGTGGTGCCGTACGACAGCTGCGGTGAGCACGTGCCCAGGAGAGGGGGTTCCCATGGTCGCCGTGT|||||||||||||||||||,-|ENSG00000178605|ENST00000326153|Transcript|intron_variant&feature_truncation|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|YES||ENSP00000316598|GTPB6_HUMAN|Q8N2Q6_HUMAN&H0Y2S1_HUMAN|UPI0000246E6A||||2/6||ENST00000326153.4:c.229+44_229+157delCACCGGTGAGGACTCGCCCGGGGGAGGGCAGGGGGTCCCTGTCGTCTCAGTGGTGCCGTACGACAGCTGCGGTGAGCACGTGCCCAGGAGAGGGGGTTCCCATGGTCGCCGTGT|||||||||||||||||||
X   229279  .   C   T   1127.46 AC_Adj0_Filter  AC=1;AC_AFR=0;AC_AMR=0;AC_Adj=0;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=0;AC_Hom=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AF=8.823e-06;AN=113342;AN_AFR=962;AN_AMR=744;AN_Adj=17930;AN_EAS=750;AN_FIN=124;AN_NFE=6686;AN_OTH=164;AN_SAS=8500;BaseQRankSum=-1.537;ClippingRankSum=-0.55;DP=475956;FS=0.723;GQ_MEAN=18.39;GQ_STDDEV=18.19;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=0;Het_AMR=0;Het_EAS=0;Het_FIN=0;Het_NFE=0;Het_OTH=0;Het_SAS=0;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.0484;MQ=52.89;MQ0=0;MQRankSum=0.012;NCC=9119;NEGATIVE_TRAIN_SITE;QD=9.64;ReadPosRankSum=0;VQSLOD=-0.6443;culprit=InbreedingCoeff;DP_HIST=26353|21168|2915|294|835|1026|880|756|794|606|419|268|171|83|49|21|13|8|3|9,0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST=1429|22796|3977|4268|14540|1504|1440|541|106|132|45|15|3335|1005|413|327|217|145|108|328,0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1;CSQ=T|ENSG00000178605|ENST00000400701|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+154G>A|||||||||||||||||||,T|ENSG00000178605|ENST00000326153|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|YES||ENSP00000316598|GTPB6_HUMAN|Q8N2Q6_HUMAN&H0Y2S1_HUMAN|UPI0000246E6A||||2/6||ENST00000326153.4:c.229+154G>A|||||||||||||||||||
X   229280  rs186707630 G   A   7147.74 PASS    AC=27;AC_AFR=8;AC_AMR=0;AC_Adj=8;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=8;AC_Hom=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AF=0.0002375;AN=113676;AN_AFR=978;AN_AMR=760;AN_Adj=18132;AN_EAS=768;AN_FIN=130;AN_NFE=6806;AN_OTH=164;AN_SAS=8526;BaseQRankSum=1.54;ClippingRankSum=-0.067;DB;DP=479781;FS=4.366;GQ_MEAN=18.54;GQ_STDDEV=19.22;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=8;Het_AMR=0;Het_EAS=0;Het_FIN=0;Het_NFE=0;Het_OTH=0;Het_SAS=0;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.0475;MQ=56.62;MQ0=0;MQRankSum=0.322;NCC=8869;POSITIVE_TRAIN_SITE;QD=13.56;ReadPosRankSum=0.406;VQSLOD=0.811;culprit=InbreedingCoeff;DP_HIST=26078|21519|2972|302|851|1037|880|755|793|606|419|268|171|83|49|21|13|8|3|10,2|16|4|1|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1;GQ_HIST=1481|22552|3868|4418|14738|1515|14
"""
*change chunk*
read r.r.LastChunk()={Begin:{File:3161779510 Block:6378} End:{File:3161779510 Block:6378}} r.chunks[0]={Begin:{File:3161779510 Block:6378} End:{File:3161828034 Block:64953}} n=3208: """
X   229379  .   CCTCACCGGTGACACGGAGACCGCGGAAGGCCCCTCCCCTGGGCGCGTG   C   479.56  VQSRTrancheINDEL99.50to99.90    AC=1;AC_AFR=0;AC_AMR=0;AC_Adj=1;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=1;AC_Hom=0;AC_NFE=1;AC_OTH=0;AC_SAS=0;AF=8.361e-06;AN=119604;AN_AFR=6726;AN_AMR=6468;AN_Adj=87454;AN_EAS=6128;AN_FIN=3624;AN_NFE=50960;AN_OTH=612;AN_SAS=12936;BaseQRankSum=0.471;ClippingRankSum=0.527;DP=1340406;FS=5.166;GQ_MEAN=39.27;GQ_STDDEV=23.03;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=0;Het_AMR=0;Het_EAS=0;Het_FIN=0;Het_NFE=1;Het_OTH=0;Het_SAS=0;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.0084;MQ=52.13;MQ0=0;MQRankSum=-0.365;NCC=3529;QD=0.53;ReadPosRankSum=1.88;VQSLOD=-1.597;culprit=QD;DP_HIST=884|8073|7754|6713|16067|7869|4989|2527|1681|1063|727|519|350|219|163|75|55|28|10|36,0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST=394|4609|1672|2686|14277|3462|3699|2515|1058|1502|1184|486|15392|2869|1032|1090|696|276|289|614,0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1;CSQ=-|ENSG00000178605|ENST00000400701|Transcript|splice_region_variant&intron_variant&feature_truncation|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+6_227+53delCACGCGCCCAGGGGAGGGGCCTTCCGCGGTCTCCGTGTCACCGGTGAG|||||||||||||||||||,-|ENSG00000178605|ENST00000326153|Transcript|splice_region_variant&intron_variant&feature_truncation|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|YES||ENSP00000316598|GTPB6_HUMAN|Q8N2Q6_HUMAN&H0Y2S1_HUMAN|UPI0000246E6A||||2/6||ENST00000326153.4:c.229+6_229+53delCACGCGCCCAGGGGAGGGGCCTTCCGCGGTCTCCGTGTCACCGGTGAG|||||||||||||||||||
X   229380  .   C   G   631.08  PASS    AC=1;AC_AFR=0;AC_AMR=0;AC_Adj=1;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=1;AC_Hom=0;AC_NFE=0;AC_OTH=0;AC_SAS=1;AF=8.417e-06;AN=118806;AN_AFR=4260;AN_AMR=2896;AN_Adj=57006;AN_EAS=3510;AN_FIN=2248;AN_NFE=34976;AN_OTH=438;AN_SAS=8678;BaseQRankSum=-0.736;ClippingRankSum=0.95;DP=1413905;FS=2.911;GQ_MEAN=27.23;GQ_STDDEV=26.08;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=0;Het_AMR=0;Het_EAS=0;Het_FIN=0;Het_NFE=0;Het_OTH=0;Het_SAS=1;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.0882;MQ=57.58;MQ0=0;MQRankSum=0.363;NCC=4261;NEGATIVE_TRAIN_SITE;QD=15.78;ReadPosRankSum=0.736;VQSLOD=-0.3403;culprit=MQ;DP_HIST=734|6916|6503|6598|16298|8487|5063|2966|1902|1225|868|588|438|266|201|108|72|48|31|91,0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST=18209|4193|1647|1163|9918|2377|1857|1135|631|570|476|270|13229|1794|615|498|288|141|114|278,0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1;CSQ=G|ENSG00000178605|ENST00000400701|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+53G>C|||||||||||||||||||,G|ENSG00000178605|ENST00000326153|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|YES||ENSP00000316598|GTPB6_HUMAN|Q8N2Q6_HUMAN&H0Y2S1_HUMAN|UPI0000246E6A||||2/6||ENST00000326153.4:c.229+53G>C|||||||||||||||||||
X   229385  .   C   T   30700.6 PASS    AC=40;AC_AFR=1;AC_AMR=3;AC_Adj=40;AC_EAS=7;AC_FIN=0;AC_Hemi=0;AC_Het=40;AC_Hom=0;AC_NFE=28;AC_
"""

This is what should happen:

read r.r.LastChunk()={Begin:{File:3161759709 Block:64257} End:{File:3161759709 Block:64257}} r.chunks[0]={Begin:{File:3161759709 Block:64257} End:{File:3161769094 Block:505}} n=1279: """
X   229275  .   CACACGGCGACCATGGGAACCCCCTCTCCTGGGCACGTGCTCACCGCAGCTGTCGTACGGCACCACTGAGACGACAGGGACCCCCTGCCCTCCCCCGGGCGAGTCCTCACCGGTG C   650.91  PASS    AC=8;AC_AFR=0;AC_AMR=1;AC_Adj=4;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=4;AC_Hom=0;AC_NFE=2;AC_OTH=0;AC_SAS=1;AF=7.127e-05;AN=112254;AN_AFR=920;AN_AMR=734;AN_Adj=17662;AN_EAS=732;AN_FIN=122;AN_NFE=6506;AN_OTH=164;AN_SAS=8484;BaseQRankSum=0.804;ClippingRankSum=-0.278;DP=466234;FS=0;GQ_MEAN=18.12;GQ_STDDEV=17.94;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=0;Het_AMR=1;Het_EAS=0;Het_FIN=0;Het_NFE=2;Het_OTH=0;Het_SAS=1;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.0494;MQ=50.65;MQ0=0;MQRankSum=-1.426;NCC=9984;QD=0.1;ReadPosRankSum=0.358;VQSLOD=2.17;culprit=InbreedingCoeff;DP_HIST=27012|20162|2782|296|774|1019|880|755|794|606|420|268|171|84|49|22|13|8|3|9,2|2|0|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST=1480|23106|4214|4892|12929|1465|1391|522|105|141|42|11|3289|1003|411|325|218|145|108|330,0|0|0|1|1|1|0|0|0|0|0|0|1|0|0|0|1|0|0|3;CSQ=-|ENSG00000178605|ENST00000400701|Transcript|intron_variant&feature_truncation|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+44_227+157delCACCG
"""
read r.r.LastChunk()={Begin:{File:3161759709 Block:64257} End:{File:3161759709 Block:0}} r.chunks[0]={Begin:{File:3161759709 Block:64257} End:{File:3161769094 Block:505}} n=505: """
GTGAGGACTCGCCCGGGGGAGGGCAGGGGGTCCCTGTCGTCTCAGTGGTGCCGTACGACAGCTGCGGTGAGCACGTGCCCAGGAGAGGGGGTTCCCATGGTCGCCGTGT|||||||||||||||||||,-|ENSG00000178605|ENST00000326153|Transcript|intron_variant&feature_truncation|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|YES||ENSP00000316598|GTPB6_HUMAN|Q8N2Q6_HUMAN&H0Y2S1_HUMAN|UPI0000246E6A||||2/6||ENST00000326153.4:c.229+44_229+157delCACCGGTGAGGACTCGCCCGGGGGAGGGCAGGGGGTCCCTGTCGTCTCAGTGGTGCCGTACGACAGCTGCGGTGAGCACGTGCCCAGGAGAGGGGGTTCCCATGGTCGCCGTGT|||||||||||||||||||

"""
*change chunk*
read r.r.LastChunk()={Begin:{File:3161779510 Block:6378} End:{File:3161779510 Block:6378}} r.chunks[0]={Begin:{File:3161779510 Block:6378} End:{File:3161828034 Block:64953}} n=2312: """
X   229379  .   CCTCACCGGTGACACGGAGACCGCGGAAGGCCCCTCCCCTGGGCGCGTG   C   479.56  VQSRTrancheINDEL99.50to99.90    AC=1;AC_AFR=0;AC_AMR=0;AC_Adj=1;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=1;AC_Hom=0;AC_NFE=1;AC_OTH=0;AC_SAS=0;AF=8.361e-06;AN=119604;AN_AFR=6726;AN_AMR=6468;AN_Adj=87454;AN_EAS=6128;AN_FIN=3624;AN_NFE=50960;AN_OTH=612;AN_SAS=12936;BaseQRankSum=0.471;ClippingRankSum=0.527;DP=1340406;FS=5.166;GQ_MEAN=39.27;GQ_STDDEV=23.03;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=0;Het_AMR=0;Het_EAS=0;Het_FIN=0;Het_NFE=1;Het_OTH=0;Het_SAS=0;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.0084;MQ=52.13;MQ0=0;MQRankSum=-0.365;NCC=3529;QD=0.53;ReadPosRankSum=1.88;VQSLOD=-1.597;culprit=QD;DP_HIST=884|8073|7754|6713|16067|7869|4989|2527|1681|1063|727|519|350|219|163|75|55|28|10|36,0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST=394|4609|1672|2686|14277|3462|3699|2515|1058|1502|1184|486|15392|2869|1032|1090|696|276|289|614,0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1;CSQ=-|ENSG00000178605|ENST00000400701|Transcript|splice_region_variant&intron_variant&feature_truncation|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+6_227+53delCACGCGCCCAGGGGAGGGGCCTTCCGCGGTCTCCGTGTCACCGGTGAG|||||||||||||||||||,-|ENSG00000178605|ENST00000326153|Transcript|splice_region_variant&intron_variant&feature_truncation|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|YES||ENSP00000316598|GTPB6_HUMAN|Q8N2Q6_HUMAN&H0Y2S1_HUMAN|UPI0000246E6A||||2/6||ENST00000326153.4:c.229+6_229+53delCACGCGCCCAGGGGAGGGGCCTTCCGCGGTCTCCGTGTCACCGGTGAG|||||||||||||||||||
X   229380  .   C   G   631.08  PASS    AC=1;AC_AFR=0;AC_AMR=0;AC_Adj=1;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=1;AC_Hom=0;AC_NFE=0;AC_OTH=0;AC_SAS=1;AF=8.417e-06;AN=118806;AN_AFR=4260;AN_AMR=2896;AN_Adj=57006;AN_EAS=3510;AN_FIN=2248;AN_NFE=34976;AN_OTH=438;AN_SAS=8678;BaseQRankSum=-0.736;ClippingRankSum=0.95;DP=1413905;FS=2.911;GQ_MEAN=27.23;GQ_STDDEV=26.08;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=0;Het_AMR=0;Het_EAS=0;Het_FIN=0;Het_NFE=0;Het_OTH=0;Het_SAS=1;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.0882;MQ=57.58;MQ0=0;MQRankSum=0.363;NCC=4261;NEGATIVE_TRAIN_SITE;QD=
"""
read r.r.LastChunk()={Begin:{File:3161779510 Block:6378} End:{File:3161779510 Block:8690}} r.chunks[0]={Begin:{File:3161779510 Block:6378} End:{File:3161828034 Block:64953}} n=3449: """
15.78;ReadPosRankSum=0.736;VQSLOD=-0.3403;culprit=MQ;DP_HIST=734|6916|6503|6598|16298|8487|5063|2966|1902|1225|868|588|438|266|201|108|72|48|31|91,0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST=18209|4193|1647|1163|9918|2377|1857|1135|631|570|476|270|13229|1794|615|498|288|141|114|278,0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1;CSQ=G|ENSG00000178605|ENST00000400701|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+53G>C|||||||||||||||||||,G|ENSG00000178605|ENST00000326153|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|YES||ENSP00000316598|GTPB6_HUMAN|Q8N2Q6_HUMAN&H0Y2S1_HUMAN|UPI0000246E6A||||2/6||ENST00000326153.4:c.229+53G>C|||||||||||||||||||
X   229385  .   C   T   30700.6 PASS    AC=40;AC_AFR=1;AC_AMR=3;AC_Adj=40;AC_EAS=7;AC_FIN=0;AC_Hemi=0;AC_Het=40;AC_Hom=0;AC_NFE=28;AC_OTH=1;AC_SAS=0;AF=0.0003372;AN=118632;AN_AFR=3484;AN_AMR=2302;AN_Adj=47486;AN_EAS=2892;AN_FIN=2048;AN_NFE=30098;AN_OTH=394;AN_SAS=6268;BaseQRankSum=0.127;ClippingRankSum=-0.092;DP=1527982;FS=0.663;GQ_MEAN=23.62;GQ_STDDEV=29.54;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=1;Het_AMR=3;Het_EAS=7;Het_FIN=0;Het_NFE=28;Het_OTH=1;Het_SAS=0;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.1304;MQ=58.7;MQ0=0;MQRankSum=1.08;NCC=4429;NEGATIVE_TRAIN_SITE;QD=14.47;ReadPosRankSum=0.289;VQSLOD=-0.565;culprit=MQ;DP_HIST=541|4987|4907|6378|17069|9274|5653|3378|2130|1378|1063|770|548|380|280|174|124|70|59|153,0|0|2|1|8|11|8|2|5|1|2|0|0|0|0|0|0|0|0|0;GQ_HIST=26908|2973|1176|654|6762|1601|1169|792|400|437|366|249|12658|1576|494|407|238|101|85|270,0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|40;CSQ=T|ENSG00000178605|ENST00000400701|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+48G>A|||||||||||||||||||,T|ENSG00000178605|ENST00000326153|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|YES||ENSP00000316598|GTPB6_HUMAN|Q8N2Q6_HUMAN&H0Y2S1_HUMAN|UPI0000246E6A||||2/6||ENST00000326153.4:c.229+48G>A|||||||||||||||||||
X   229386  .   G   A   4026.56 PASS    AC=3;AC_AFR=1;AC_AMR=0;AC_Adj=3;AC_EAS=0;AC_FIN=0;AC_Hemi=0;AC_Het=3;AC_Hom=0;AC_NFE=1;AC_OTH=0;AC_SAS=1;AF=2.529e-05;AN=118642;AN_AFR=3162;AN_AMR=1968;AN_Adj=43516;AN_EAS=2510;AN_FIN=1996;AN_NFE=28096;AN_OTH=350;AN_SAS=5434;BaseQRankSum=1.13;ClippingRankSum=-0.785;DP=1549171;FS=3.943;GQ_MEAN=22.11;GQ_STDDEV=27.99;Hemi_AFR=0;Hemi_AMR=0;Hemi_EAS=0;Hemi_FIN=0;Hemi_NFE=0;Hemi_OTH=0;Hemi_SAS=0;Het_AFR=1;Het_AMR=0;Het_EAS=0;Het_FIN=0;Het_NFE=1;Het_OTH=0;Het_SAS=1;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;InbreedingCoeff=-0.1455;MQ=48.87;MQ0=0;MQRankSum=-0.321;NCC=4422;NEGATIVE_TRAIN_SITE;QD=18.14;ReadPosRankSum=0.474;VQSLOD=4.82;culprit=MQ;DP_HIST=511|4639|4606|6396|17171|9397|5825|3371|2240|1386|1084|808|563|386|305|192|134|78|58|171,0|0|0|0|0|0|0|1|0|0|0|1|0|0|0|1|0|0|0|0;GQ_HIST=30242|2092|986|707|5530|1215|988|722|393|424|353|220|12535|1485|440|377|210|95|84|223,0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|3;CSQ=A|ENSG00000178605|ENST00000400701|Transcript|intron_variant|||||||1||-1|GTPBP6|HGNC|30189|protein_coding|||ENSP00000383537||Q8N2Q6_HUMAN|UPI000059DAA9||||2/5||ENST00000400701.3:c.227+47C>T|||||||||||||||||||,A|ENSG00000178605|ENST00000326153|Transcript|intron_variant|||||||
"""

@brentp
Copy link
Contributor Author

brentp commented Sep 15, 2015

Here's another oddity that might help in debugging. If I used the ExAC file posted above and do a huge query for the chromsome, I get only a small amount of data returned:

package main

import (
    "compress/gzip"
    "io/ioutil"
    "log"
    "os"

    "github.com/biogo/hts/bgzf"
    "github.com/biogo/hts/bgzf/index"
    "github.com/biogo/hts/tabix"
)

func check(err error) {
    if err != nil {
        panic(err)
    }
}

type location struct {
    chrom string
    start int
    end   int
}

func (s location) RefName() string {
    return s.chrom
}
func (s location) Start() int {
    return s.start
}
func (s location) End() int {
    return s.end
}

func main() {

    path := os.Args[1]

    fh, err := os.Open(path + ".tbi")
    check(err)

    gz, err := gzip.NewReader(fh)
    check(err)
    defer gz.Close()

    idx, err := tabix.ReadFrom(gz)
    check(err)

    b, err := os.Open(path)
    check(err)
    bgz, err := bgzf.NewReader(b, 2)

    check(err)

    chunks, err := idx.Chunks(location{"1", 1, 99999999999})
    check(err)

    cr, err := index.NewChunkReader(bgz, chunks)
    buf, _ := ioutil.ReadAll(cr)

    log.Println(len(buf))

}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants