Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csi: tabix header indices are off by one #215

Closed
zaeleus opened this issue Nov 13, 2023 · 2 comments
Closed

csi: tabix header indices are off by one #215

zaeleus opened this issue Nov 13, 2023 · 2 comments
Assignees
Labels
bug Something isn't working csi

Comments

@zaeleus
Copy link
Owner

zaeleus commented Nov 13, 2023

@zaeleus , I am trying to do this with the following code, but the index does not seem to be correct. what do I miss?
Thanks in advance:

Example
    use noodles::core;
    use noodles::csi::{self as csi, index::reference_sequence::bin::Chunk};
    use noodles::vcf::header::record::value::{map::Contig, Map};
    use noodles::vcf::record::Position;

    use tempfile::tempdir;

    fn write_vcf(
        temp_dir: &PathBuf,
        chrom: String,
        start: usize,
        reference: &str,
        alternate: &str,
    ) -> String {
        let vcf_path = temp_dir.join("test.vcf.gz");
        let fs = File::create(&vcf_path).expect("error creating file");

        let mut wtr = bgzf::Writer::new(&fs);

        let mut writer = vcf::Writer::new(&mut wtr);
        let id = chrom.parse().expect("error parsing chrom");

        let contig = Map::<Contig>::new();
        let header = vcf::Header::builder().add_contig(id, contig.clone()).build();

        writer.write_header(&header).expect("error writing header");

        let start_position = writer.get_ref().virtual_position();

        let chrom = chrom.parse().expect("error parsing chrom");
        let record = vcf::Record::builder()
            .set_chromosome(chrom)
            .set_position(Position::from(start))
            .set_reference_bases(reference.parse().expect("error parsing reference"))
            .set_alternate_bases(alternate.parse().expect("error parsing alternate"))
            .set_info("AF=0.1".parse().expect("error parsing AF"))
            .build()
            .expect("error building record");

        writer.write_record(&header, &record).expect("error writing test record");

        let end_position = writer.get_ref().virtual_position();
        let chunk = Chunk::new(start_position, end_position);

        let alignment_context = Some((
            0 as usize,
            core::Position::try_from(start).expect("bad start in test"),
            core::Position::try_from(start + reference.len()).expect("bad end in test"),
            true,
        ));

        let mut indexer =
            csi::index::Indexer::default().set_header(csi::index::header::Builder::vcf().build());

        indexer.add_record(alignment_context, chunk).expect("error indexing");
        let index = indexer.build(1 as usize);

        let index_path = vcf_path.clone().into_os_string().into_string().unwrap() + ".csi";

        let index_file = File::create(&index_path).expect("error creating index file");
        let mut index_writer = csi::Writer::new(index_file);
        index_writer.write_index(&index).expect("error writing index");

        wtr.finish().expect("error finishing bgzf");
        return vcf_path.into_os_string().into_string().unwrap();
    }

Originally posted by @brentp in #214 (comment)


When using noodles-csi, the column indices when reading and writing a tabix header are off by one. The error comes from the tabix header using 1-based column indices, whereas noodles normalizes them to be 0-based.

This only affects the CSI reader and writer, not tabix.

@zaeleus
Copy link
Owner Author

zaeleus commented Nov 13, 2023

@brentp, this issue is fixed in noodles 0.58.0 / noodles-csi 0.27.0.

The only other issue in the given example is that you must track the reference sequence names, in the case of VCF. E.g., since there's only one record and reference sequence name:

-    let mut indexer =
-        csi::index::Indexer::default().set_header(csi::index::header::Builder::vcf().build());
+    let reference_sequence_names = [chrom].into_iter().collect();
+    let header = csi::index::header::Builder::vcf()
+        .set_reference_sequence_names(reference_sequence_names)
+        .build();
+    let mut indexer = csi::index::Indexer::default().set_header(header);

Alternatively, use tabix::index::Indexer. This works on reference sequence names instead of IDs and adds them to the index header on build.

@brentp
Copy link

brentp commented Nov 14, 2023

thanks very much @zaeleus ! Indeed it's working with latest version and that change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working csi
Projects
None yet
Development

No branches or pull requests

2 participants