Skip to content
This repository has been archived by the owner on May 25, 2024. It is now read-only.

Latest commit

 

History

History
137 lines (112 loc) · 6.08 KB

overview.md

File metadata and controls

137 lines (112 loc) · 6.08 KB

Overview

Here is an example Nucleus program:

from nucleus.io import vcf

with vcf.VcfReader('/tmp/example.vcf.gz') as reader:
  print('Sample names in VCF: ', ' '.join(reader.header.sample_names))
  with vcf.VcfWriter('/tmp/filtered.tfrecord', header=reader.header) as writer:
    for variant in reader:
      if variant.quality > 3.01:
        writer.write(variant)

Let's go through it line by line.

from nucleus.io import vcf

nucleus/io is the directory containing Nucleus's Python classes for doing reading and writing. There are modules for each of the genomics formats that Nucleus supports, including FASTA, FASTQ, SAM/BAM and VCF. That directory also contains the base classes that define the APIs for all of the various readers and writers. In brief, each reader can either be iterated over or queried by a region, and every writer has just the single method write.

with vcf.VcfReader('/tmp/example.vcf.gz') as reader:

We use Python's with pattern because the reader object returned by vcf.VcfReader is smart enough to close the file at the end of the block when we do. vcf.VcfReader is also smart enough to realize that this particular input file is compressed, and deal with that transparently. Of course, vcf files don't have to be compressed; VcfReader would have accepted an uncompressed /tmp/example2.vcf just as well.

In order to use the range query functionality of VCF, we assume that BGZIP'd VCF files come along with a TABIX index that allows us to efficiently do random access queries on genomic ranges. For /tmp/example.vcf.gz we would assume the index is in the file /tmp/example.vcf.gz.tbi. The query method will raise an exception if there is no index file present (or if the file is text VCF, which is not indexable).

  print('Sample names in VCF: ', ' '.join(reader.header.sample_names))

This does what you would expect; it prints the names of all the samples present in the input VCF file. Every Nucleus reader object contains a .header field with information about the file's header and metadata. (For some file types, like reference FASTA files, the header information also comes from the file's index's header.)

The names of the subfields of the .header of a VCF file can be found in the definition of the protocol buffer message VcfHeader in variants.proto. Using protocol buffers is mostly intuitive -- you can treat messages as named tuples, repeated fields as lists, map fields as dictionaries, etc. -- and for more information you can refer to the protocol buffer documentation.

  with vcf.VcfWriter('/tmp/filtered.tfrecord', header=reader.header) as writer:

You might be surprised by the file name here. vcf.VcfWriter can certainly write to VCF files, and if called with say vcf.VcfWriter('/tmp/out.vcf') or vcf.VcfWriter('/tmp/out.vcf.gz') it would do just that.

But Nucleus reader and writers can also read and write from the TFRecords file format used by TensorFlow. As the above example suggests, the desired file format is indicated by the file extensions: '.tfrecord' (or '.tfrecord.gz') means TFRecords, while '.vcf' means VCF.

It has been our experience that this ability to easily switch between native genomics files and TFRecords files makes it easier to gradually introduce machine learning into genomics workflows. We hope you find it convenient as well!

    for variant in reader:

As mentioned above, every Nucleus reader supports iteration. You can do it this way, directly using the reader object, or if you need a Python iterable to pass to another routine, you can get one from calling reader.iterate().

Each variant in this loop will be a Variant protocol buffer object, defined in variants.proto. Most of the Variant fields are easy to work with: variant.reference_name is the CHROM field, variant.start is the starting position, etc. But the variant.info and variant_call.info map fields are a little bit trickier because they contain typed information, and those types depend on the header definitions. To make things easier, Nucleus defines some convenience routines for setting and getting these fields in util/variant_utils.py and util/variantcall_utils.py.

      if variant.quality > 3.01:
        writer.write(variant)

These lines should now be easy to understand -- we test the quality field of the Variant, and if it is high enough, we write it to the output. Unsurprisingly, the write method of Nucleus writers takes a protocol buffer of the appropriate type as input.

(If you are wondering where the 3.01 came from, quality is measured on a Phred-scale, quality = -10 * log_10(probability), and -10 * log_10(0.5) is approximately 3.01.)

Here's a table to summarize the file types currently supported by Nucleus, and their associated protocol buffer types:

Format Record Type Header Type Reader? Writer?
BedGraph BedGraphRecord none Y Y
BED BedRecord BedHeader Y Y
FASTA Ref string RefFastaHeader Y N
FastQ FastqRecord none Y Y
GFF GffRecord GffHeader Y Y
SAM/BAM Read SamHeader Y Y
TFRecord Any Any Y Y
VCF Variant VcfHeader Y Y

Don't despair if your favorite genomics format isn't listed, though, as we hope to add more soon. (And contributions are welcomed!)