Skip to content

cinquin/mutinack

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build Status

General notes

"Mutinack" is a program that detects mutations genome-wide with an error rate estimated to be as low as ~10-10, using Illumina paired end sequencing of minute amounts of starting material prepared following the SIP-HAVA-Seq protocol — as described in an accompanying manuscript [1] (see [2] for the original report of the duplex sequencing idea). Mutinack can report detailed statistics on mutations and top/bottom strand disagreements over any genome region defined in a custom BED file. While Mutinack was primarily designed with DNA-Seq in mind, it can accessorily be used to collapse RNA-Seq reads to only keep one set of reads per molecule in a cDNA library, which minimizes noise when quantifying gene expression (see e.g. [3-4]).

Installing and building Mutinack

Mutinack is a Java program that comes with all dependencies pre-packaged in a single mutinack.jar file. The only runtime requirement is a Java >= 8 runtime (and GNU make and Perl with module File::Grep for functional tests), and Mutinack should thus run on any platform that has such a runtime (FreeBSD, Mac OS X, and Linux have been tested with OpenJDK 1.8 or 1.9 early access build 161). Running under Java 9 requires a few module-related tweaks; see branch java9 of the repository. The mutinack.jar file can be directly downloaded, or built from a clone of this repository using ant jar (Ant is technically not required but makes the build very straightforward; Git is used to optionally include version information in the build). Be sure to perform a recursive clone of this repository (passing Git the --recursive argument). Note that on startup Mutinack checks whether a newer version is available for download; this can be disabled with the -skipVersionCheck flag.

Tests

Unit tests as well as a number of functional tests are provided. The former are available as Ant target junitreport, and the latter can be run using the run_functional_tests script in the base directory. Both sets of tests are run automatically as part of continuous integration builds. Each of the ~350 functional tests takes a relatively large amount of time to complete in part because of initialization costs, but the tests are run in parallel; the number of simultaneous jobs should be adjusted to match machine architecture using the N_PARALLEL_JOBS knob in the run_functional_tests script. Functional tests further rely on Nailgun (provided as a Git submodule) to avoid JVM startup costs and to optimize speed by using "warm" JVM instances; Nailgun needs to have been built for the tests to run. The functional test setup includes a mechanism to mark failed tests and automatically track them in a Git repository, which makes it possible to prioritize re-running of these tests when performing continuous integration builds across multiple machines (with e.g. Jenkins); this functionality is turned off by default to minimize initial configuration effort. Functional tests record code coverage using JaCoCo (use Ant target jacoco:report to produce a report).

The project is set up for mutation testing (these are mutations in code, not to be confused with mutations in DNA), using a custom version of PIT, enhanced to provide more details of how tests kill mutations and to be more robust against various errors such as memory exhaustion. This analysis identifies parts of the code that are sufficiently well constrained and covered by tests for changes introduced as code "mutations" to cause functional or unit tests to fail. Since the analysis is computationally costly, pre-computed mutation testing results have been made available here in HTML format (last update: January 2017). Note that as it is the analysis produces a number of false positives (i.e. mutations that that are not caught by tests, highlighted in red in the HTML output) for various reasons (e.g. because the mutated code improved performance without changing Mutinack's output, because it implemented internal sanity checks that also did not affect Mutinack's output, because explicit variable initialization to a default value is technically not required although it is useful to denote programmer intent, or because there is more than one way to achieve the exact same result). Note also that for now the tests do not aim to be exhaustive; they focus on the critical parts of the algorithm and of the output, rather than on all the functionality that is useful but that is not high priority to specifically test. The analysis can be run with ant mutationCoverage (run time is up to 1 day on a 64-core machine, depending on the mutators used). Two kinds of mutation testing analyses are provided: one with a test that compares a large subset of Mutinack's output to a precomputed snapshot, which highlights mutations that have a detectable effect on the output even if that effect is not explicitly tested for correctness, and one that does not include that "catch-all" test but that still achieves comparable mutation coverage.

The project is compiled with Google's "Error Prone" static analyzer, comes with Ant and Maven targets to run Findbugs, which is pre-packaged in the distribution, and has an Ant target to run SonarQube (the target disables reporting of a few issues that are of low interest or that generate false positives at a high rate, and disables reporting of all issues in contrib code). The code was also regularly analyzed with Coverity Scan, which has very nice additions to Findbugs and did not detect any significant issue. Note however that recent versions of Mutinack cannot be analyzed with Coverity, because of an apparent bug in Coverity that has been reported to Synopsys. The Findbugs analysis requires a custom Findbugs version (packaged with Mutinack) that contains a version of the ASM bytecode engineering library that we modified to work around a bug in the Java compiler documented here and that Oracle has acknowledged (but not yet fixed) as OpenJDK bug 8144185.

Note also that, at least on some versions of JRE 8, Mutinack can trigger a non-deterministic Java Virtual Machine bug, which was reported to Oracle and appears to have been resolved from JRE 8u74 onwards. This bug has only been observed in functional tests; to work around it, use the following JVM flags: -XX:CompileCommand=exclude,gnu.trove.impl.hash.TObjectHash::insertKey and -XX:CompileCommand=exclude,gnu.trove.impl.hash.TIntHash::insertKey.

Benchmarking

In a couple cases, benchmarks that rely on the Java Microbenchmarking Harness (JMH) are provided to justify the choice of a particular algorithm implementation over other possibilities. Results are included as comments above the declaration of the benchmark methods (your mileage may vary, the benchmarking environment is not the same as the environment for actual mutation detection runs, in particular when it comes to pressure on the garbage collector, beware of other processes running on the same machine and of e.g. "Turbo Boost" etc.). Build the benchmarks with ant clean && ant jmh_jar (note that running the clean target is not gratuitous) and run with java -jar benchmarks.jar.

Sample datasets

BAM files derived from alignment of C. elegans germ cell sequence data will be made available here.

Running Mutinack

Use java -jar mutinack.jar -help for a list of arguments and explanations. Mandatory arguments are marked with a star; at the minimum one must specify BAM files containing aligned sequence data (each BAM file must have a matching index file), and an indexed FASTA reference genome file. Alignments must be generated in a way that the custom barcodes are removed from the raw sequence data and placed in a BC attribute (with an optional BQ attribute containing the corresponding base read qualities), which can be achieved using the included trim_reads_move_BC Perl script. Alignments can be merged from multiple sequencing runs, even if the runs did not have identical read lengths. If multiple input BAM files are provided, mutations appearing in only one of these files will be highlighted by a star in the output and will be used to compute mutation rates.

Mutinack runs some operations in a multithreaded fashion. More parallelism can be extracted by specifying that contigs should be broken up into chunks to be treated independently, using the parallelizationFactor parameter; the performance gain is noticeable for well-chosen values of that parameter. Note that currently changes in parallelizationFactor lead to infinitesimal changes in computed coverage - this will be addressed in a future version.

Although efforts were made to choose algorithms and implementations that do not create an inordinate number of short-lived objects, object allocation and garbage collection can limit performance. We use the following JVM arguments, that were derived mostly by empirical means and that might need to be adapted for different architectures or usage patterns: -XX:+UseParallelGC -XX:ParallelGCThreads=10 -Xmx15G. With these settings we can achieve a throughput in excess of 3.5 million bases / s on our hardware (processing speed is data dependent).

Performance can be increased by turning off costly internal assertions that provide internal sanity checks, using -enableCostlyAssertions false; assertions that have a minimal performance impact can only be turned off by editing the code.

Under FreeBSD or Mac OS X, press control-T (or kill -INFO the process) to see a progress update on the standard error stream.

Mutinack output

JSON-formatted mutation detections and statistics are output to a file specified by the -outputJSONTo parameter. These outputs can also be parsed from tab-delimited columns in the standard output stream and auxiliary files specified as arguments.

Functionality to store outputs in a database is partially implemented (partially tested with PostgreSQL; the functionality relies on Java Data Objects and the DataNucleus implementation, and it should thus be straightforward to send the output to other database systems).

Distributed computing

Mutinack can use Java's Remote Method Invocation to run server and worker processes distributed over multiple JVM instances and multiple hosts, in a fault-tolerant fashion. See -submitToServer, -startWorker and -startServer options. Connections between different processes are secured and two-way authenticated using SSL. To ensure that the authentication cannot be trivially bypassed by would-be attackers (which may allow any user to access all files readable by the server process), be sure to run the included regenerate_SSL_files script after cloning the repository, or set up your own certificates and keys in the trust store and key store.

Contributing

Any contributions under the form of suggestions, bug reports, bug fixes, or new functionality will be greatly appreciated. Please create a Github issue and/or submit code as a pull request (the continuous integration system should automatically check the pull request).

Credits

Mutinack relies on the following libraries:

  • Picard tools by the Broad Institute (MIT License, Apache License V2.0) and its bzip2 and snappy dependencies: used to read BAM files. Some changes were made to improve performance (memoization of methods highlighted by profiling).

  • JCommander by Cédric Beust (Apache License V2.0): used to parse command-line arguments

  • GNU Trove (GNU Lesser General Public License): used for its implementations of Maps and Sets that minimize object creation, and for Lists of primitives. Small changes made are independently released here.

  • Furious object pool by eddie (Apache license)

  • Jackson by FasterXML (Apache License V2.0): used for JSON output

  • [fast-serialization](https://github.com/RuedigerMoeller/fast- serialization) by Ruediger Moeller (Apache License Version 2.0): used to cache data structures that are parsed (slowly) from text files

  • Google Protobuf (license): used for serialization of detailed genome coverage data

  • Apache Commons IO (Apache License Version 2.0)

  • An adapted version of [java-algorithms-implementation](https://github.com/phishman3579/java- algorithms-implementation) by Justin Wetherell (Apache License). Some changes were integrated upstream; others are available [here](https://github.com/cinquin/mutinack/blob/master/src/com/ jwetherell/algorithms/data_structures/IntervalTree.java).

  • Stanford CoreNLP by The Stanford Natural Language Processing Group (GNU Public License v3)

  • JUnit, [Hamcrest](https://code.google. com/p/hamcrest/), and JMockit (Eclipse Public License, BSD license, and MIT license, respectively): used for unit testing

  • PIT by Henry Coles (Apache License Version 2.0): used for mutation testing; the version included with Mutinack was adapted to provide more information on test failures, among other things

  • JaCoCo (Eclipse Public License v1.0): used for code coverage analysis

  • Findbugs (GNU Lesser General Public License): used for static analysis

  • Error Prone by Google (Apache License Version 2.0): used for static analysis at compile time

  • SonarQube (GNU Lesser General Public License): used for static analysis

  • ASM library by the OW2 Consortium (BSD license): used by Findbugs and JaCoCo; modified to work around OpenJDK bug 8144185

  • EqualsVerifier by Jan Ouwens (Apache License Version 2.0): used for testing of equals and hashCode methods

  • Logback, the SLF4J API and the Lidalia extension for logging (LGLP, MIT, and MIT X11 licenses, respectively): used for logging

  • RMIIO (Apache License Version 2.0): used to stream data over RMI. Modified to create a shaded jar artifact.

  • A jar file containing Eclipse Null/Nullable annotations is included for convenience (Eclipse Public License)

Travis and Coverity kindly provide free resources for continuous integration and static code analysis.

We use YourKit for performance analysis and debugging of memory leaks. YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit JavaProfiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.

YourKit

License

Mutinack is released under the GNU Affero General Public License version 3.

References

[1]: Taylor, P.H., Cinquin, A., Cinquin, O. (2016). Quantification of in vivo progenitor mutation accrual with ultra-low error rate and minimal input DNA using SIP-HAVA-seq. Genome Research 26(11), 1600-1611. Abstract

[2]: Schmitt, M.W., Kennedy, S.R., Salk, J.J., Fox, E.J., Hiatt, J.B., and Loeb, L.A. (2012). Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A 109, 14508–14513.

[3]: Shiroguchi, K., Jia, T.Z., Sims, P.A., Xie, X.S. (2012). Digital RNA sequencing minimizes sequence-dependent bias and amplification noise with optimized single-molecule barcodes. Proc Natl Acad Sci U S A 109, 1347–1352.

[4]: Fu, G.K., Xu, W., Wilhelmy, J., Mindrinos, M.N., Davis, R.W., Xiao, W., Fodor, S.P.A. (2014). Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations. Proc Natl Acad Sci U S A 111, 1891–1896.