This is the tutorial for this exercise, part of my presentation on JavaScript and Bioinformatics. An overview is also available as a slide deck.
This tutorial will walk you through creating a small web app that:
- accesses the NCBI with bionode-ncbi
- performs a multiple sequence alignment with muscle through msa
- visualizes the results with biojs-msa
First comes an introduction to creating an npm package and installing dependencies. You will learn to consume and produce functions in the callback style, and get ample practice with Node's streams. You will construct a (simple) RESTful API with Express. browserify will be demonstrated as a tool for bundling code, letting you use npm packages in the browser. Interoperability with R will be achieved by communicating between Node and an R script with stdio pipes and jsonlite.
Note that none of the heavy algorithmic lifting will be performed by JavaScript. So why this push for scientific computing in JS? Well, speed- and memory-intensive programs are nearly always written in C/C++. For example, msa from Bioconductor provides an "interface to the multiple sequence alignment algorithms ClustalW, ClustalOmega, and Muscle". SciPy also uses native code. Meanwhile, JavaScript engines such as V8 from Google, SpiderMonkey from Mozilla and Chakra from Microsoft (used in Edge) have been getting faster and faster. See JS vs Python and these benchmarks from the Julia project.
Speed is important for large tasks, yes, but languages are normally chosen over others for their available packages. While it is true that npm is the largest collection of open source modules, these are mostly related to web development. The communities for scientific computing in Python/CPython and bioinformatics in R with Bioconductor have years of development and much more refined packages to choose from.
But the heavy lifting is still done in C. C can be, and has been, wrapped in Node in a variety of modules (see: node-gyp). V8 is C++ after all. Furthermore, emscripten is an LLVM-to-JavaScript compiler. It can compile into asm.js, which is a low-level subset of JS. Engines can recognize asm.js and make optimizations, so you can end up with native code running as JS in the browser at about 70% of native speed! Another exciting development is WebAssembly: it will let you compile C into a binary format that can run in the browser, with more languages coming down the road. So is the future set? Will scientific computing in JS become popular? I sure hope so:
- JS is the language of the web and is here to stay and evolve
- The best way to share high fidelity data visualizations is over the browser with JS and WebGL
- Electron can be used to create cross-platform desktop apps written in JS with file system access
- C programs can be wrapped in Node and are beginning to be compiled into asm.js. Thus the only barrier to msa, for instance, being available in Node is developers writing the glue.
- WebAssembly is coming!
- JS for searching and downloading data can be used on the server and in the browser. This can simplify development of bioinformatics web applications.
For more discussion of the implementation of these technologies and others (GPU computation for example), see codeforscience/webdata. See also the talk by Max Ogden of the Dat project, JavaScript, For Science!.
- Setting up the project
- Getting Dependencies
- bionode-ncbi
- Static File Server with Express
- NCBI Fetch
- Callbacks
- Pipes
- Into the Browser
- BioJS: MSA
- Interoperability with R
Install Node if you haven't done so already. The first thing you do when starting any modern JavaScript project is initialize it (`cd` into your project directory and run):
npm init
You will be asked to provide values for certain keys. The `name` will default to the current directory. It's perfectly fine to just press enter through everything. This will create a `package.json` file. This file defines everything about your package: what is needed to publish it to npm, development dependencies and project dependencies, arbitrary scripts to run, and settings for interoperating with other tools. For example, linters like jshint can have their options specified either in a `.jshintrc` file or inside `package.json` under the `jshintConfig` key. After running this, my `package.json` looks like this (I switched to the MIT license):
{
"name": "js-bioinformatics-exercise",
"version": "1.0.0",
"description": "The tutorial for this [exercise][exercise] as part of my presentation on [JavaScript and Bioinformatics][js-and-bioinformatics].",
"main": "index.js",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"repository": {
"type": "git",
"url": "git+https://github.com/thejmazz/js-bioinformatics-exercise.git"
},
"author": "",
"license": "MIT",
"bugs": {
"url": "https://github.com/thejmazz/js-bioinformatics-exercise/issues"
},
"homepage": "https://github.com/thejmazz/js-bioinformatics-exercise#readme"
}
Let's get started by downloading bionode-ncbi and making sure to store it under our dependencies by using the `--save` option:
npm install bionode-ncbi --save
You'll now notice the following has been added to `package.json`:
{
"dependencies": {
"bionode-ncbi": "^1.6.0"
}
}
As well, `bionode-ncbi` lives at `node_modules/bionode-ncbi`. You don't commit this `node_modules` folder: when someone downloads this package they can simply run `npm install` and it will install everything under the `dependencies` and `devDependencies` keys. Thus managing dependencies in the npm ecosystem is simple and robust. Note as well that when we installed bionode-ncbi, npm also installed bionode-ncbi's own dependencies inside the bionode-ncbi folder, and so on ever deeper for each dependency.
Refer to my slide overviewing the bionode-ncbi API. bionode-ncbi lets us access the NCBI E-utilities through callbacks, events, and streams. Which to use is up to you. Check out try-bionode-esnext. This uses ES6 syntax but the concepts are the same. For this tutorial I'll stick to ES5 however. Let's start writing `main.js`:
var ncbi = require('bionode-ncbi');
var fs = require('fs');
var query = ncbi.search('protein', 'mbp1');
function dataLogger(data) {
// Assumes `data` directory already exists
var fileName = 'data/' + data.uid + '.json';
fs.writeFileSync(fileName, JSON.stringify(data));
console.log('Wrote ' + fileName);
}
query.on('data', dataLogger);
We require the `bionode-ncbi` and `fs` (filesystem) modules. `query` is the object returned by `ncbi.search`. It emits a `data` event, which we listen for with `dataLogger` as the callback. This simply writes the retrieved JSON to the folder `data`. (You will need to `mkdir data` first. I just didn't want to introduce checking whether the directory exists, creating it, and so on into this minimal example.)
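If you would rather not create the directory by hand, here is a minimal sketch of doing it from Node. The `ensureDir` helper is my own name, not part of bionode-ncbi, and the `recursive` option assumes a reasonably recent Node (10.12 or later). The demo writes into a temp directory so it won't touch your project.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Create `dir` (and any missing parents) if it does not exist yet.
// With `recursive: true`, mkdirSync does not throw when the directory is already there.
function ensureDir(dir) {
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

// Demo in a throwaway temp directory (hypothetical location, not the tutorial's ./data)
var demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'ncbi-demo-'));
var dataDir = ensureDir(path.join(demoRoot, 'data'));
ensureDir(dataDir); // calling it a second time is harmless

console.log(fs.statSync(dataDir).isDirectory());
```

In `main.js` you could call `ensureDir('data')` once before attaching the `data` listener.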
To run this:
node main.js
At this point you can `ls data` and see what turned up!
There are quite a lot of results. These are all from different organisms. To be able to quickly browse through them, we can set up an express static file server with serve-index (install both with `npm install express serve-index --save`), in `server.js`:
var express = require('express');
var serveIndex = require('serve-index');
var app = express();
app.use(serveIndex('data'));
app.use(express.static('data'));
app.listen(3000);
console.log('Express server listening on port 3000');
Start the server with `node server.js` and navigate to `http://localhost:3000` in your web browser. You can now easily browse through the files. I highly recommend the Chrome addon JSON Formatter. This is what you should see:
It would be nice to implement left/right arrow keys to move between the files, but for now this suits our purposes: getting a quick overview of the data we are dealing with and its general format. One way to generate a generic schema would be to loop through each file and keep track of the keys and `typeof(result[key])` that are common among all.
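The schema idea above could be sketched roughly like this. It works on already-parsed objects, so reading the files from `data` is left out. The `commonSchema` name and the sample records (including the `taxid` field) are made up for illustration.

```javascript
// Given an array of parsed result objects, return { key: type } for every key
// that is present in all of them with the same `typeof`.
function commonSchema(records) {
  if (records.length === 0) return {};
  var schema = {};
  // Start from the first record's keys...
  Object.keys(records[0]).forEach(function (key) {
    schema[key] = typeof records[0][key];
  });
  // ...and drop any key that is missing or differently typed in a later record
  records.slice(1).forEach(function (record) {
    Object.keys(schema).forEach(function (key) {
      if (typeof record[key] !== schema[key]) delete schema[key];
    });
  });
  return schema;
}

// Hypothetical records, not real NCBI output
var sample = [
  { uid: '1', title: 'a', taxid: 4932 },
  { uid: '2', title: 'b', taxid: '4932' },
  { uid: '3', title: 'c' }
];

console.log(commonSchema(sample));
```

Here `taxid` drops out because it is a number in one record, a string in another, and absent from the third.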
Take a look at 1431055.json. Disclaimer: I got `1431055` from the web results instead of looking through each file, though we could of course do that programmatically.
We have all these results for Mbp1 proteins in different organisms. Why not run an MSA and see if any regions are more conserved than others? To do this we will need to retrieve sequences. For that, we can use the `fetch` function from bionode-ncbi. Check out the table for Entrez eFetch Databases. Run the following just to test it out (after installing bionode-ncbi globally with `npm install -g bionode-ncbi`, so we can use it in the shell):
bionode-ncbi fetch protein 1431055
This returns:
{
"id":"gi|1431055|emb|CAA98618.1| MBP1 [Saccharomyces cerevisiae]",
"seq":"MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDRKKAIRSASTSAIMETKRNNKKAEENQFQSSKILGNPTAAPRKRGRPVGSTRGSRRKLGVNLQRSQSDMGFPRPAIPNSSISTTQLPSIRSTMGPQSPTLGILEEERHDSRQQQPQQNNSAQFKEIDLEDGLSSDVEPSQQLQQVFNQNTGFVPQQQSSLIQTQQTESMATSVSSSPSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKVNKYLSKLVDYFISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFHWACSMGNLPIAEALYEAGTSIRSTNSQGQTPLMRSSLFHNSYTRRTFPRIFQLLHETVFDIDSQSQTVIHHIVKRKSTTPSAVYYLDVVLSKIKDFSPQYRIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTTISNKEGLTANEIMNQQYEQMMIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSPVSPSDYITYPSQIATNISRNIPNVVNSMKQMASIYNDLHEQHDNEIKSLQKTLKSISKTKIQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTKKLRKRLIRYKRLIKQKLEYRQTVLLNKLIEDETQATTNNTVEKDNNTLERLELAQELTMLQLQRKNKLSSLVKKFEDNAKIHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA"
}
Sweet, so we will be able to get sequences. However, not all of the search results are worth comparing, so let's keep only the ones that have `mbp1` in their title. The following code achieves that, in `collect-seqs.js`:
var fs = require('fs');
// return array of uids of proteins with title containing `mbp1`
function filter(proteins, cb) {
var num = 0;
var filtered = [];
var check = function(err, data) {
if (err) return cb(err); // stop here, `data` is undefined on error
num+= 1;
var obj = JSON.parse(data);
if (obj.title !== undefined && obj.title.toUpperCase().indexOf('MBP1') >= 0) {
filtered.push(obj.uid);
}
tryFinish();
};
var tryFinish = function() {
if (num === proteins.length) {
cb(null, filtered);
}
};
proteins.forEach(function(protein) {
fs.readFile('data/' + protein, check);
});
}
filter(fs.readdirSync('data'), function(err, mbp1s) {
if (err) return console.error(err); // `mbp1s` is undefined on error
mbp1s.forEach(function(uid) {
fs.readFile('data/' + uid + '.json', function(err, data) {
if (err) return console.error(err);
var obj = JSON.parse(data);
console.log(obj.title);
});
});
});
Now, that's pretty callback heavy. What is happening here and why? JavaScript has one event loop, and only one function can run at a time. Thus the standard way to consume asynchronous operations (i.e. those which will take time, for example reading the contents of a file or waiting for a web request) is to pass a callback with the idiomatic signature `function(err, data)`. For example, reading a file:
fs.readFile('arbitraryBytes.ab', function(err, data) {
// handle error (poorly in this case)
if (err) console.error(err);
// do stuff with data!
var obj = JSON.parse(data);
})
But how do we produce these asynchronous operations? We define a function that takes a callback function as a parameter. Then when everything is really done, we call that function with our result: `cb(null, data)`. Or, if something goes wrong, we pass the error to it: `cb(err)`.
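As a small sketch of producing such a function: `parseJsonAsync` below is a made-up helper, not something from bionode, that reports its outcome through an error-first callback. It uses `setImmediate` only to imitate work that finishes later.

```javascript
// Parse a JSON string asynchronously, reporting the outcome through an
// error-first callback: cb(err) on failure, cb(null, result) on success.
function parseJsonAsync(text, cb) {
  // Defer the work so the callback never fires before parseJsonAsync returns
  setImmediate(function () {
    var obj;
    try {
      obj = JSON.parse(text);
    } catch (err) {
      return cb(err);
    }
    // Called outside the try block so errors thrown by cb are not swallowed
    cb(null, obj);
  });
}

parseJsonAsync('{"uid": "1431055"}', function (err, obj) {
  if (err) return console.error(err);
  console.log(obj.uid);
});

parseJsonAsync('not json', function (err) {
  console.log(err instanceof SyntaxError);
});
```

Note the `return cb(err)`: without the `return`, execution would carry on and call the callback a second time.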
So what is happening in `filter`? First, we read every file that was passed in:
proteins.forEach(function(protein) {
fs.readFile('data/' + protein, check);
});
Instead of writing an inline function, `check` is declared elsewhere. `check` is an impure function which has side effects: it increments `num` each time it is called. When `check` has finished its work it calls `tryFinish`. `tryFinish` simply checks if `num === proteins.length`, i.e. whether all the files have been read and parsed. We do not know the order in which these files are processed, and it may be different each time. Thus, an alternative is to assign each result to a fixed index in an array (yes, `var a = []; a[5] = 'five'` is totally valid JS). In this case the order is not too important so I just threw the uids in there. Finally, when `tryFinish` discovers that we have in fact processed every file, it calls `cb` with `null` as the error, and the produced data.
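The fixed-index alternative could look something like the sketch below. `mapAsync` and `slowUpper` are hypothetical names. The worker uses timers so that the first item deliberately finishes last.

```javascript
// Run `worker(item, cb)` for every item in parallel and collect the results
// in input order, whatever order the callbacks actually arrive in.
function mapAsync(items, worker, cb) {
  var results = [];
  var done = 0;
  var failed = false;
  if (items.length === 0) return cb(null, results);
  items.forEach(function (item, i) {
    worker(item, function (err, result) {
      if (failed) return;          // an earlier error was already reported
      if (err) {
        failed = true;
        return cb(err);
      }
      results[i] = result;         // fixed slot, independent of finish order
      done += 1;
      if (done === items.length) cb(null, results);
    });
  });
}

// Hypothetical worker: 'first' takes the longest to complete
function slowUpper(word, cb) {
  var delay = word === 'first' ? 30 : 5;
  setTimeout(function () { cb(null, word.toUpperCase()); }, delay);
}

mapAsync(['first', 'second', 'third'], slowUpper, function (err, out) {
  if (err) return console.error(err);
  console.log(out);
});
```

The `failed` flag also makes sure `cb` is called at most once when something goes wrong, which the counter-only version in `filter` does not guarantee.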
You might be beginning to notice what is called callback hell. There are methods to avoid this endlessly indented dread however: you can chain the `.then()`s of Promises (see also bluebird, q). ES6 introduces generators, which are a neat type of "iterator" that you can pass data into mid-state with `next(datum)`. Promises and generators can be combined (in a way which was not originally intended but works great!) and run through a "generator engine" of some sort (like co), and the result is asynchronous code that looks synchronous. Co essentially implements what is in the draft for ES7, `async`/`await`. Check out try-bionode-esnext to see this next-next-generation JS put to use to consume callbacks without indenting. Browsers are beginning to implement ES6 features; in the meantime you can use Babel to transpile ES6 into ES5.
Extra note: `Promise.all([...])` is also quite useful and can be used to wait for a bunch of async operations to finish.
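For comparison with the counter in `filter`, here is a rough sketch of reading several JSON files with `Promise.all`. It assumes a Node version with native Promises and `util.promisify`. `readAllJson` is a made-up helper, and the demo files are written to a temp directory rather than `data`.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');
var util = require('util');

// Promise-returning version of the callback-style fs.readFile
var readFile = util.promisify(fs.readFile);

// Read and parse several JSON files in parallel; resolves with the parsed
// objects in the same order as `fileNames`.
function readAllJson(fileNames) {
  return Promise.all(fileNames.map(function (fileName) {
    return readFile(fileName, 'utf8').then(function (text) {
      return JSON.parse(text);
    });
  }));
}

// Demo with two throwaway files (hypothetical contents)
var dir = fs.mkdtempSync(path.join(os.tmpdir(), 'promise-demo-'));
var files = ['1.json', '2.json'].map(function (name, i) {
  var fileName = path.join(dir, name);
  fs.writeFileSync(fileName, JSON.stringify({ uid: String(i + 1) }));
  return fileName;
});

var allRead = readAllJson(files);
allRead.then(function (objs) {
  console.log(objs.map(function (o) { return o.uid; }));
}).catch(console.error);
```

`Promise.all` keeps results in input order and rejects as soon as any one read or parse fails, so there is no manual counter to maintain.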
That was quite a bit. But if you got here and understand `collect-seqs.js`, that's great. Coming to grips with callbacks is central to understanding how async operations are handled in a language that runs on one thread. You may think, wow, one thread, that doesn't sound too great! But it actually makes for easily scalable RESTful APIs: on any request the server can take it without being stuck handling someone else, and the callbacks will flow in when they are ready.
Take a look at the output `node collect-seqs.js` produces.
As you can see, the data filtering pipeline developed so far is becoming clouded by async idioms and, basically, code. It's important to keep a clear and concise focus on what our code should achieve and on how readable its purpose is to an outside observer. Yes, new ES6/ES7 features can make it much cleaner, but we are still wasting time repeatedly writing and reading files. Wouldn't it be great if there was some way to process data piece by piece as it came in from the Internet? There is, and it is called a stream. If there is a package for R for dealing with BLOB streams please let me know; a quick Google search brought up this, but at first glance it does not appear terribly similar. Stream support in R would make for fantastic interoperability with bionode/gasket.
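To get a feel for the idea before bringing in any packages, here is a minimal sketch using only Node's core `stream` module. The `filterStream` and `titleLines` helpers and the sample records are my own stand-ins, not bionode APIs. A `PassThrough` plays the role of results arriving over the network.

```javascript
var stream = require('stream');

// Object-mode Transform that passes through only the objects `test` accepts
function filterStream(test) {
  return new stream.Transform({
    objectMode: true,
    transform: function (obj, encoding, done) {
      if (test(obj)) this.push(obj);
      done();
    }
  });
}

// Transform that takes objects in and emits one line of text per object
function titleLines() {
  return new stream.Transform({
    writableObjectMode: true,
    transform: function (obj, encoding, done) {
      done(null, obj.title + '\n');
    }
  });
}

// Hypothetical records standing in for search results
var source = new stream.PassThrough({ objectMode: true });
var lines = [];

var out = source
  .pipe(filterStream(function (obj) { return /mbp1/i.test(obj.title); }))
  .pipe(titleLines());

out.on('data', function (chunk) { lines.push(chunk.toString()); });
out.on('end', function () { console.log(lines.join('')); });

source.write({ title: 'Mbp1p [Saccharomyces cerevisiae]' });
source.write({ title: 'unrelated protein' });
source.end({ title: 'MBP1 [Candida albicans]' });
```

Each record is handled as soon as it is written, and nothing is stored on disk in between.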
Let's revisit this data acquisition and filtering pipeline in the context of the stream APIs bionode provides, in `piped.js`:
var ncbi = require('bionode-ncbi');
var es = require('event-stream');
var filter = require('through2-filter');
ncbi.search('protein', 'mbp1')
.pipe(filter.obj(function (obj) {
return obj.title.match(/^mbp1p?.*\[.*\]$/i);
}))
.pipe(es.through(function (data) {
this.emit('data', data.title + '\n');
}))
.pipe(process.stdout);
That produces this output.
Why didn't we just do this in the first place, you might ask? It's very important to understand callbacks. Also, these different approaches may each be superior in different scenarios.
Let's filter it down to just different species, extract the `gi` id, and fetch the sequence. In `piped2.js`:
var ncbi = require('bionode-ncbi');
var es = require('event-stream');
var filter = require('through2-filter');
var concat = require('concat-stream');
var tool = require('tool-stream');
var concatStream = concat(function(array) {
console.log(array);
});
var species = [];
ncbi.search('protein', 'mbp1')
.pipe(filter.obj(function (obj) {
return obj.title.match(/^mbp1p?.*\[.*\]$/i);
}))
.pipe(filter.obj(function (obj) {
var specieName = obj.title.substring(obj.title.indexOf('[') + 1, obj.title.length-1);
specieName = specieName.split(' ').slice(0,1).join(' ');
if (species.indexOf(specieName) >= 0) {
return false;
} else {
species.push(specieName);
return true;
}
}))
.pipe(tool.extractProperty('gi'))
.pipe(ncbi.fetch('protein'))
.pipe(concatStream);
This produces this output, which is an array of objects. Note the sequences are of quite varying lengths. For now, we won't perform an MSA and will view the sequences in the browser with biojs-msa as if they all align from the start. Why? That much we can do entirely within the browser, but as far as I know there is no MSA implementation in JavaScript.
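The two filters in `piped2.js` can be tried without touching the network. The sketch below pulls their logic into plain functions (the names and the sample titles are my own). Note that `split(' ').slice(0, 1)` keeps only the first word of the organism name, so the deduplication is effectively per genus.

```javascript
var titlePattern = /^mbp1p?.*\[.*\]$/i;

// Text inside the trailing square brackets, e.g. "Saccharomyces cerevisiae"
function organismOf(title) {
  return title.substring(title.indexOf('[') + 1, title.length - 1);
}

// First word of the organism name (the genus), which is what the
// `split(' ').slice(0, 1)` step in piped2.js keeps
function genusOf(title) {
  return organismOf(title).split(' ')[0];
}

// Keep only matching titles, and only the first one seen for each genus
function uniqueByGenus(titles) {
  var seen = [];
  return titles.filter(function (title) {
    if (!titlePattern.test(title)) return false;
    var genus = genusOf(title);
    if (seen.indexOf(genus) >= 0) return false;
    seen.push(genus);
    return true;
  });
}

// Hypothetical titles in the shape the regex expects
var titles = [
  'Mbp1p [Saccharomyces cerevisiae S288c]',
  'MBP1 [Saccharomyces cerevisiae]',
  'Mbp1p [Candida albicans]',
  'transcription factor MBP1'
];

console.log(uniqueByGenus(titles));
```

With these sample titles the second entry is dropped as a repeat genus and the last one because it does not start with the query.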
At the moment, there is no standard way of importing modules in the browser. That is, `require` is undefined. With ES6, `import` and `export` will be available! We will use Browserify to bundle our scripts from an entry point. Install browserify (`npm install -g browserify`) and bundle `piped2.js` into `bundle.js` with the debug option so we get source maps:
browserify piped2.js -o public/bundle.js --debug
Now, I actually get an error doing this:
Error: Cannot find module 'browserify-fs' from '/Users/jmazz/Documents/repos/js-bioinformatics-exercise/node_modules/bionode-ncbi/node_modules/bionode-fasta/lib'
...
Perhaps someone missed the `--save` on `npm install browserify-fs`. Looking into `node_modules/bionode-ncbi/node_modules/bionode-fasta/package.json`, indeed, it is not in `dependencies`. But it is in `devDependencies`! And as well there is:
"browser": {
"fs": "browserify-fs"
}
The `browser` object is for browserify. So there definitely has been an attempt to make this work. Either way, we can get around this by doing
browserify piped2.js -o public/bundle.js --debug -r fs:browserify-fs
as I gathered from the browserify-fs readme. Another solution would have been to `npm install browserify-fs` inside `bionode-fasta`. However, you should avoid modifying dependencies, and if you do, you should issue a pull request.
Then create a simple `public/index.html` (I got the css from here):
<!doctype html>
<html>
<head>
<title>biojs msa visualization</title>
<link rel="stylesheet" href="msa.min.css" />
</head>
<body>
<script src="bundle.js"></script>
</body>
</html>
With some small modifications to `server.js` we can get this running (it's important to serve the page from localhost, since opening html files directly in the browser doesn't let you do as many things):
app.use('/data', serveIndex('data'));
app.use('/data', express.static('data'));
app.use(express.static('public'));
View it in Chrome and open up developer tools (`cmd`+`option`+`i` on OS X, or right-click->inspect). You will see there is an error:
Uncaught TypeError: Cannot read property 'write' of undefined, index.js:6
Expanding the error and looking through the trace, I find that line 43 makes this call:
module.exports.stdout = module.exports(process.stdout);
where in line 5 and 6 we have
module.exports = function(stream) {
var write = stream.write;
// ...
}
Hmm, do we have `process` in the browser? Nope. Hence `process.stdout.write` will fail. Browserify is supposed to replace these things (see advanced options). However, I think this requires `require('process')` to work? Passing in `--insert-globals` didn't do the trick either. After inspecting the code this was my hacky fix: in `node_modules/bionode-ncbi/node_modules/nugget/package.json` I added
"browser": {
"single-line-log": false
}
Essentially this just ignores the module that is causing the issue. This is safe because `single-line-log` won't get used unless we use the `verbose` option, which apparently is not being passed, given that this hack works. browser-stdout and process exist, and perhaps I'll put together a pull request implementing those sometime.
Note: Most of the time, browserify works wonderfully! Pure-js modules will always work. Obviously there will be some issues when porting Node server code into the browser. It's unlikely the single-line-log author expected his module to be used in the browser.
I've written fix-nugget for this, and added it to the postinstall script. So now at least anyone pulling this repository won't have those issues.
Browsing through the msa readme, I took the "b) Import your own seqs" snippet and the "sequence model", to produce `msa.js`:
var msa = require("msa");
// other requires from piped2.js
var msaDiv = document.createElement('div');
document.body.appendChild(msaDiv);
var concatStream = concat(function(sequences) {
sequences = sequences.map(function(seq) {
var props = seq.id.split('|');
seq.id = props[1];
seq.name = props[4];
return seq;
});
console.log(sequences);
var m = new msa({
el: msaDiv,
seqs: sequences
});
m.render();
});
// ncbi.search from piped2.js
Then I ran `npm run bundle` (see `scripts` in `package.json`).
Here's what it looks like! Be sure to play a bit with the controls. You can change row order and find motifs (via RegEx) for example.
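The id-splitting step in `msa.js` can be tried on its own. This sketch assumes every header has the same `gi|...|emb|...| name` layout as the one example fetched earlier, which may not hold for other databases. `parseHeader` and the `accession` field are my additions. It also trims the leading space that `props[4]` would otherwise carry.

```javascript
// Split a header such as
//   gi|1431055|emb|CAA98618.1| MBP1 [Saccharomyces cerevisiae]
// into the pieces msa.js uses. Field positions are assumed from that one example.
function parseHeader(header) {
  var props = header.split('|');
  return {
    id: props[1],
    accession: props[3],
    name: (props[4] || '').trim()
  };
}

var parsed = parseHeader('gi|1431055|emb|CAA98618.1| MBP1 [Saccharomyces cerevisiae]');
console.log(parsed);
```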
Check piped3.js. It produces a `.ndjson` file, or newline delimited JSON. Its output is in seqs.ndjson. Using that as input, the msa Bioconductor package produces seqsAligned.ndjson. But these are files. We want to work with streams!
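In case ndjson is new to you: it is simply one JSON document per line. Here is a minimal sketch of writing and reading it by hand, with made-up helper names and shortened, hypothetical sequences. In the pipeline itself the `ndjson` npm package does the parsing.

```javascript
// Serialize an array of objects as newline-delimited JSON
function toNdjson(objects) {
  return objects.map(function (obj) {
    return JSON.stringify(obj) + '\n';
  }).join('');
}

// Parse newline-delimited JSON back into an array, skipping blank lines
function fromNdjson(text) {
  return text.split('\n').filter(function (line) {
    return line.trim() !== '';
  }).map(function (line) {
    return JSON.parse(line);
  });
}

// Hypothetical, shortened records
var seqs = [
  { id: 'seq1', seq: 'MSNQIYSARYSGV' },
  { id: 'seq2', seq: 'MSNQIYSAKYSGV' }
];

var text = toNdjson(seqs);
console.log(text);
console.log(fromNdjson(text).length);
```

Because every line is complete on its own, a consumer can start working on the first record before the last one has been produced, which is what makes the format a good fit for streams.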
jsonlite on CRAN supports streaming of JSON, but only through ndjson. The following R script uses msa from Bioconductor to align our sequences. In `msa.r`:
#!/usr/bin/env Rscript
# Packages
# Install a package on first use, then load it (require() alone only loads it when already installed)
if (!require(Biostrings, quietly=TRUE)) {
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
}
data(BLOSUM62)
if (!require(msa, quietly=TRUE)) {
source("https://bioconductor.org/biocLite.R")
biocLite("msa")
library(msa)
}
if (!require(jsonlite, quietly=TRUE)) {
install.packages("jsonlite")
library(jsonlite)
}
# Open stdin connection
stdin <- file("stdin")
open(stdin)
# jsonlite parse stdin ndjson into data frame
seqs <- stream_in(stdin, verbose=FALSE)
# Create AAStringSet vector out of sequences
seqSet <- AAStringSet(c(seqs$seq))
# Make sure to set names so we can identify later!
seqSet@ranges@NAMES <- seqs$id
# Compute alignment with MUSCLE
msa <- msaMuscle(seqSet, order="aligned")
# Alter values in seqs data frame
for (i in 1:nrow(msa)) {
seqs$id[i] = msa@unmasked@ranges@NAMES[i]
seqs$seq[i] = as.character(msa@unmasked[i][[1]])
}
# Back to stdout
stream_out(seqs, verbose=FALSE)
Now, how do we make this interact with JS? Well, it's not possible to run it in the browser. We can, however, integrate this script into an Express API easily and then request the aligned sequences from the frontend. R has support for pipes and socket connections, which `stream_in` and `stream_out` from jsonlite can use. So perhaps this R script can be made to fit into the stream in Node. To use `msa.r` as a child process, it needs to be executable: `chmod u+x msa.r`. See `streamMsa.js`:
var ncbi = require('bionode-ncbi');
var es = require('event-stream');
var filter = require('through2-filter');
var concat = require('concat-stream');
var tool = require('tool-stream');
var cp = require('child_process');
var ndjson = require('ndjson');
// Only supports one level deep property
// i.e. car['wheels'] and not car['wheels.tire']
// for that, do car.wheels['tire']
function propMatchRegex(obj, prop, regex) {
return obj[prop].match(regex);
}
function getProteinSeqs(req, res, next) {
var opts = req.opts;
// var species = [];
// Absolute path on the author's machine: change this to wherever msa.r lives on yours
var rMSA = cp.spawn('/Users/jmazz/r/js-bioinformatics-exercise/msa.r');
var stream = ncbi.search('protein', opts.query);
opts.filters.forEach(function (f) {
stream = stream.pipe(filter.obj(f));
});
if (opts.uniqueSpecies) {
// This will actually belong to scope of function
var species=[];
stream = stream
.pipe(filter.obj(function (obj) {
var specieName = obj.title.substring(obj.title.indexOf('[') + 1, obj.title.length-1);
specieName = specieName.split(' ').slice(0,1).join(' ');
if (species.indexOf(specieName) >= 0) {
return false;
} else {
species.push(specieName);
return true;
}
}));
}
stream
.pipe(tool.extractProperty('gi'))
.pipe(ncbi.fetch('protein'))
.pipe(es.through(function (obj) {
this.emit('data', JSON.stringify(obj) + '\n');
}))
.pipe(rMSA.stdin);
var seqs=[];
rMSA.stdout
.pipe(ndjson.parse())
.on('data', function(data) {
seqs.push(data);
})
.on('end', function() {
res.send({
seqs: seqs
});
});
}
module.exports = {
getProteinSeqs: getProteinSeqs,
propMatchRegex: propMatchRegex
};
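The stdin/stdout plumbing can be tested without R installed. In this sketch a child Node process that just echoes its input stands in for `msa.r`; that swap is mine, purely for illustration. The parent writes ndjson to the child's stdin and parses whatever comes back on stdout.

```javascript
var cp = require('child_process');

// Stand-in for msa.r: a child Node process that copies stdin to stdout unchanged
var child = cp.spawn(process.execPath, ['-e', 'process.stdin.pipe(process.stdout)']);

var output = '';
child.stdout.setEncoding('utf8');
child.stdout.on('data', function (chunk) { output += chunk; });

// Surface spawn failures (e.g. a wrong path) instead of hanging silently
child.on('error', function (err) { console.error('spawn failed:', err); });

// 'close' fires once the child has exited and its stdio streams have ended
child.on('close', function (code) {
  var records = output.split('\n').filter(Boolean).map(function (line) {
    return JSON.parse(line);
  });
  console.log(code, records);
});

// Send two ndjson records, then end stdin so the child knows the input is complete
child.stdin.write(JSON.stringify({ seq: 'msnq' }) + '\n');
child.stdin.end(JSON.stringify({ seq: 'iysa' }) + '\n');
```

Ending stdin matters: `stream_in` in the R script reads until end of input, so the alignment only starts once the upstream pipe has finished.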
The corresponding GET request at `/aligned` in `server.js`:
var bodyParser = require('body-parser');
app.use(bodyParser.urlencoded({extended: true}));
var sMsa = require('./streamMsa');
var propMatchRegex = sMsa.propMatchRegex;
var getProteinSeqs = sMsa.getProteinSeqs;
// e.g. /aligned?q=mbp1
app.get('/aligned', [
function (req, res, next) {
req.opts = {
query: req.query.q,
vars: {
species: []
},
filters: [
function(obj) {
// e.g. /^mbp1.*\[.*\]$/i)
var regex = new RegExp('^' + req.query.q + '.*\\[.*\\]$', 'i');
return propMatchRegex(obj, 'title', regex);
}
],
uniqueSpecies: true
};
next();
},
getProteinSeqs
]);
Here our "handler stack" is an array of functions which follow the signature `function (request, response, next)`. We attach `opts` to the `req` object so `getProteinSeqs` can retrieve them. See writing middleware for more info. Note that Express itself parses the URL query string into `req.query`; body-parser is middleware which parses request bodies (URL-encoded forms, or JSON from e.g. PUT requests) into `req.body`, which this GET route does not use.
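To make the `next` convention concrete, here is a toy dispatcher that walks a handler stack. It only illustrates the calling convention and is not how Express is implemented. `runStack`, `fakeReq` and `fakeRes` are all made up for this sketch.

```javascript
// Toy dispatcher: call each handler in turn, moving on only when it calls next().
function runStack(handlers, req, res) {
  var i = 0;
  function next(err) {
    if (err) return res.send({ error: String(err) });
    var handler = handlers[i++];
    if (handler) handler(req, res, next);
  }
  next();
}

// First handler attaches options, second one uses them
function attachOpts(req, res, next) {
  req.opts = { query: req.query.q, uniqueSpecies: true };
  next();
}

function respond(req, res, next) {
  res.send({ searched: req.opts.query });
}

// Hypothetical stand-ins for Express's request and response objects
var fakeReq = { query: { q: 'mbp1' } };
var fakeRes = {
  sent: null,
  send: function (body) { this.sent = body; }
};

runStack([attachOpts, respond], fakeReq, fakeRes);
console.log(fakeRes.sent);
```

If `attachOpts` forgot to call `next()`, `respond` would never run. In a real Express app the same mistake leaves the request hanging.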
I modularized `msa.js` a bit and added a little jQuery:
function runFetch() {
$.get('http://localhost:3000/aligned?q=' + encodeURIComponent($('#query').val())).then(function(data) {
createMSAViz(data.seqs);
});
}
$('#submit').on('click', function() {
msaDiv.innerHTML = 'Loading...';
runFetch();
});
Which needs this HTML:
<input type="text" id="query" placeholder="query">
<button id="submit">Go</button>
Now we have an actual MSA that takes a search query! It filters the results down to every title that starts with the query, then has anything, then `[species]`, and only takes unique species. Of course it would be nicer to provide all these options from the interface but this serves as a minimal example.
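One thing to watch in the `/aligned` route is that the query goes straight into `new RegExp(...)`. A query such as `a.b` or `(` would then be read as regex syntax (the latter throws). Below is a possible hardening, sketched with helper names of my own choosing, that escapes the query before building the same pattern.

```javascript
// Escape characters that have a special meaning inside a regular expression
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the same "query, anything, [organism]" pattern as the /aligned route,
// but treat the query as literal text
function titleRegex(query) {
  return new RegExp('^' + escapeRegExp(query) + '.*\\[.*\\]$', 'i');
}

console.log(titleRegex('mbp1').test('Mbp1p [Saccharomyces cerevisiae]'));
console.log(titleRegex('a.b').test('axb [Some organism]'));
```

The first check matches. The second does not, because the dot in the query is now treated as a literal dot rather than "any character".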