Streaming download pool for JS API #21

Open
max-mapper opened this issue Jan 18, 2017 · 1 comment

Comments

@max-mapper
Owner

It might make sense to do this as a separate module, and it might already exist, but I want this API:

// for example purposes assume urlStream is an object stream that emits a bunch of URLs as strings, or objects for request options
var nugget = require('nugget')
var pump = require('pump')
var downloader = nugget.createDownloadStream() // options could be the parallelism, and also you could pass defaults for request here as options.request or options.defaults maybe
pump(urlStream, downloader, function (err) {
  if (err) throw err
  console.log('done downloading')
})

Internally, createDownloadStream would start a configurably sized parallel queue (maybe powered by https://www.npmjs.com/package/run-parallel-limit). It would return a writable stream that you write URLs into.
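
Something like this could work as a sketch of the internals. It's hypothetical (none of it exists in nugget today): it uses parallel-transform for the bounded queue instead of run-parallel-limit, since parallel-transform accepts items as they stream in, and the parallelism/request options, the progress object, and the start/done events are made-up names:

var fs = require('fs')
var path = require('path')
var request = require('request')
var transform = require('parallel-transform')
var pump = require('pump')

function createDownloadStream (opts) {
  opts = opts || {}
  // parallel-transform gives us a writable/readable stream with a bounded worker pool
  var stream = transform(opts.parallelism || 5, function (item, cb) {
    if (typeof item === 'string') item = {url: item}
    var filename = item.filename || path.basename(item.url) // crude default: last path segment of the URL
    stream.progress.started++
    stream.emit('start', item.url)
    pump(request(item.url, opts.request), fs.createWriteStream(filename), function (err) {
      stream.progress.finished++
      stream.emit('done', item.url, err)
      cb(err, {url: item.url, filename: filename})
    })
  })
  stream.progress = {started: 0, finished: 0}
  return stream
}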

For every URL received, it should add it to the queue. It should emit events for when it starts and finishes each URL, as well as expose download progress through a static property/object somewhere on the createDownloadStream instance.
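
Usage of those events and the progress object could look something like this (again hypothetical, reusing the made-up names from the sketch above):

var nugget = require('nugget')

var downloader = nugget.createDownloadStream({parallelism: 10})
downloader.on('start', function (url) { console.log('GET', url) })
downloader.on('done', function (url, err) { console.log(err ? 'failed' : 'saved', url) })

// poll the exposed progress object, e.g. {started: 42, finished: 30}
setInterval(function () { console.log(downloader.progress) }, 1000).unref()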

For error handling, it should only destroy the stream with an error if the error is catastrophic. Maybe you could pass in a function that gets called with (err, resp, body) for each request, so you can handle the response yourself if you want?
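
One way that hook could look, with onResponse as a made-up option name:

var nugget = require('nugget')

var downloader = nugget.createDownloadStream({
  parallelism: 10,
  onResponse: function (err, resp, body) {
    // handle per-request failures here instead of destroying the whole stream
    if (err) return console.error('skipping:', err.message)
    if (resp.statusCode >= 400) console.error('bad status', resp.statusCode)
  }
})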

Finally, when it downloads, it should do it like nugget/wget, saving the resource to a file on disk. The file it saves to should be configurable in the object you write in as input. If you just write a single URL string as input, it should do what nugget does by default -- just use the HTTP filename.
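
So the two input shapes could look like this (hypothetical; the example.com URLs are placeholders):

var nugget = require('nugget')
var downloader = nugget.createDownloadStream()

// a bare string: the filename comes from the URL path, like nugget's default
downloader.write('https://example.com/files/report.pdf') // saves ./report.pdf
// an object: pick the target filename (and per-request options) explicitly
downloader.write({url: 'https://example.com/files/latest?dl=1', filename: 'report-v2.pdf'})
downloader.end()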

@max-mapper
Owner Author

Here's a proof of concept of the parallel stream/request portion:

var fs = require('fs')
var ndjson = require('ndjson')
var request = require('request')
var transform = require('parallel-transform')
var through = require('through2')

var PARALLEL = 1000

function getResponse (item, cb) {
  var r = request(item.url)
  r.on('error', function (err) {
    cb(err)
  })
  r.on('response', function (re) {
    // only the response metadata is needed, so abort as soon as the headers arrive
    cb(null, {url: item.url, date: new Date(), status: re.statusCode, headers: re.headers})
    r.abort()
  })
}

fs.createReadStream('./meta.json')
  .pipe(ndjson.parse())
  .pipe(through.obj(function (obj, enc, next) {
    // flatten each parsed line's resources array into individual items
    var self = this
    if (obj.resources) {
      obj.resources.forEach(function (r) {
        self.push(r)
      })
    }
    next()
  }))
  .pipe(transform(PARALLEL, getResponse))
  .pipe(ndjson.serialize())
  .pipe(process.stdout)
