This repository has been archived by the owner on Apr 2, 2019. It is now read-only.

SPIKE - Investigate Apache Beam / Google Cloud Dataflow as a runner for custom ingest code #11

Open
danxmoran opened this issue Sep 25, 2018 · 0 comments

@danxmoran
Contributor

When working on #2 and #4 I set up a dumb CLP wrapper with hacky caching, which I've been running through the sbt shell. This is fine for prototyping, but not good for the long run.

Apache Beam has been repeatedly mentioned as a data-processing framework which might be well-suited to the type of tasks run during data ingest.

  • Google provides Dataflow as a fully-managed runner for Beam, but it's not the only available solution.
  • Spotify maintains a Scala wrapper (scio) for the Java Beam APIs.

Porting the ENCODE metadata-processing code to scio and running it on Dataflow should give us a good idea of what it's like to use Beam.

@danxmoran danxmoran changed the title SPIKE - Investigate Apache Beam / Dataflow as a runner for custom ingest code SPIKE - Investigate Apache Beam / Google Dataflow as a runner for custom ingest code Sep 25, 2018
@danxmoran danxmoran changed the title SPIKE - Investigate Apache Beam / Google Dataflow as a runner for custom ingest code SPIKE - Investigate Apache Beam / Google Cloud Dataflow as a runner for custom ingest code Sep 25, 2018
@danxmoran danxmoran self-assigned this Oct 9, 2018