heartbeat at task level to identify failed tasks ?
sort should be lazy
- perhaps even learn buckets while sorting ... ? (i.e. single pass sort?)
the rx/tx rates aren't updated on sp-dev
only start gui on driver by default
Enhance dashboard with info from https://pypi.python.org/pypi/psutil
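- a minimal sketch of the kind of per-node stats psutil exposes; the node_stats name and dict layout below are illustrative only:

    import psutil

    def node_stats():
        # cpu / memory / io counters as plain dicts, ready to render on the dashboard
        return {
            'cpu_percent': psutil.cpu_percent(interval=None),
            'memory': psutil.virtual_memory()._asdict(),
            'swap': psutil.swap_memory()._asdict(),
            'disk_io': psutil.disk_io_counters()._asdict(),
            'net_io': psutil.net_io_counters()._asdict(),
        }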
Perhaps display skipped tasks more clearly
Display task locality
Limit resource usage from supervisor with: https://docs.python.org/3.4/library/resource.html
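- a minimal sketch, assuming the supervisor forks its children and can apply limits before handing off work; limit_child and the limit values are only examples:

    import resource

    def limit_child(max_address_space_mb=2048, max_open_files=1024):
        # lower the soft address-space limit; leave the hard limit untouched
        _, hard_as = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS,
                           (max_address_space_mb * 1024 * 1024, hard_as))
        # cap open file descriptors (hard limits can only be lowered by an unprivileged process)
        _, hard_no = resource.getrlimit(resource.RLIMIT_NOFILE)
        resource.setrlimit(resource.RLIMIT_NOFILE, (min(max_open_files, hard_no), hard_no))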
check the dependency graph for dependencies that go down/up the package hierarchy in the wrong places
- e.g. bndl.compute.schedule to .base
move caching aspect of datasets to one module (not spread over .schedule and .base)
move caches, broadcast values etc down to execute layer
use hierarchical namespace for set / del
Possibly introduce a control and data layer
- i.e. to allow sending large volumes out of band from the control comms
- this allows the watchdog to operate better as well
- should improve stability
Consider not using 1 fixed worker per core, but 1 worker per node which forks per task
- implement in execute layer
- Use custom pickler to fetch broadcast value on receive (to ensure the data is available before forking)
- a lazy alternative to the above would still require the broadcasted data to be available per executing process
- caching could work through caching the data in pickled/marshalled form
- although not having to unpickle when using cache was one of the benefits of not using pyspark
- an advantage would be automatic cleanup after jobs have run (less leakage)
! a major hurdle would be communication with e.g. Cassandra:
- connections can't be inherited by a forked process and still work :(
- would require performing the IO in the main process, and processing in a forked worker
! from the python docs:
- Note that safely forking a multithreaded process is problematic.
IDEA DROPPED:
- the two points above are prohibitive
- figure out an alternative memory efficient broadcast
Consider a communication architecture with a core cross-node network and an on-node inter-process network
- perhaps supplemented by (temporary) direct connections on a data layer
Make entire compute.dataset api asynchronous
- i.e. some_action() returns a future
- use some_action().result() to get the results (may be an iterable)
- use some_action().cancel() to cancel the job
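- a hedged sketch of how that could read; collection()/map()/count() are used as elsewhere in bndl, the future-returning behaviour is the proposal:

    dset = ctx.collection(range(1000)).map(lambda i: i * i)
    job = dset.count()    # proposal: returns a future instead of blocking
    total = job.result()  # block for the result (may be an iterable for other actions)
    # job.cancel() would abort the job instead of waiting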
Allow stages to be composed of an undetermined number of tasks
- i.e. allow a generator / queue of tasks
Allow jobs to be composed of an undetermined number of stages
- or put differently, give jobs the ability to yield barriers
The API / data model might change into a job which yields 1+ tasks and 0+ barriers.
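- a rough sketch of that model; Task and BARRIER below are hypothetical names, not bndl's current classes:

    from collections import namedtuple

    Task = namedtuple('Task', ['func', 'args'])
    BARRIER = object()

    def job(partitions):
        for part in partitions:                 # stage 1: task count not known up front
            yield Task(sum, (part,))
        yield BARRIER                           # barrier: wait for all tasks yielded so far
        yield Task(print, ('stage 1 done',))    # stage 2 can depend on stage 1's results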
Consider changing the task execution model from push to pull.
- would make it easier to keep a task in flight
- might be difficult with a generator / queue of tasks in combination with worker preferences
Implement check pointing
- perform cleanup of stages before the stage of the checkpointed dset
add pickle options to broadcast_pickle
Prevent occurrences of:
27158 : OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 4127376 max
27158 : OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
This is triggered by the import of numpy
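- a possible mitigation (not verified on these machines): cap OpenBLAS's thread pool before numpy is imported, so blas_thread_init doesn't exhaust RLIMIT_NPROC:

    import os
    os.environ.setdefault('OPENBLAS_NUM_THREADS', '1')
    os.environ.setdefault('OMP_NUM_THREADS', '1')

    import numpy  # import only after the environment is set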
support cassandra.coscan with _asdicts
try to implement read retry in cassandra scan without materializing an entire token range
implement drop function to drop columns / fields
- like pluck?
check partition sizes for cassandra scan
cassandra scan .parts() is not stable
consider driving the tasks from another process / do something about liveness of the driver during jobs
support caching also for cancelled jobs / stages / tasks
nodes are 'in error' very quickly after starting a job
(unresponsive due to task deserialization?)
allow for easier debugging by reporting the failed partition (or more)
add psize for ctx.range
add max pcount / psize for ctx.range / ctx.collection
why stage.tasks.sort(key=lambda t: t.id)? it costs time for a large number of tasks
add protocol version in hello
Support ctx.files for tar files
Support for ctx.files(split=str) for gzipped files
- gzip has seek, but no rfind
Crash task on driver side exceptions (e.g. in sending files which are read protected)
ctx.files.cleanup creates its own garbage in node.hosted_values
broadcast actual files without loading in the driver
cache_loc not filled when using first() ???
Whoops:
In [86]: texts.uncache()
ERROR:bndl.rmi.node:unable to perform remote invocation
Traceback (most recent call last):
File "/home/frens.jan.rumph/venv/lib/python3.4/site-packages/bndl/rmi/node.py", line 51, in _request
response = (yield from asyncio.wait_for(response_future, self._timeout, loop=self.peer.loop))
File "/usr/lib64/python3.4/asyncio/tasks.py", line 381, in wait_for
raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
ERROR:bndl.rmi.node:unable to perform remote invocation
Traceback (most recent call last):
File "/home/frens.jan.rumph/venv/lib/python3.4/site-packages/bndl/rmi/node.py", line 51, in _request
response = (yield from asyncio.wait_for(response_future, self._timeout, loop=self.peer.loop))
File "/usr/lib64/python3.4/asyncio/tasks.py", line 381, in wait_for
raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
ERROR:bndl.rmi.node:unable to perform remote invocation
Traceback (most recent call last):
File "/home/frens.jan.rumph/venv/lib/python3.4/site-packages/bndl/rmi/node.py", line 51, in _request
response = (yield from asyncio.wait_for(response_future, self._timeout, loop=self.peer.loop))
File "/usr/lib64/python3.4/asyncio/tasks.py", line 381, in wait_for
raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
WARNING:bndl.rmi.node:Response <Response exception=None, req_id=48, value=None> received for unknown request id 48
WARNING:bndl.rmi.node:Response <Response exception=None, req_id=48, value=None> received for unknown request id 48
WARNING:bndl.rmi.node:Response <Response exception=None, req_id=45, value=None> received for unknown request id 45
and another while nodes were reconnecting to driver:
KeyError: 'nl.tgho.priv.sp-prod-adg02.worker.32226.0.4'
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_serve() done, defined at /home/frens.jan.rumph/venv/lib/python3.4/site-packages/bndl/net/peer.py:262> exception=KeyError('nl.tgho.priv.sp-prod-adg02.worker.32226.0.4',)>
Traceback (most recent call last):
File "/usr/lib64/python3.4/asyncio/tasks.py", line 236, in _step
result = coro.send(value)
File "/home/frens.jan.rumph/venv/lib/python3.4/site-packages/bndl/net/peer.py", line 270, in _serve
yield from self.local._peer_connected(self)
File "/home/frens.jan.rumph/venv/lib/python3.4/site-packages/bndl/net/node.py", line 224, in _peer_connected
del self.peers[known_peer.name]
KeyError: 'nl.tgho.priv.sp-prod-adg02.worker.32226.0.4'
WARNING:bndl.net.watchdog:<Peer:
support more syntactic sugar for creating accumulators, e.g.:
acc = ctx.accumulator(set(), 'add')
def task(...):
    nonlocal acc
    acc.add(1)
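- a minimal sketch of what such sugar could boil down to; the Accumulator class below is illustrative, not bndl's implementation:

    class Accumulator:
        def __init__(self, initial, method):
            self.value = initial
            self._method = method           # name of the update method, e.g. 'add'

        def add(self, element):
            getattr(self.value, self._method)(element)

    acc = Accumulator(set(), 'add')
    acc.add(1)
    assert acc.value == {1}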
use local read if ctx.files on same node
Support shuffle with only a subset of the workers / use task dependencies determined at runtime
Implement cassandra.limit as
docs = ctx.cassandra_table('adg', 'document', contact_points='sp-prod-adg01')
In [27]: 10000 / sum(len(p.token_ranges) for p in docs.parts())
Out[27]: 7.027406886858749
In [28]: docs.limit(7).count(push_down=False)
Out[28]: 9957
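- a sketch of the heuristic implied by the snippet above: spread the limit over the scan by capping each token range at roughly limit / total token ranges (parts and token_ranges as used above; the helper name is illustrative):

    def per_range_limit(limit, parts):
        total_ranges = sum(len(p.token_ranges) for p in parts)
        return max(1, limit // total_ranges)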
Is span_by working for spanning by part of the primary key?
ctx.cassandra_table('adg_prod', 'authorship_features').select('doc_id', 'authorship_seq_no', 'affiliation_seq_no').span_by('doc_id', 'authorship_seq_no').take(10)
yields
[((83683000013, 1), Empty DataFrame
Columns: []
Index: [(83683000013, 1, 1)]), ((83683000027, 1), Empty DataFrame
Columns: []
Index: [(83683000027, 1, 1)]), ((83708200003, 1), Empty DataFrame
Columns: []
Index: [(83708200003, 1, 1)]), ((83708200003, 2), Empty DataFrame
Columns: []
Index: [(83708200003, 2, -1)]), ((83819800006, 1), Empty DataFrame
Columns: []
Index: [(83819800006, 1, 1)]), ((83819800006, 2), Empty DataFrame
Columns: []
Index: [(83819800006, 2, -1)]), ((83889400015, 1), Empty DataFrame
Columns: []
Index: [(83889400015, 1, 1)]), ((83889400015, 2), Empty DataFrame
Columns: []
Index: [(83889400015, 2, -1)]), ((83889400015, 3), Empty DataFrame
Columns: []
Index: [(83889400015, 3, -1)]), ((83889400015, 4), Empty DataFrame
Columns: []
Index: [(83889400015, 4, -1)])]
Shouldn't group_by_key and friends yield key, [value, ...] instead of key, [(key, value), ...] ?
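- plain-Python illustration of the two shapes in question:

    pairs = [('a', 1), ('a', 2), ('b', 3)]
    current  = ('a', [('a', 1), ('a', 2)])   # key, [(key, value), ...]
    proposed = ('a', [1, 2])                 # key, [value, ...]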
Initializers / zero for reduce, sum, etc.
span_by can be a lot more efficient! DataFrame.groupby is expensive
strip key from values from group_by_key
use default dict in CassandraCoScanPartition._materialize for merged
allow distribution and sort key in shuffle (see the sketch after this list)
- e.g. distribute by block_id and sort by (block_id, person_id)
- just like select col1, col2, col3 from table group by 1, 2
this is also important for cogroup, which currently has to materialize a partition into a list!
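- plain-Python sketch of the intended semantics (block_id is the distribution key, (block_id, person_id) the sort key):

    from itertools import groupby

    records = [{'block_id': 1, 'person_id': 7},
               {'block_id': 2, 'person_id': 5},
               {'block_id': 1, 'person_id': 3}]

    ordered = sorted(records, key=lambda r: (r['block_id'], r['person_id']))
    partitions = {block: list(group)
                  for block, group in groupby(ordered, key=lambda r: r['block_id'])}
    # cogroup could then stream each sorted group instead of materializing a partition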
add operators such as set difference (using shuffle)
the select of cassandra_table could check whether the column exists
- but take udfs into account?
why is tree aggregate so slow?
- is it the new shuffle?
the cache of the cassandra partitioner must take bndl_cassandra.part_size_keys and friends into account
investigate if we can do something better than yielding entire groups (from e.g. group_by_key) in a tuple
Reconsider shuffle implementation for keywise aggregation:
- Spark uses create + merge + combine ...
- probably cheaper than sorting ...
join_with_cassandra with a single element key doesn't work
(getter should return an iterable)
remote cache: read raw and decode on the reading worker if it was serialized in memory or on disk
Caching on multiple nodes
Checking group.executing_on in
<tr class="clickable {{ ' active' if group.executing_on else '' }}" data-href='group/{{ group.grouper }}'>
Compatibility with yappi 0.94
collect_as_files and friends -> save_as ... on worker
callsite for k-means doesn't work ... everything is a k-means iteration
implement k-means ||
implement k-means for sparse matrices
collect_as_files should check if directory exists before launching
shuffle is too slow ... also add a plain shuffle(n, sort=False)
Implement versions of itake (take and first) for cassandra coscan
Add a namedtuple-like format for bndl_cassandra which supports __getitem__ to access fields.
Add config parameter for temp dir
Add pluck style to key_by
- think about with_value (consts _are_ the value here, not a key)
Use key_or_getter with filter (and others?)
Add filterfalse (and friends?)
Add .pluck(attr='xxx')
Use paging state in bndl_cassandra scan
test require_local_workers and friends
re-add prefer_workers?
Require workers at job / dataset level
Add dataset.cache context manager (with block)
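- a minimal sketch, reusing the cache()/uncache() calls used elsewhere in this file; the cached() helper itself is hypothetical:

    from contextlib import contextmanager

    @contextmanager
    def cached(dataset):
        dataset.cache()
        try:
            yield dataset
        finally:
            dataset.uncache()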
Take locality into account for dataset.coalesce_parts(...)