[xray] Evict tasks from the lineage cache #2152
robertnishihara merged 6 commits into ray-project:master
…committed lineage
…testing for out-of-order operations
Test FAILed.
jenkins, retest this please
Test PASSed.
  } else {
    return false;
  }
}
Purely a style thing, but I think it'd be cleaner to have a single return statement and to do something like
bool task_present = it != subscribed_tasks_.end();
if (task_present) {
  ...
}
return task_present;
and similarly for the above block.
src/common/state/ray_config.h
Outdated
1000 feels pretty big to me
Hmm, what do you think is a good size?
// NOTE(swang): The number of entries in the uncommitted lineage also
// includes local tasks that haven't been committed yet, not just remote
// tasks, so this is an overestimate.
const auto uncommitted_lineage = GetUncommittedLineage(task_id);
This feels like an expensive operation, especially if we allow the uncommitted lineage to grow to size 1000. And this is called every time we forward a task, right? Will this be an issue?
Yes, it will be called every time we forward a task. It shouldn't be too expensive since it's just walking a tree, but I guess it's hard to say. Would it help if we added a timing statement to GetUncommittedLineage and logged a warning if it exceeds maybe 1ms?
Up to you about the timing statement, I assume it will be visible in the profiler if it's an issue.
Hmm okay, I think it's a good sanity check. I'll run this quickly locally and see how many tasks it takes to get to 1ms.
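The timing check discussed here could look roughly like the sketch below. `TimeAndWarn` is a hypothetical helper written for illustration, not something from the Ray codebase, and it logs to `std::cerr` rather than through Ray's logging macros. It runs a callable, measures wall-clock time with `std::chrono::steady_clock`, and prints a warning when the call exceeds a threshold.

```cpp
#include <chrono>
#include <iostream>
#include <utility>

// Hypothetical helper (not part of Ray): runs `fn`, measures how long it took,
// and logs a warning if the elapsed time exceeded `threshold`.
// Returns the elapsed time so callers can inspect it.
template <typename Fn>
std::chrono::microseconds TimeAndWarn(const char *label,
                                      std::chrono::microseconds threshold,
                                      Fn &&fn) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now() - start);
  if (elapsed > threshold) {
    std::cerr << "WARNING: " << label << " took " << elapsed.count()
              << "us (threshold " << threshold.count() << "us)\n";
  }
  return elapsed;
}
```

Under these assumptions, the call to `GetUncommittedLineage(task_id)` would be wrapped in a lambda and passed in with a threshold of `std::chrono::milliseconds(1)`, which converts implicitly to microseconds.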
// includes local tasks that haven't been committed yet, not just remote
// tasks, so this is an overestimate.
const auto uncommitted_lineage = GetUncommittedLineage(task_id);
if (uncommitted_lineage.GetEntries().size() > max_lineage_size_) {
What goes wrong if max_lineage_size_ = 0?
Hmm, I think that will basically just try to evict every single entry in the lineage, but the code itself should not break. The tradeoff is basically the lower max_lineage_size_ is, the more often a node will request notifications from the GCS.
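To make the threshold behavior concrete, here is a minimal sketch of the size check using a toy data model. `ToyLineage`, `ToyUncommittedLineage`, and `ShouldRequestCommitNotification` are hypothetical names invented for this illustration; the real `GetUncommittedLineage` and lineage types in Ray are more involved. The sketch assumes that only uncommitted tasks are stored, so an ID missing from the map is treated as already committed.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Toy model (assumption): each task ID maps to the IDs of its parent tasks.
// Only uncommitted tasks are stored; a missing ID means "already committed".
using ToyLineage = std::unordered_map<std::string, std::vector<std::string>>;

// Walks the ancestors of `task_id` and returns every uncommitted task reached,
// including `task_id` itself. A stand-in for GetUncommittedLineage.
std::unordered_set<std::string> ToyUncommittedLineage(
    const ToyLineage &lineage, const std::string &task_id) {
  std::unordered_set<std::string> visited;
  std::vector<std::string> stack{task_id};
  while (!stack.empty()) {
    const std::string current = stack.back();
    stack.pop_back();
    const auto it = lineage.find(current);
    if (it == lineage.end() || !visited.insert(current).second) {
      continue;  // Not cached (treated as committed) or already visited.
    }
    for (const auto &parent : it->second) {
      stack.push_back(parent);
    }
  }
  return visited;
}

// Mirrors the `size() > max_lineage_size_` comparison from the diff.
bool ShouldRequestCommitNotification(const ToyLineage &lineage,
                                     const std::string &task_id,
                                     std::size_t max_lineage_size) {
  return ToyUncommittedLineage(lineage, task_id).size() > max_lineage_size;
}
```

With a limit of 0, any task that is still in the toy cache exceeds the threshold, which matches the description above: nothing breaks, but a notification is requested for every forwarded task.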
I think this can be merged when the tests pass. I think for the timing statement either way is ok.
Test FAILed.
Okay, we can leave out the timing statement for now, but after a quick test,
Test PASSed.
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)
What do these changes do?
This improves the eviction algorithm for the lineage cache. Changes:
- … COMMITTED status from the lineage cache. These are tasks that have been committed in the GCS. Instead of keeping these entries around for an indefinite time, tasks are now evicted from the cache as soon as a notification for their commit is received.
- … REMOTE and must be evicted. This change periodically evicts these remote tasks by requesting a notification for their commit after the uncommitted lineage exceeds the configured size.
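The evict-on-commit behavior described in this summary can be sketched as a toy cache. Everything here is hypothetical and simplified for illustration: `ToyLineageCache`, `ToyStatus`, and the method names are not Ray's actual classes, and the GCS is reduced to a caller invoking `HandleCommit` directly.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical status values for the sketch; Ray's real enum differs.
enum class ToyStatus { LOCAL, REMOTE };

// Toy lineage cache: holds only uncommitted tasks and evicts a task as soon as
// its commit notification arrives.
class ToyLineageCache {
 public:
  void Add(const std::string &task_id, ToyStatus status) {
    entries_[task_id] = status;
  }

  // Subscribes to the commit notification for `task_id`.
  // Returns true if this created a new subscription.
  bool RequestNotification(const std::string &task_id) {
    return subscribed_tasks_.insert(task_id).second;
  }

  // Called when the (simulated) GCS reports that `task_id` was committed.
  // Drops the subscription and evicts the entry. Returns true if an entry
  // was actually evicted.
  bool HandleCommit(const std::string &task_id) {
    subscribed_tasks_.erase(task_id);
    return entries_.erase(task_id) > 0;
  }

  bool Contains(const std::string &task_id) const {
    return entries_.count(task_id) > 0;
  }

  std::size_t Size() const { return entries_.size(); }

 private:
  std::unordered_map<std::string, ToyStatus> entries_;
  std::unordered_set<std::string> subscribed_tasks_;
};
```

In this sketch a remote task stays cached until `HandleCommit` is called for it, so a node that never requests the notification would keep the entry indefinitely. That is the situation the size-triggered request is meant to avoid.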