
Conversation

@ericl
Contributor

@ericl ericl commented Dec 8, 2018

What do these changes do?

This adds an experimental redis_max_memory flag that bounds the redis memory used per data shard. Note that this only applies to the non-primary shards, which store the majority of the task and object metadata, so data such as client metadata is never evicted.

Since profiling data has a nested structure and cannot be LRU-evicted, this also makes profiling controlled by a collect_profiling_data flag and disables it when redis_max_memory is set.
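
For context, a minimal usage sketch of the new flag (the ray.init keyword below is an assumption about how the flag is exposed; on a cluster it would instead be passed as --redis-max-memory to ray start, as discussed later in this thread):

import ray

# Hypothetical single-node example: cap each non-primary redis shard at
# roughly 500 MB. The primary shard (client metadata, etc.) is never
# evicted, so only task and object metadata is subject to LRU eviction.
ray.init(redis_max_memory=500 * 1024 * 1024)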

Analysis of redis's approximate LRU eviction algorithm: https://github.com/antirez/redis/blob/a2131f907a752e62c78ea6bb719daf9fe2f91402/src/evict.c#L118

We use maxmemory_samples 10. There is also a persisted eviction pool of 16 entries. This effectively gives us 26 tries per eviction to hit an old key (lower bound). Let's assume the most recent 30% of keys are required for stable operation and that we evict at 10000 QPS. Then:

>>> p_no_bad_eviction = 1 - 0.3**26
0.9999999999999746
>>> p_no_bad_eviction_year = p_no_bad_eviction**(10000*60*60*24*365)
0.9920143099318832

So a lower bound on reliability with approximate LRU eviction is about 99% per year. The actual reliability will be much higher, of course, since it's unlikely we need even 30% of the metadata, and the eviction pool is persisted over time.
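
For reference, the same back-of-envelope estimate as a standalone script (the 30% hot-key fraction and the 10000 QPS eviction rate are the assumptions stated above, not measured values):

# Lower bound on the probability that approximate LRU never evicts a
# recently used ("hot") key over a year of continuous eviction.
samples_per_eviction = 10 + 16   # maxmemory_samples plus the persisted eviction pool
hot_fraction = 0.3               # assume the most recent 30% of keys must survive
evictions_per_second = 10000     # assumed eviction rate

p_no_bad_eviction = 1 - hot_fraction ** samples_per_eviction
seconds_per_year = 60 * 60 * 24 * 365
p_no_bad_eviction_year = p_no_bad_eviction ** (evictions_per_second * seconds_per_year)

print(p_no_bad_eviction)       # ~0.9999999999999746
print(p_no_bad_eviction_year)  # ~0.992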

TODO:

  • stress test with long-running Ape-X cluster

Related issue number

#3306
#954
#3452

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9851/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9849/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9852/
Test FAILed.

// The task was not in the GCS task table. It must therefore be in the
// lineage cache.
RAY_CHECK(lineage_cache_.ContainsTask(task_id));
RAY_CHECK(lineage_cache_.ContainsTask(task_id))
Contributor Author


This seems to be the narrow waist at which we access evicted lineage, though there could be other sites I'm missing.

handler_warning_timeout_ms_(100),
heartbeat_timeout_milliseconds_(100),
num_heartbeats_timeout_(100),
num_heartbeats_timeout_(300),
Contributor Author


Raising this to 30s (100ms heartbeats × 300 missed heartbeats) since 10s is too easy to hit with random pauses (e.g., a forking process takes a long time, or the kernel stalls compacting hugepages).

Collaborator


Sounds good. It's possible that some of the tests are currently waiting for the full 10s, in which case that will become really slow. If that's the case and we observe that, then we can configure this parameter specifically in those tests.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9856/
Test FAILed.

@ericl
Contributor Author

ericl commented Dec 8, 2018

Checked, and Ape-X seems to be stable with an aggressive 500MB redis memory limit.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9862/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9864/
Test FAILed.

Contributor

@pcmoritz pcmoritz left a comment


This is working as intended for me.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9894/
Test FAILed.


# Check that we get warning messages for both raylets.
wait_for_errors(ray_constants.REMOVED_NODE_ERROR, 2, timeout=20)
wait_for_errors(ray_constants.REMOVED_NODE_ERROR, 2, timeout=40)
Collaborator


Ah for this test, can we actually do

internal_config=json.dumps({"num_heartbeats_timeout": 40}) or something like that?

Contributor Author


Done
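
For reference, a sketch of what that suggestion might look like in the test. Only the JSON payload comes from the comment above; ray.worker._init and the _internal_config keyword are assumptions about the test harness of that era, not part of this diff:

import json
import ray

# Hypothetical: shorten the node-removal timeout for this test only, so
# wait_for_errors does not have to wait out the new 30s default.
ray.worker._init(
    start_ray_local=True,
    num_local_schedulers=2,
    _internal_config=json.dumps({"num_heartbeats_timeout": 40}))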

default=None,
type=int,
help="--redis-max-memory to pass to Ray."
" This only has an affect in local mode.")
Contributor


Hi @ericl What does "local mode" mean? Does this work in multi-node mode?

Contributor Author


It just means that when using a cluster, you need to pass --redis-max-memory to ray start rather than to train.py.

@llan-ml
Contributor

llan-ml commented Dec 9, 2018

For now, the memory flush policy does not support multiple redis shards. I notice that this PR can limit the memory used by each redis shard.

If this PR is merged, does it mean we no longer need redis memory flushing, at least to some extent?

@ericl
Contributor Author

ericl commented Dec 9, 2018

@llan-ml that's right, this should supersede redis flushing. I updated the doc page to remove the old flushing documentation.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9895/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9896/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9897/
Test FAILed.
