This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

[PPL] Add Resource monitor to avoid OOM #533

Merged

Conversation

Contributor

@penghuo penghuo commented Jun 25, 2020

Description of changes:

  1. Add the Settings.
  2. Add the ResourceMonitor.

Query Engine Memory Circuit Breaker [WIP]

1.Overview

1.1.Problem Statements

In the current Query Engine, query execution can be divided into two major stages: the plan stage and the execution stage. The plan stage does not load any data, so its memory usage can be ignored. At the execution stage, the physical execution plan is executed on a single node. The physical execution plan is composed of a chain of iterators. There are two types of iterator: the in-memory physical iterator and the Elasticsearch query iterator. The Elasticsearch query iterator submits the query request to the Elasticsearch cluster and then passes the response to the in-memory physical iterators for further processing. The in-memory physical iterators process events in heap memory. For example, the physical execution plan could be:
InMemSortIterator → InMemAggIterator → InMemAggIterator → ElasticsearchIndexScanIterator.
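To make the heap cost of these in-memory operators concrete, here is a minimal sketch of such an iterator chain; the interface and operator names are illustrative, not the engine's actual classes.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    // Illustrative only: a physical operator is just an iterator over rows.
    interface PhysicalOperator<T> extends Iterator<T> {
    }

    // An in-memory sort must buffer the child's entire output on the heap before
    // emitting the first row, which is where the OutOfMemoryError risk comes from.
    class InMemSortOperator<T> implements PhysicalOperator<T> {
      private final Iterator<T> sorted;

      InMemSortOperator(Iterator<T> child, Comparator<T> comparator) {
        List<T> buffer = new ArrayList<>();
        child.forEachRemaining(buffer::add);   // whole child output held in heap
        buffer.sort(comparator);
        sorted = buffer.iterator();
      }

      @Override
      public boolean hasNext() {
        return sorted.hasNext();
      }

      @Override
      public T next() {
        return sorted.next();
      }
    }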

  • Making Elasticsearch resilient to failure is our top consideration. With these in-memory operators in place, a query could consume a huge amount of heap memory and eventually drive Elasticsearch into an OutOfMemoryError.
  • Reject/cancel expensive queries based on configuration. Some queries, for example sort and aggregation, are expensive to execute. Instead of rejecting or cancelling queries randomly, we should give users the ability to cancel expensive queries by setting a per-query resource limit.

1.2.Estimating the per-query memory usage is hard

First, we should define what the query memory usage is. As described in the problem statements section, the plan-stage memory usage can be ignored because that stage does not process data, so we focus on the memory usage during the query execution stage. Both the in-memory processing and the Elasticsearch query use heap memory, so we can define the per-query memory usage as the sum of the memory usage of the in-memory processing and of the Elasticsearch query, denoted as **PerQueryMemUsage = SUM(InMemOpMemUsage + ESQueryMemUsage)**.
Second, we should define how to calculate InMemOpMemUsage and ESQueryMemUsage; let's discuss them separately. InMemOpMemUsage could be calculated with Lucene's RamUsageEstimator, which estimates the size (memory representation) of Java objects. ESQueryMemUsage, however, is estimated by the Elasticsearch internal implementation, which is not exposed to the client (TODO: confirm this against the Elasticsearch source code).
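For InMemOpMemUsage, a rough per-object estimate based on RamUsageEstimator could look like the sketch below; the buffered array and wrapper object are illustrative stand-ins for whatever an in-memory operator actually holds.

    import org.apache.lucene.util.RamUsageEstimator;

    // Illustration only: estimating the heap footprint of data buffered by an in-memory operator.
    public class RowSizeEstimateExample {
      public static void main(String[] args) {
        long[] columnValues = new long[128];

        // sizeOf(long[]) returns the array's heap size, including object header and padding.
        long arrayBytes = RamUsageEstimator.sizeOf(columnValues);

        // shallowSizeOf(Object) counts only the object itself, not anything it references.
        long wrapperBytes = RamUsageEstimator.shallowSizeOf(new Object());

        System.out.println(RamUsageEstimator.humanReadableUnits(arrayBytes + wrapperBytes));
      }
    }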
Overall, the lack of a method to calculate ESQueryMemUsage makes the PerQueryMemUsage estimation hard.

1.3.Do we need a per-query memory usage limit?

To answer this question, we can explore some use cases. One real use case is described here. By using the per-request circuit breaker, the user can fast-fail an expensive query without crashing the entire cluster.

Configure the request memory circuit breakers so individual queries have capped memory usage, by setting indices.breaker.request.limit to 40% and indices.breaker.request.overhead to 2. The reason we want to set the indices.breaker.request.limit to 40% is that the parent circuit breaker indices.breaker.total.limit defaults to 70%, and we want to make sure the request circuit breaker trips before the total circuit breaker. Tripping the request limit before the total limit means Elasticsearch would log the request stack trace and the problematic query. Even though this stack trace is viewable by AWS support, it's still helpful for them to debug. Note that by configuring the circuit breakers this way, it means aggregate queries that take up more memory than 12.8GB (40% * 32GB) would fail, but we are willing to take Kibana error messages over silently crashing the entire cluster any day.

As explained in the previous section, because we cannot estimate PerQueryMemUsage, a per-query memory usage limit would not be accurate. Could we avoid this problem? Do we have a workaround to protect the cluster from crashing?
Going back to the use cases: does it make sense to limit expensive queries, which are usually the most useful ones for data analysis?
Instead of rejecting expensive queries, we could reject queries randomly and add backoff retries for memory allocation.

1.4.Summary

In this document, we are trying to find a way to prevent the in-memory operators from driving Elasticsearch into an OutOfMemoryError when handling query requests.

1.2.Tenets

We follow these tenets when designing the feature:

  • Fast-fail the request before it has a larger negative impact on the node.
  • The solution can be configured by the customer based on their use case.
  • The metrics should be exposed to the customer for analysis.

1.2.Requirement

1.2.1.Functional Requirements

  • The Query Engine should use a limited amount of memory, and the limit should be configurable. (P0)
  • The Query Engine should reject new requests when the allocated resources are already exhausted. (P0)
  • The Query Engine should fail an executing request if it requires more memory than allocated. (P0)
  • The Query Engine should limit the per-request memory usage, and the limit should be configurable. (P1)

1.2.2.Insight and Operational Requirements

  • The failed query request should be logged for further analysis.
  • The memory utilization metrics should be reported. (TBD, what metrics should be reported)

1.3.Out of scope

1.3.1.Physical plan optimization

The goal of physical plan optimization is to find a plan that runs the query by taking advantage of the capabilities of Elasticsearch and Lucene. The optimization does not guarantee that no in-memory operator exists; in reality, that is impossible. If there are in-memory operators in the plan, what we can do is prevent them from making Elasticsearch run out of memory.

1.3.2.In memory operator optimization

There are certainly solutions that could reduce the memory consumption of each in-memory operator; for example, we could use file-based sorting. This document does not cover that topic in detail. The discussion here assumes the worst-case scenario: a better algorithm could give the query engine better performance, but it does not affect the circuit breaker solution described here.

1.4.Measure of success

1.4.1.Stability

Under a given stress test, the test cluster should have no OOM.

1.4.2.Error rate linear to load

Under a given load test, the request error rate should increase linearly with load.

3.Solution

The following diagram explains the relationship between the components in the system.
[Image: Screen Shot 2020-06-25 at 10.19.49 AM.png]

Resource Monitor

ResourceMonitor has only one interface, isHealthy, which is used to monitor resource health. Each storage engine should have its own implementation for monitoring its resources; e.g., ElasticsearchResourceMonitor provides the implementation for Elasticsearch.
Internally, ElasticsearchResourceMonitor only monitors real memory usage for now. The algorithm is shown below; a code sketch follows the retry configuration.
[Image: Screen Shot 2020-06-24 at 9.59.33 AM.png]

  • memUsage is calculated as Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory().
  • shouldFastFail can fast-fail some queries randomly.
  • shouldRetry is defined by the retry policy: it retries up to 3 times with exponential random backoff and a 1s initial interval.
    // Retry policy built with resilience4j (requires io.github.resilience4j.retry.RetryConfig
    // and io.github.resilience4j.core.IntervalFunction): retry up to 3 attempts with
    // exponential random backoff, but skip retries for fast failures.
    RetryConfig config =
        RetryConfig.custom()
            .maxAttempts(3)
            .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(1000))
            .retryExceptions(ElasticsearchMemoryMonitor.MemoryUsageExceedException.class)
            .ignoreExceptions(
                ElasticsearchMemoryMonitor.MemoryUsageExceedFastFailureException.class)
            .build();
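Putting the pieces above together, a minimal sketch of the memory check could look like the following; it mirrors the idea of ElasticsearchResourceMonitor but is illustrative rather than the exact code in this PR.

    import java.util.Random;

    // Illustrative sketch: healthy while heap usage stays under the configured limit;
    // under pressure, a fraction of queries is fast-failed at random and the rest rely
    // on the resilience4j retry policy shown above.
    public class MemoryHealthCheckSketch {
      private final long limitBytes;
      private final Random random = new Random();

      public MemoryHealthCheckSketch(long limitBytes) {
        this.limitBytes = limitBytes;
      }

      public boolean isHealthy() {
        long memUsage = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        return memUsage < limitBytes;
      }

      // Randomly fast-fail some queries so that not every caller keeps retrying.
      public boolean shouldFastFail() {
        return random.nextInt(4) == 0;   // illustrative 25% probability
      }
    }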

Setting

Users can use the opendistro.ppl.query.memory_limit setting to configure the memory usage limit. The default value is 85%.
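For illustration only, the default 85% limit translates into a byte threshold against the JVM heap roughly as follows; this is a sketch, not the plugin's actual code.

    // Sketch: how an 85% memory limit maps to a byte threshold on the JVM heap.
    public class MemoryLimitSketch {
      public static void main(String[] args) {
        double memoryLimit = 0.85;                              // default limit
        long maxHeapBytes = Runtime.getRuntime().maxMemory();   // -Xmx
        long thresholdBytes = (long) (maxHeapBytes * memoryLimit);
        long usedBytes = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        System.out.printf("used=%d threshold=%d healthy=%b%n",
            usedBytes, thresholdBytes, usedBytes < thresholdBytes);
      }
    }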

Appendix

1.Circuit Breaker in Elasticsearch

Elasticsearch has circuit breaker settings organized in a hierarchical structure.

Each breaker specifies a limit for how much memory it can use. Additionally, there is a parent-level breaker that specifies the total amount of memory that can be used across all breakers.

For example, the request circuit breaker allows Elasticsearch to prevent per-request data structures (for example, memory used for calculating aggregations during a request) from exceeding a certain amount of memory. The parent-level total limit is configured with indices.breaker.total.limit.
Since Elasticsearch 7.0.0, the real memory circuit breaker (https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker) determines whether the parent breaker should take real memory usage into account.

2.Circuit Breaker in CrateDB

CrateDB integrates with Elasticsearch's CircuitBreaker architecture by creating its own CrateCircuitBreakerService, and it exposes similar settings to clients.
In a nutshell, Crate uses Lucene's RamUsageEstimator to estimate the RAM usage of a row. Crate creates a different SizeEstimator for each column type. Furthermore, Crate adds optimizations on top of this, such as SamplingSizeEstimator.
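As an illustration of that approach (these are not CrateDB's actual classes), a per-column-type estimator could be sketched like this:

    import org.apache.lucene.util.RamUsageEstimator;

    // Illustrative per-column-type size estimators, in the spirit of Crate's SizeEstimator.
    interface ColumnSizeEstimator<T> {
      long estimateBytes(T value);
    }

    class LongColumnSizeEstimator implements ColumnSizeEstimator<Long> {
      @Override
      public long estimateBytes(Long value) {
        return RamUsageEstimator.shallowSizeOfInstance(Long.class);
      }
    }

    class StringColumnSizeEstimator implements ColumnSizeEstimator<String> {
      @Override
      public long estimateBytes(String value) {
        // String object header plus its backing character data, roughly.
        return RamUsageEstimator.shallowSizeOfInstance(String.class)
            + RamUsageEstimator.sizeOf(value.toCharArray());
      }
    }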

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dai-chen
Member

Just a general question for the design, how about we consider resource monitor as part of execution engine? I'm thinking in this case query engine and storage engine can be unaware of this, since this is essentially the detail of execution. If this makes sense, we can visit plan passed from planner and wrap operator that we feel may consume lots of resource, such as table scan.

Member

@dai-chen dai-chen left a comment


Thanks for the refactoring. LGTM!

Member

@chloe-zh chloe-zh left a comment


LGTM, thanks for adding the feature!

@penghuo penghuo merged commit df16ffe into opendistro-for-elasticsearch:develop Jun 29, 2020
@penghuo penghuo deleted the resource-monitor branch June 30, 2020 02:27