
Circuit breaker grows indefinitely when >2GiB of mget is issued (and possibly at other places as well) #27525


Description

@bra-fsn

https://discuss.elastic.co/t/circuit-breaker-always-trips/109067

Elasticsearch version (bin/elasticsearch --version):
Version: 5.6.4, Build: 8bbedf5/2017-10-31T18:55:38.105Z, JVM: 1.8.0_144

Plugins installed: [analysis-icu]

JVM version (java -version):
openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)

OS version (uname -a if on a Unix-like system):
FreeBSD fe 11.1-STABLE FreeBSD 11.1-STABLE #0 r324684: Tue Oct 17 15:07:45 CEST 2017 root@builder:/usr/obj/usr/src/sys/GENERIC amd64

Description of the problem including expected versus actual behavior:
The circuit breakers' size grows constantly after a short period of uptime. For now this happens only on two machines, which may be because of replication.
After the limit is reached, even a plain
curl http://localhost:9200/ fails with:

{
   "error":{
      "root_cause":[
         {
            "type":"circuit_breaking_exception",
            "reason":"[parent] Data too large, data for [<http_request>] would be [13610582016/12.6gb], which is larger than the limit of [11885484441/11gb]",
            "bytes_wanted":13610582016,
            "bytes_limit":11885484441
         }
      ],
      "type":"circuit_breaking_exception",
      "reason":"[parent] Data too large, data for [<http_request>] would be [13610582016/12.6gb], which is larger than the limit of [11885484441/11gb]",
      "bytes_wanted":13610582016,
      "bytes_limit":11885484441
   },
   "status":503
}
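The growth itself can be watched with the nodes stats breaker API; a minimal check (standard endpoint, the choice of counters to watch is mine):

curl 'http://localhost:9200/_nodes/stats/breaker?pretty'

On the affected nodes, "estimated_size_in_bytes" for the request and parent breakers keeps climbing, while the other nodes stay at normal levels.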

With the default configuration, the cluster remains operational for some time. Once the request breaker limit is reached, all shards residing on the two failing machines become essentially unavailable.
After some time the failing nodes get dropped from the cluster and reconnect, but the cluster can't heal automatically.
When I raise the breakers' limit to 2^63-1, the cluster remains operational, but the breaker size grows indefinitely (it grew by around 160 GiB in 8 hours). The settings call used for that is sketched below.
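For reference, raising the limits was done with a cluster settings update along these lines (the exact value and the use of a transient rather than persistent setting are illustrative, not a recommendation):

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.breaker.total.limit": "9223372036854775807b",
    "indices.breaker.request.limit": "9223372036854775807b"
  }
}'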

Steps to reproduce:
It is 100% reproducible on our cluster. More hints below.
I need help (maybe a debug build) to figure out what causes it.

Provide logs (if relevant):
I guess the root cause is a multiget that is too big and fails. It may be that this exception is not handled well, so the 2 GiB already accounted for the response remains in the circuit breaker counter.
It would also be nice to log at least the mget doc _ids along with the following exception, which would make it easier to find out which docs are causing the problem.
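For context, the failing requests are ordinary multigets of this shape (index, type and ids below are made up; the real requests ask for enough large documents that the combined shard response exceeds 2 GiB):

curl -XPOST 'http://localhost:9200/_mget' -H 'Content-Type: application/json' -d '
{
  "docs": [
    { "_index": "myindex", "_type": "doc", "_id": "1" },
    { "_index": "myindex", "_type": "doc", "_id": "2" }
  ]
}'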

[2017-11-25T08:06:18,532][DEBUG][o.e.a.g.TransportShardMultiGetAction] [fe00] null: failed to execute [org.elasticsearch.action.get.MultiGetShardRequest@165b2817]
org.elasticsearch.transport.RemoteTransportException: [fe32][10.6.145.237:9300][indices:data/read/mget[shard][s]]
Caused by: java.lang.IllegalArgumentException: ReleasableBytesStreamOutput cannot hold more than 2GB of data
        at org.elasticsearch.common.io.stream.BytesStreamOutput.ensureCapacity(BytesStreamOutput.java:155) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.stream.ReleasableBytesStreamOutput.ensureCapacity(ReleasableBytesStreamOutput.java:69) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.stream.BytesStreamOutput.writeBytes(BytesStreamOutput.java:89) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.Streams$FlushOnCloseOutputStream.writeBytes(Streams.java:266) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:406) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.bytes.BytesReference.writeTo(BytesReference.java:68) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.stream.StreamOutput.writeBytesReference(StreamOutput.java:150) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.index.get.GetResult.writeTo(GetResult.java:365) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.action.get.GetResponse.writeTo(GetResponse.java:201) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.action.get.MultiGetShardResponse.writeTo(MultiGetShardResponse.java:89) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1243) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1199) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1178) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:60) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:295) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:287) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1553) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.6.4.jar:5.6.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
