
Circuit breaker grows indefinitely when >2GiB of mget is issued (and possibly at other places as well) #27525


Description

@bra-fsn

https://discuss.elastic.co/t/circuit-breaker-always-trips/109067

Elasticsearch version (bin/elasticsearch --version):
Version: 5.6.4, Build: 8bbedf5/2017-10-31T18:55:38.105Z, JVM: 1.8.0_144

Plugins installed: [analysis-icu]

JVM version (java -version):
openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)

OS version (uname -a if on a Unix-like system):
FreeBSD fe 11.1-STABLE FreeBSD 11.1-STABLE #0 r324684: Tue Oct 17 15:07:45 CEST 2017 root@builder:/usr/obj/usr/src/sys/GENERIC amd64

Description of the problem including expected versus actual behavior:
The circuit breakers' size grows constantly after a short period of uptime. For now this happens only on two machines, which may be because of replication.
After the limit is reached, even a plain
curl http://localhost:9200/ fails with:

{
   "error":{
      "root_cause":[
         {
            "type":"circuit_breaking_exception",
            "reason":"[parent] Data too large, data for [<http_request>] would be [13610582016/12.6gb], which is larger than the limit of [11885484441/11gb]",
            "bytes_wanted":13610582016,
            "bytes_limit":11885484441
         }
      ],
      "type":"circuit_breaking_exception",
      "reason":"[parent] Data too large, data for [<http_request>] would be [13610582016/12.6gb], which is larger than the limit of [11885484441/11gb]",
      "bytes_wanted":13610582016,
      "bytes_limit":11885484441
   },
   "status":503
}
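The growth itself can be watched with the nodes stats breaker API; a minimal check (standard endpoint, the choice of counters to watch is mine):

curl 'http://localhost:9200/_nodes/stats/breaker?pretty'

On the affected nodes, "estimated_size_in_bytes" for the request and parent breakers keeps climbing, while the other nodes stay at normal levels.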

With the default configuration, the cluster remains operational for some time. Once the request breaker limit is reached, all shards residing on the two failing machines become essentially unavailable.
After some time the failing nodes get dropped from the cluster and reconnect, but the cluster can't heal automatically.
When I raise the breakers' limit to 2^63-1, the cluster remains operational, but the breaker size grows indefinitely (it grew by around 160 GiB in 8 hours). The settings call used for that is sketched below.
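For reference, raising the limits was done with a cluster settings update along these lines (the exact value and the use of a transient rather than persistent setting are illustrative, not a recommendation):

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.breaker.total.limit": "9223372036854775807b",
    "indices.breaker.request.limit": "9223372036854775807b"
  }
}'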

Steps to reproduce:
It is 100% reproducible on our cluster. More hints below.
I need help (maybe a debug build) to figure out what causes it.

Provide logs (if relevant):
I guess the root cause is a multiget that is too big and fails. It may be that this exception is not handled well, so the 2 GiB already accounted for the response remains in the circuit breaker counter.
It would also be nice to log at least the mget doc _ids along with the following exception, which would make it easier to find out which docs are causing the problem.
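For context, the failing requests are ordinary multigets of this shape (index, type and ids below are made up; the real requests ask for enough large documents that the combined shard response exceeds 2 GiB):

curl -XPOST 'http://localhost:9200/_mget' -H 'Content-Type: application/json' -d '
{
  "docs": [
    { "_index": "myindex", "_type": "doc", "_id": "1" },
    { "_index": "myindex", "_type": "doc", "_id": "2" }
  ]
}'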

[2017-11-25T08:06:18,532][DEBUG][o.e.a.g.TransportShardMultiGetAction] [fe00] null: failed to execute [org.elasticsearch.action.get.MultiGetShardRequest@165b2817]
org.elasticsearch.transport.RemoteTransportException: [fe32][10.6.145.237:9300][indices:data/read/mget[shard][s]]
Caused by: java.lang.IllegalArgumentException: ReleasableBytesStreamOutput cannot hold more than 2GB of data
        at org.elasticsearch.common.io.stream.BytesStreamOutput.ensureCapacity(BytesStreamOutput.java:155) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.stream.ReleasableBytesStreamOutput.ensureCapacity(ReleasableBytesStreamOutput.java:69) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.stream.BytesStreamOutput.writeBytes(BytesStreamOutput.java:89) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.Streams$FlushOnCloseOutputStream.writeBytes(Streams.java:266) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:406) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.bytes.BytesReference.writeTo(BytesReference.java:68) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.io.stream.StreamOutput.writeBytesReference(StreamOutput.java:150) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.index.get.GetResult.writeTo(GetResult.java:365) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.action.get.GetResponse.writeTo(GetResponse.java:201) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.action.get.MultiGetShardResponse.writeTo(MultiGetShardResponse.java:89) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1243) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1199) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1178) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:60) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:295) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:287) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1553) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.6.4.jar:5.6.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.6.4.jar:5.6.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
