Description
https://discuss.elastic.co/t/circuit-breaker-always-trips/109067
Elasticsearch version (bin/elasticsearch --version):
Version: 5.6.4, Build: 8bbedf5/2017-10-31T18:55:38.105Z, JVM: 1.8.0_144
Plugins installed: [analysis-icu]
JVM version (java -version):
openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
OS version (uname -a if on a Unix-like system):
FreeBSD fe 11.1-STABLE FreeBSD 11.1-STABLE #0 r324684: Tue Oct 17 15:07:45 CEST 2017 root@builder:/usr/obj/usr/src/sys/GENERIC amd64
Description of the problem including expected versus actual behavior:
The circuit breakers' size grows constantly after a short period of uptime. This happens (for now) only on two machines, which may be because of replication.
After the limit is reached, even a curl http://localhost:9200/ fails with:
{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [13610582016/12.6gb], which is larger than the limit of [11885484441/11gb]",
        "bytes_wanted": 13610582016,
        "bytes_limit": 11885484441
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [13610582016/12.6gb], which is larger than the limit of [11885484441/11gb]",
    "bytes_wanted": 13610582016,
    "bytes_limit": 11885484441
  },
  "status": 503
}
With the default configuration, the cluster remains operational for some time. When it reaches the request breaker limit, all shards residing on the two failing machines become essentially unavailable.
After some time the failing nodes drop out and reconnect, but the cluster can't heal automatically.
When I raise the breakers' limit to 2^63-1, the cluster remains operational, but the breaker size grows indefinitely (by around 160 GiB in 8 hours).
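For reference, the growth is easy to watch with the standard nodes stats breaker metric; a minimal sketch, nothing here is specific to our setup:
curl -s 'http://localhost:9200/_nodes/stats/breaker?pretty'
It is the request breaker's estimated_size_in_bytes that keeps climbing on the two affected nodes.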
Steps to reproduce:
It is 100% reproducible on our cluster. More hints below.
I need help (maybe a debug build) to figure out what causes it.
Provide logs (if relevant):
My guess is that the root cause is an overly large multi-get that fails (a sketch of that kind of request follows after the trace below). It may be that this exception is not handled properly, so the ~2 GiB stays in the circuit breaker counter.
It would also be pretty nice to log at least the mget doc _ids along with the following exception, to make it easier to find out which docs cause the problem.
[2017-11-25T08:06:18,532][DEBUG][o.e.a.g.TransportShardMultiGetAction] [fe00] null: failed to execute [org.elasticsearch.action.get.MultiGetShardRequest@165b2817]
org.elasticsearch.transport.RemoteTransportException: [fe32][10.6.145.237:9300][indices:data/read/mget[shard][s]]
Caused by: java.lang.IllegalArgumentException: ReleasableBytesStreamOutput cannot hold more than 2GB of data
at org.elasticsearch.common.io.stream.BytesStreamOutput.ensureCapacity(BytesStreamOutput.java:155) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.io.stream.ReleasableBytesStreamOutput.ensureCapacity(ReleasableBytesStreamOutput.java:69) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.io.stream.BytesStreamOutput.writeBytes(BytesStreamOutput.java:89) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.io.Streams$FlushOnCloseOutputStream.writeBytes(Streams.java:266) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:406) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.bytes.BytesReference.writeTo(BytesReference.java:68) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.io.stream.StreamOutput.writeBytesReference(StreamOutput.java:150) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.index.get.GetResult.writeTo(GetResult.java:365) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.action.get.GetResponse.writeTo(GetResponse.java:201) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.action.get.MultiGetShardResponse.writeTo(MultiGetShardResponse.java:89) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1243) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1199) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1178) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:67) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:60) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:295) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:287) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1553) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.6.4.jar:5.6.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
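For context, the kind of request that triggers this is an ordinary _mget over many large documents; a minimal sketch of such a call (the index, type, and ids here are illustrative placeholders, not our real data):
# index/type/ids below are placeholders
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_mget' -d '
{
  "docs": [
    { "_index": "myindex", "_type": "doc", "_id": "1" },
    { "_index": "myindex", "_type": "doc", "_id": "2" }
  ]
}'
When the requested docs that map to a single shard add up to more than 2 GiB, the shard response apparently hits the ReleasableBytesStreamOutput limit in the trace above, and the bytes reserved for it seem to never leave the breaker.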