Description
Elasticsearch version: 6.2.3
Plugins installed: [x-pack-deprecation,x-pack-graph,x-pack-logstash,x-pack-ml,x-pack-monitoring,x-pack-security,x-pack-upgrade,x-pack-watcher]
JVM version:
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)
OS version:
3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
We have had 2 data nodes crash and restart today with the following in the log (see below).
I wonder if this is similar to #23099 and #25016.
In the logs immediately preceding the crash I see a lot of the following collector timeouts; they go back several days, since the last time the node was restarted.
[2018-07-31T17:22:21,640][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:23:11,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:01,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:31,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:25:21,643][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
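For context (not part of the original report): one way to tell whether the node itself is slow to produce stats, rather than the monitoring collector simply hitting its timeout, is to call the node stats and hot threads APIs directly while these errors are being logged. A minimal sketch using the Python client; the host is a placeholder and only the node name is taken from the log lines above:

```python
# Hypothetical diagnostic sketch, not from the report: query node stats and
# hot threads directly while the collector timeouts are occurring, to see
# whether the node is genuinely slow to respond. The host is a placeholder;
# the node name comes from the log lines above.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"], timeout=60)

# Same data the node_stats collector asks for, but with a generous client timeout.
stats = es.nodes.stats(node_id="dl1-dn00141-d2")
for node in stats["nodes"].values():
    print(node["name"], node["jvm"]["mem"]["heap_used_percent"], "% heap used")

# Hot threads usually show what the node is busy with when stats calls stall.
print(es.nodes.hot_threads(node_id="dl1-dn00141-d2", threads=5))
```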
Steps to reproduce:
The cluster has been running normally, no changes.
55 nodes
780 indices
2976 primary shards
5642 replica shards
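As a side note (not from the report), the node and shard counts above are the kind of figures the cluster health API returns; a small sketch, again with a placeholder host:

```python
# Hypothetical sketch: pull the node/shard figures quoted above from the
# cluster health API. The host is a placeholder.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
health = es.cluster.health()
print(health["number_of_nodes"], "nodes")
print(health["active_primary_shards"], "primary shards")
print(health["active_shards"] - health["active_primary_shards"], "replica shards")
```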
Provide logs (if relevant):
[2018-07-31T17:26:41,991][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [dl1-dn00141-d2] fatal error in thread [elasticsearch[dl1-dn00141-d2][bulk][T#12]], exiting
java.lang.AssertionError: Unexpected AlreadyClosedException
    at org.elasticsearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:1786) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:1803) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:905) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:738) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:707) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:680) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:518) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:480) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:466) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:567) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:530) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2315) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2293) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2292) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:641) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:513) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
    at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1667) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.translog.Translog.add(Translog.java:508) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:887) ~[elasticsearch-6.2.3.jar:6.2.3]
    ... 24 more
[2018-07-31T23:13:30,678][INFO ][o.e.n.Node ] [dl1-dn00141-d2] initializing ...
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment ] [dl1-dn00141-d2] using [4] data paths, mounts [[/data8 (/dev/sdh), /data6 (/dev/sdf), /data7 (/dev/sdg), /data5 (/dev/sde)]], net usable_space [32.2tb], net total_space [36.3tb], types [xfs]
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment ] [dl1-dn00141-d2] heap size [29gb], compressed ordinary object pointers [true]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] node name [dl1-dn00141-d2], node ID [MDYk1hFyS02s1Jxq6NNa7g]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] version[6.2.3], pid[27304], build[c59ff00/2018-03-13T10:06:29.741383Z], OS[Linux/3.10.0-862.3.3.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_171/25.171-b10]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] JVM arguments [-Xms30g, -Xmx30g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=65, -XX:+UseCMSInitiatingOccupancyOnly, -XX:ParallelGCThreads=10, -XX:ConcGCThreads=5, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Xloggc:/data5/elasticsearch/logs/datanode2/gc.log, -XX:-UsePerfData, -XX:SurvivorRatio=4, -XX:NewSize=6g, -XX:MaxNewSize=6g, -XX:+UnlockDiagnosticVMOptions, -XX:ParGCCardsPerStrideChunk=32768, -XX:MaxTenuringThreshold=8, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintClassHistogram, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -XX:+PrintPromotionFailure, -XX:PrintFLSStatistics=2, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=10, -XX:GCLogFileSize=512M, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch/datanode2]