Failed data nodes due to Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed #32526

@TinLe

Elasticsearch version: 6.2.3

Plugins installed: [x-pack-deprecation,x-pack-graph,x-pack-logstash,x-pack-ml,x-pack-monitoring,x-pack-security,x-pack-upgrade,x-pack-watcher]

JVM version :
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)

OS version :
3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Two data nodes crashed and restarted today with the following in the log (see below).

I wonder if this is similar to #23099 and #25016.

I also see many of these in the logs immediately preceding the crash, several days' worth since the last time the node was restarted.

[2018-07-31T17:22:21,640][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:23:11,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:01,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:31,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:25:21,643][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
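
To quantify how often the collector timed out before the crash, the log can be scanned for these entries. The following is a minimal sketch, not part of the original report: the `LOG_HEADER` pattern and the `collector_timeouts` helper are hypothetical names, and the pattern assumes every entry follows the `[timestamp][LEVEL][logger] [node] message` layout seen in the lines above.

```python
import re
from datetime import datetime

# Assumed layout, taken from the log lines in this report:
# [timestamp][LEVEL][logger] [node] message
LOG_HEADER = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>[A-Z ]+)\]"
    r"\[(?P<logger>[^\]]+)\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"(?P<msg>.*)"
)


def collector_timeouts(lines, collector="node_stats"):
    """Return the timestamps of 'timed out' entries for one monitoring collector."""
    needle = f"collector [{collector}] timed out"
    stamps = []
    for line in lines:
        match = LOG_HEADER.match(line)
        if match and needle in match.group("msg"):
            stamps.append(
                datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
            )
    return stamps


# The five entries quoted in this report.
prefix = "[o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] "
message = "collector [node_stats] timed out when collecting data"
sample = [
    f"[2018-07-31T17:22:21,640][ERROR]{prefix}{message}",
    f"[2018-07-31T17:23:11,641][ERROR]{prefix}{message}",
    f"[2018-07-31T17:24:01,642][ERROR]{prefix}{message}",
    f"[2018-07-31T17:24:31,642][ERROR]{prefix}{message}",
    f"[2018-07-31T17:25:21,643][ERROR]{prefix}{message}",
]

stamps = collector_timeouts(sample)
# Seconds between consecutive timeouts.
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
print(len(stamps), gaps)
```

In practice `lines` would be an open log file rather than the inline sample; the gaps between timestamps show whether the timeouts were continuous or intermittent.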

Steps to reproduce:

The cluster has been running normally, no changes.

55 nodes
780 indices
2976 primary shards
5642 replica shards
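
For a rough sense of shard density, the figures above can be combined as follows. This is only a sketch under two assumptions the report does not confirm: that all 55 nodes hold data, and that shards are spread evenly across them.

```python
# Figures taken from the report.
nodes = 55
primary_shards = 2976
replica_shards = 5642

total_shards = primary_shards + replica_shards  # 8618
# Average only; assumes every node is a data node with an even share.
shards_per_node = total_shards / nodes

print(f"{total_shards} shards, about {shards_per_node:.1f} per node")
```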

Provide logs (if relevant):

[2018-07-31T17:26:41,991][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [dl1-dn00141-d2] fatal error in thread [elasticsearch[dl1-dn00141-d2][bulk][T#12]], exiting
java.lang.AssertionError: Unexpected AlreadyClosedException
    at org.elasticsearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:1786) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:1803) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:905) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:738) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:707) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:680) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:518) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:480) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:466) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:567) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:530) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2315) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2293) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2292) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:641) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:513) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
    at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1667) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.translog.Translog.add(Translog.java:508) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:887) ~[elasticsearch-6.2.3.jar:6.2.3]
    ... 24 more
[2018-07-31T23:13:30,678][INFO ][o.e.n.Node ] [dl1-dn00141-d2] initializing ...
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment ] [dl1-dn00141-d2] using [4] data paths, mounts [[/data8 (/dev/sdh), /data6 (/dev/sdf), /data7 (/dev/sdg), /data5 (/dev/sde)]], net usable_space [32.2tb], net total_space [36.3tb], types [xfs]
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment ] [dl1-dn00141-d2] heap size [29gb], compressed ordinary object pointers [true]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] node name [dl1-dn00141-d2], node ID [MDYk1hFyS02s1Jxq6NNa7g]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] version[6.2.3], pid[27304], build[c59ff00/2018-03-13T10:06:29.741383Z], OS[Linux/3.10.0-862.3.3.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_171/25.171-b10]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] JVM arguments [-Xms30g, -Xmx30g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=65, -XX:+UseCMSInitiatingOccupancyOnly, -XX:ParallelGCThreads=10, -XX:ConcGCThreads=5, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Xloggc:/data5/elasticsearch/logs/datanode2/gc.log, -XX:-UsePerfData, -XX:SurvivorRatio=4, -XX:NewSize=6g, -XX:MaxNewSize=6g, -XX:+UnlockDiagnosticVMOptions, -XX:ParGCCardsPerStrideChunk=32768, -XX:MaxTenuringThreshold=8, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintClassHistogram, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -XX:+PrintPromotionFailure, -XX:PrintFLSStatistics=2, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=10, -XX:GCLogFileSize=512M, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch/datanode2]
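
When comparing crashes across nodes, it can help to pull only the root cause out of each trace. The following is a minimal sketch, not an Elasticsearch tool: `root_cause` and `CAUSED_BY` are hypothetical names, and the pattern assumes the message ends where the first ` at ` stack frame begins, so a message that itself contains the word "at" would be cut short.

```python
import re

# Captures "Caused by: <exception class>: <message>" up to the first stack
# frame (" at ...") or the end of the text. Works on one-line or multi-line traces.
CAUSED_BY = re.compile(
    r"Caused by: (?P<cls>[\w.$]+): (?P<msg>.*?)(?=\s+at\s|\s*$)",
    re.DOTALL,
)


def root_cause(trace):
    """Return (exception class, message) of the last 'Caused by' clause, or None."""
    matches = list(CAUSED_BY.finditer(trace))
    if not matches:
        return None
    last = matches[-1]  # the innermost cause is printed last
    return last.group("cls"), last.group("msg")


# Abbreviated excerpt of the trace in this report.
trace = (
    "java.lang.AssertionError: Unexpected AlreadyClosedException "
    "at org.elasticsearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:1786) "
    "Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed "
    "at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1667)"
)

cause = root_cause(trace)
print(cause)
```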

Labels: :Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard.)
