Description
Elasticsearch version (bin/elasticsearch --version):
Version: 7.6.1, Build: default/rpm/aa751e09be0a5072e8570670309b1f12348f023b/2020-02-29T00:15:25.529771Z, JVM: 13.0.2
Plugins installed: []
JVM version (java -version):
openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.2+8)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.2+8, mixed mode, sharing)
OS version (uname -a if on a Unix-like system):
Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Testing a disaster recovery scenario with CCR, I found two different behaviours when unfollowing indices while the leader is not available (the remote cluster is down):
- The unfollow call returns immediately with connect_transport_exception exceptions.
- The unfollow call never returns (I waited for more than an hour). The cluster shows the unfollow task as running, but it never finishes.
The second case seems to happen with indices that have number_of_shards > 1.
Expected behavior: both calls return with a status.
Steps to reproduce:
To reproduce, I did a fresh install of two clusters with three nodes each, using the latest version of Elasticsearch (rpm).
elasticsearch.yml:
cluster.name: primary
node.name: node1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: _eth0_
http.port: 9200
discovery.seed_hosts: ["192.168.1.219", "192.168.1.222", "192.168.1.223"]
cluster.initial_master_nodes: ["node1", "node2", "node3"]
gateway.recover_after_nodes: 2
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
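The secondary cluster uses an analogous elasticsearch.yml; I did not copy it above, so the following is only a sketch reconstructed from the node names and IPs visible in the task output further down (node1s/node2s/node3s on 192.168.1.224-226), not the verbatim file:
cluster.name: secondary
node.name: node1s
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: _eth0_
http.port: 9200
discovery.seed_hosts: ["192.168.1.224", "192.168.1.225", "192.168.1.226"]
cluster.initial_master_nodes: ["node1s", "node2s", "node3s"]
gateway.recover_after_nodes: 2
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12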
Steps:
#Start trial on both clusters
curl -X POST "192.168.1.219:9200/_license/start_trial?acknowledge=true&pretty" -u elastic:badpassword
curl -X POST "192.168.1.226:9200/_license/start_trial?acknowledge=true&pretty" -u elastic:badpassword
#Configure the remote cluster connection on the secondary cluster
curl -u elastic:badpassword -X PUT "192.168.1.226:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
"persistent": {
"cluster": {
"remote": {
"remote-primary": {
"seeds": [
"192.168.1.219:9300"
],
"transport.ping_schedule": "30s"
}
}
}
}
}
'
#Verify remote
curl -XGET 'http://192.168.1.226:9200/_remote/info' -u elastic:badpassword
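#For reference (not captured output), a healthy remote connection should report something roughly like the following, with "connected": true; the values are illustrative:
{
  "remote-primary" : {
    "seeds" : [ "192.168.1.219:9300" ],
    "connected" : true,
    "num_nodes_connected" : 3,
    "max_connections_per_cluster" : 3,
    "initial_connect_timeout" : "30s",
    "skip_unavailable" : false
  }
}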
#Create leader index with 1 shard on primary cluster
curl -u elastic:badpassword -X PUT "192.168.1.219:9200/testing-11?pretty" -H 'Content-Type: application/json' -d'
{
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 1
}
}
}
'
#Create leader index with 2 shards on primary cluster
curl -u elastic:badpassword -X PUT "192.168.1.219:9200/testing-21?pretty" -H 'Content-Type: application/json' -d'
{
"settings" : {
"index" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
}
}
}
'
#Verify new indices
curl -XGET 'http://192.168.1.219:9200/_cat/indices/test*?v' -u elastic:badpassword
#Create followers on secondary cluster
curl -u elastic:badpassword -X PUT "192.168.1.226:9200/testing-11/_ccr/follow?wait_for_active_shards=1&pretty" -H 'Content-Type: application/json' -d'
{
"remote_cluster" : "remote-primary",
"leader_index" : "testing-11"
}
'
curl -u elastic:badpassword -X PUT "192.168.1.226:9200/testing-21/_ccr/follow?wait_for_active_shards=1&pretty" -H 'Content-Type: application/json' -d'
{
"remote_cluster" : "remote-primary",
"leader_index" : "testing-21"
}
'
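#Both follow calls succeeded at this point; the follow API normally acknowledges with something like this (illustrative, not captured output):
{
  "follow_index_created" : true,
  "follow_index_shards_acked" : true,
  "index_following_started" : true
}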
#Verify followers
curl -XGET 'http://192.168.1.226:9200/_cat/indices/test*?v' -u elastic:badpassword
curl -XGET 'http://192.168.1.226:9200/_ccr/stats?pretty' -u elastic:badpassword
#Shut down the primary cluster (run on every node of the primary cluster)
systemctl stop elasticsearch
#Verify ccr status
curl -XGET 'http://192.168.1.226:9200/_remote/info' -u elastic:badpassword
curl -XGET 'http://192.168.1.226:9200/_ccr/stats?pretty&filter_path=follow_stats.indices.shards.read_exceptions.*' -u elastic:badpassword
#Pause following on both indices and verify
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_ccr/pause_follow?pretty"
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_ccr/pause_follow?pretty"
curl -XGET 'http://192.168.1.226:9200/testing-*/_ccr/info?pretty' -u elastic:badpassword
#Close the follower indices
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_close?pretty"
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_close?pretty"
curl -XGET 'http://192.168.1.226:9200/_cat/indices/test*?v' -u elastic:badpassword
#unfollow
#this call to unfollow returns a connect_transport_exception
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_ccr/unfollow?pretty"
#this call to unfollow never returns
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_ccr/unfollow?pretty"
#Verify long running tasks
curl -XGET 'http://192.168.1.226:9200/_cat/tasks?v' -u elastic:badpassword
action task_id parent_task_id type start_time timestamp running_time ip node
indices:admin/xpack/ccr/unfollow V-frNb4_Sc2RTGZ_uYne-g:665 - transport 1583414050796 13:14:10 2h 192.168.1.226 node3s
indices:admin/xpack/ccr/unfollow dm-h5Zx7TxCUoOvn1I_DWQ:1208 V-frNb4_Sc2RTGZ_uYne-g:665 transport 1583414050798 13:14:10 2h 192.168.1.224 node2s
cluster:monitor/tasks/lists V-frNb4_Sc2RTGZ_uYne-g:8609 - transport 1583421254522 15:14:14 756.7micros 192.168.1.226 node3s
cluster:monitor/tasks/lists[n] V-frNb4_Sc2RTGZ_uYne-g:8610 V-frNb4_Sc2RTGZ_uYne-g:8609 direct 1583421254523 15:14:14 241.5micros 192.168.1.226 node3s
cluster:monitor/tasks/lists[n] dm-h5Zx7TxCUoOvn1I_DWQ:16577 V-frNb4_Sc2RTGZ_uYne-g:8609 transport 1583421254530 15:14:14 247.2micros 192.168.1.224 node2s
cluster:monitor/tasks/lists[n] YIzbKYCoQ-SDyJTkAVSBQg:8917 V-frNb4_Sc2RTGZ_uYne-g:8609 transport 1583421254533 15:14:14 343.3micros 192.168.1.225 node1s
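#Not part of my original steps, but the hung parent task can also be inspected individually through the task management API; sketch only, using the task id from the table above:
curl -XGET 'http://192.168.1.226:9200/_tasks/V-frNb4_Sc2RTGZ_uYne-g:665?pretty' -u elastic:badpassword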
Provide logs (if relevant):