
unfollow call not returning when remote cluster is not available #53174

@jepert

Description

Elasticsearch version (bin/elasticsearch --version):

Version: 7.6.1, Build: default/rpm/aa751e09be0a5072e8570670309b1f12348f023b/2020-02-29T00:15:25.529771Z, JVM: 13.0.2

Plugins installed: []

JVM version (java -version):

openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.2+8)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.2+8, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):

Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
While testing a disaster recovery scenario with CCR, I found two different behaviours when unfollowing indices while the leader is not available (the remote cluster is down):

  1. The unfollow call returns immediately with connect_transport_exception exceptions.

  2. The unfollow call never returns (I waited for more than an hour).
     The cluster shows the unfollow task running, but it never dies (see the task-listing sketch below).
     This second case seems to happen with indices where number_of_shards > 1.

Expected behavior: both calls return with a status.
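
A quick way to confirm the second case is stuck is to list the in-flight unfollow tasks on the follower cluster. A minimal sketch using the tasks API (the action name is the one that appears in the task listing further down; host and credentials are the ones used throughout this reproduction):

#List any in-flight unfollow tasks and their parent/child relationships
curl -XGET 'http://192.168.1.226:9200/_tasks?actions=indices:admin/xpack/ccr/unfollow&detailed=true&pretty' -u elastic:badpassword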

Steps to reproduce:
To reproduce, I did a fresh install of two 3-node clusters with the latest version of ES (rpm).

elasticsearch.yml:

cluster.name: primary
node.name: node1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: _eth0_
http.port: 9200
discovery.seed_hosts: ["192.168.1.219", "192.168.1.222", "192.168.1.223"]
cluster.initial_master_nodes: ["node1", "node2", "node3"]
gateway.recover_after_nodes: 2
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
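
The config above is for the primary (leader) cluster. The secondary (follower) cluster was set up the same way; a sketch of its elasticsearch.yml, assuming the node names and addresses that appear in the task listing and attached log file names below, with everything else mirroring the primary config:

# follower cluster (cluster/node names and addresses assumed from the task listing and log file names below)
cluster.name: secondary
node.name: node1s
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: _eth0_
http.port: 9200
discovery.seed_hosts: ["192.168.1.224", "192.168.1.225", "192.168.1.226"]
cluster.initial_master_nodes: ["node1s", "node2s", "node3s"]
gateway.recover_after_nodes: 2
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12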

Steps:

#Start trial on both clusters

curl -X POST "192.168.1.219:9200/_license/start_trial?acknowledge=true&pretty" -u elastic:badpassword
curl -X POST "192.168.1.226:9200/_license/start_trial?acknowledge=true&pretty" -u elastic:badpassword

#remote cluster connection

curl -u elastic:badpassword -X PUT "192.168.1.226:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster": {
      "remote": {
        "remote-primary": {
          "seeds": [
            "192.168.1.219:9300"
          ],
          "transport.ping_schedule": "30s"
        }
      }
    }
  }
}
'

#Verify remote

curl -XGET  'http://192.168.1.226:9200/_remote/info' -u elastic:badpassword 

#Create leader index with 1 shard on primary cluster

curl -u elastic:badpassword -X PUT "192.168.1.219:9200/testing-11?pretty" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 1 
        }
    }
}
'

#Create leader index with 2 shards on primary cluster

curl -u elastic:badpassword -X PUT "192.168.1.219:9200/testing-21?pretty" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 2, 
            "number_of_replicas" : 1 
        }
    }
}
'

#Verify new indices

curl -XGET 'http://192.168.1.219:9200/_cat/indices/test*?v' -u elastic:badpassword

#Create followers on secondary cluster

curl -u elastic:badpassword -X PUT "192.168.1.226:9200/testing-11/_ccr/follow?wait_for_active_shards=1&pretty" -H 'Content-Type: application/json' -d'
{
  "remote_cluster" : "remote-primary",
  "leader_index" : "testing-11"
}
'


curl -u elastic:badpassword -X PUT "192.168.1.226:9200/testing-21/_ccr/follow?wait_for_active_shards=1&pretty" -H 'Content-Type: application/json' -d'
{
  "remote_cluster" : "remote-primary",
  "leader_index" : "testing-21"
}
'

#Verify followers

curl -XGET  'http://192.168.1.226:9200/_cat/indices/test*?v' -u elastic:badpassword
curl -XGET  'http://192.168.1.226:9200/_ccr/stats?pretty' -u elastic:badpassword

#Shut down the primary cluster nodes (run on each node of the primary cluster)

systemctl stop elasticsearch

#Verify ccr status

curl -XGET  'http://192.168.1.226:9200/_remote/info' -u elastic:badpassword
curl -XGET  'http://192.168.1.226:9200/_ccr/stats?pretty&filter_path=follow_stats.indices.shards.read_exceptions.*' -u elastic:badpassword

#Pause following on both indices and verify

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_ccr/pause_follow?pretty"
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_ccr/pause_follow?pretty"
curl -XGET  'http://192.168.1.226:9200/testing-*/_ccr/info?pretty' -u elastic:badpassword

#Close

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_close?pretty"
curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_close?pretty"
curl -XGET  'http://192.168.1.226:9200/_cat/indices/test*?v' -u elastic:badpassword

#unfollow

#This call to unfollow returns a connect_transport_exception

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-11/_ccr/unfollow?pretty"

#This call to unfollow never returns

curl -u elastic:badpassword -X POST "192.168.1.226:9200/testing-21/_ccr/unfollow?pretty"

#Verify long running tasks

curl -XGET  'http://192.168.1.226:9200/_cat/tasks?v' -u elastic:badpassword
action                           task_id                      parent_task_id              type      start_time    timestamp running_time ip            node
indices:admin/xpack/ccr/unfollow V-frNb4_Sc2RTGZ_uYne-g:665   -                           transport 1583414050796 13:14:10  2h           192.168.1.226 node3s
indices:admin/xpack/ccr/unfollow dm-h5Zx7TxCUoOvn1I_DWQ:1208  V-frNb4_Sc2RTGZ_uYne-g:665  transport 1583414050798 13:14:10  2h           192.168.1.224 node2s
cluster:monitor/tasks/lists      V-frNb4_Sc2RTGZ_uYne-g:8609  -                           transport 1583421254522 15:14:14  756.7micros  192.168.1.226 node3s
cluster:monitor/tasks/lists[n]   V-frNb4_Sc2RTGZ_uYne-g:8610  V-frNb4_Sc2RTGZ_uYne-g:8609 direct    1583421254523 15:14:14  241.5micros  192.168.1.226 node3s
cluster:monitor/tasks/lists[n]   dm-h5Zx7TxCUoOvn1I_DWQ:16577 V-frNb4_Sc2RTGZ_uYne-g:8609 transport 1583421254530 15:14:14  247.2micros  192.168.1.224 node2s
cluster:monitor/tasks/lists[n]   YIzbKYCoQ-SDyJTkAVSBQg:8917  V-frNb4_Sc2RTGZ_uYne-g:8609 transport 1583421254533 15:14:14  343.3micros  192.168.1.225 node1s
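
For completeness, the stuck parent task can also be inspected directly, and a cancel can be attempted (task ID taken from the listing above; it is not clear whether the unfollow task is actually cancellable):

#Inspect the stuck parent unfollow task
curl -XGET 'http://192.168.1.226:9200/_tasks/V-frNb4_Sc2RTGZ_uYne-g:665?pretty' -u elastic:badpassword

#Attempt to cancel it (may have no effect if the task is not cancellable)
curl -u elastic:badpassword -X POST "192.168.1.226:9200/_tasks/V-frNb4_Sc2RTGZ_uYne-g:665/_cancel?pretty"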

Provide logs (if relevant):

secondary-v761-node1s.log

secondary-v761-node2s.log

secondary-v761-node3s.log

Metadata

    Labels

    :Distributed Indexing/CCR, >bug
