
Shadow replicas (segment-based replication) #8976

@dakrone

Introduction

This is intended to be a high-level design of the shadow replicas feature. We
will attempt to describe how the shadow replicas will be designed, the steps
required to develop this feature, and any other design considerations.

Shadow replicas are the name for segment-based replication inside of
Elasticsearch, where instead of replicating documents on a document-by-document
level (indexing the document on both the primary and all replicas), we will
index the document only on the primary, and then replicate on-disk segments to
replica copies of the shard.

Design steps

There are three major stages in this process, each of which leads into the
next.

Stage 1: Custom data paths at the per-index level

In order to easily support a cluster containing both regular and shadow
replicas, we need the ability to change the location where certain indices look
for their data. This allows some indices to be configured to use a mounted
shared filesystem while others continue to use the regular path.data location.

Additionally, we would like to allow the user to specify a template for what
the data path should look like. This is useful for conforming to filesystem
naming requirements.

An example of what this could look like, when creating a new index:

POST /myindex
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "shadow_replicas": true,
    "data_path": "/tmp/myindex",
    "data_template": "node_{{node_id}}/i_{{index_name}}/shard_{{shard_num}}"
  }
}

This would then produce a directory structure that looks like:

$ tree /tmp/myindex/node_0
node_0
└── i_myindex
    └── data
        ├── shard_0
        │   ├── _state
        │   │   └── state-4.st
        │   ├── index
        │   │   ├── segments_1
        │   │   └── write.lock
        │   └── translog
        │       └── translog-1416309506087
        ├── shard_1
        │   └── ... etc ...
        ├── shard_2
        │   └── ... etc ...
        ├── shard_3
        │   └── ... etc ...
        └── shard_4
            └── ... etc ...

The custom template in this example would use our existing Mustache template
library, since it is already part of Elasticsearch.

Because this is a potentially dangerous setting from a security perspective,
the user will be required to set node.enable_custom_paths in elasticsearch.yml
on all nodes before data_path or data_template can be used.
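
For example, the node-level opt-in might look like this in elasticsearch.yml
(using the setting name proposed above; the exact name could change during
implementation):

# elasticsearch.yml -- must be set on every node
node.enable_custom_paths: true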

Stage 2: Implement shadow replicas assuming a shared file system

If we assume that the data in the index path will already be shared across
multiple nodes, we can create an index with shadow replicas, where each replica
shard simply contains an IndexReader that periodically refreshes to pick up
new segments.

All indexing operations will be executed on the primary shard, and will not be
replicated to each replica, since the data will be replicated in a different
way.

During this phase, creating an index with index.shadow_replicas: true and
number_of_replicas greater than 0 will cause indexing operations not to be
replicated to the replica shards. An index can have either regular replicas or
shadow replicas; they are mutually exclusive for an index. The
index.shadow_replicas setting is set at index creation time and cannot be
changed dynamically.

The Elasticsearch cluster will still detect the loss of a primary shard, and
transform the replica into a primary in this situation. This transformation will
take slightly longer, since no IndexWriter will be maintained for each shadow
replica.

In order to ensure the data is synchronized quickly enough, the user will need
to tune the flush threshold for the index to a desired value. A flush is needed
to fsync segment files to disk so they will be visible to all other replica
nodes. Users should test what flush threshold levels they are comfortable with,
as increased flushing can impact indexing performance. This testing can be
performed at any time; there is no need to wait for this feature to be
available first.
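
As a sketch of what this tuning might look like, assuming the existing translog
flush settings apply unchanged to shadow-replica indices, the flush threshold
could be lowered through the update-settings API (the 128mb value here is only
illustrative):

PUT /myindex/_settings
{
  "index.translog.flush_threshold_size": "128mb"
}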

Once segments are available on the filesystem where the shadow replica resides,
a regular refresh (governed by the index.refresh_interval setting) can be used
to make the new data searchable.
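
For example, the refresh interval can be adjusted dynamically in the same way
(the 5s value is only illustrative):

PUT /myindex/_settings
{
  "index.refresh_interval": "5s"
}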

Stage 3: Implement shadow replicas for regular file systems

For the last stage, we will perform the actual recovery of segments that are
created on the source node and need to be sent to the target node. The intent
is that in the case of a shared filesystem, we will still perform the check for
missing segment files; however, the list of segment files will always be in
sync, so there will be no need to copy over missing files.

For the case of a regular file system, we will re-use our existing recovery code
to check for missing segments and send them the same way we usually would during
a regular shard recovery.

In this case, we hope not to have a distinction between a shared filesystem and
a regular filesystem; the only difference on a shared filesystem will be that
when we check for missing segments, none will be missing.
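
Assuming the existing recovery reporting carries over to shadow replicas (an
assumption here, not something decided in this design), the result of this
file-level check could be inspected with the existing indices recovery API,
which reports how many files were reused versus copied for each shard:

GET /myindex/_recovery?human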

Benefits

  • We don't incur the cost of indexing (analysis, etc.) more than once
  • Recovery is faster because segments will be in sync on the primary and replica

Costs

  • Search and retrieval are not real time
  • If the primary fails, documents in the translog will be lost unless shared
    storage is used
  • File corruptions might be copied to replicas, since we don't have an
    alternative source of data
  • Increased network I/O, because we have to copy larger segments once they have
    been merged, even if no documents have changed
  • More frequent flushing (Lucene commits) could impact indexing speed
  • Fail-over could be slower because we don't maintain everything needed for a
    replica to quickly switch to a primary

Subsequent stories/PRs:
