[RFC] Shallow Snapshots V2 - Scaling Snapshot Operations #15083
Comments
Thanks @anshu1106 for the RFC. I have a few comments/suggestions:
Some of the GET requests today directly read the files stored in the repository. With this change, we will be adding the overhead of checking for the presence of a pinned timestamp as well, which would mean an additional get or list call on the remote store repository.
This would also mean that, in case of failure, we will have to delete whatever has already been uploaded. We can avoid all this by introducing a cluster state entry for the snapshot, which would handle cluster manager swap issues, and that way we can update the repository data and other metadata files at the end. As the operations happen only on the cluster manager node and only at the index level, it should not be that costly.
Yeah, that's correct. The rationale behind this is to make the create operation faster, since it can be executed more frequently and can run in an automated manner.
Correct, we did spend some time on this. It makes sense to go ahead without a cluster state entry to reduce the response time of the create snapshot operation, since cluster state updates may have to compete with other higher priority tasks. But it looks like, without a cluster state entry, we may lose some of the out-of-the-box concurrency checks that snapshots currently support. It makes sense to have a cluster state entry to avoid conflicting concurrent operations, and as you mentioned, it will be helpful during cluster manager switch scenarios.
Will add the details.
Objective
Shallow copy snapshots are currently taken for remote store enabled clusters, capturing references to data stored in a remote store rather than copying all the data. This approach makes the snapshot process faster and more efficient, as it doesn’t require transferring large amounts of data. In a shallow copy snapshot, a reference to the remote store metadata file is stored in the snapshot shard metadata.
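As a rough illustration of what a shallow copy shard snapshot records (the names below are hypothetical and only sketch the idea, not the actual OpenSearch classes), the shard-level entry points at the remote segment metadata file instead of containing copies of segment files:

```java
// Hypothetical sketch of a shallow copy shard snapshot entry: instead of segment
// data, it only carries a pointer to the remote store segment metadata file.
public record ShallowShardSnapshot(
    String snapshotUuid,
    String indexUuid,
    int shardId,
    // Name of the remote segment metadata file that lists the live segment files;
    // the segments themselves stay in the remote store and are never copied.
    String remoteSegmentMetadataFile,
    // Repository holding the remote store data that the snapshot references.
    String remoteStoreRepository
) {}
```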
In this RFC we discuss the shortcomings of the current snapshot flow and propose mechanisms to make snapshot operations scale independently of the number of shards in the cluster.
Issues with the current snapshot flow
For create snapshot, there are two communication channels between the cluster manager and all other nodes.
The following image depicts the interaction between the cluster manager node, the data nodes, and the ClusterState object described above.
This two-way communication overhead increases as the number of shards grows.
Non-deterministic snapshot creation duration - The cluster state update tasks for shard status updates can be delayed by higher priority tasks, making snapshot creation duration unpredictable.
Excessive locking overhead
Requirements
Proposed Solution
Centralize Snapshot Creation with Timestamp pinning
Since a shallow snapshot keeps a reference to the remote store segment metadata in the snapshot metadata and doesn't require data nodes to upload the segment files, we propose to move snapshot checkpointing to the cluster manager. This eliminates all the communication overhead between data nodes and the cluster manager node. It can also remove the need for per-shard snapshot status to be maintained in the cluster state.
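A minimal sketch of how such a cluster-manager-only create flow could look, assuming the ordering discussed under Additional Considerations (metadata upload first, timestamp pinning last). All interfaces and method names here are illustrative placeholders, not actual OpenSearch APIs:

```java
// Illustrative only: the whole create flow runs on the cluster manager, with no
// fan-out to data nodes and no per-shard entries in the cluster state.
public class ShallowSnapshotV2CreateFlow {

    interface Repository {
        void uploadSnapshotMetadata(String snapshotUuid);   // snapshot-level metadata only
    }

    interface TimestampPinningService {
        void pinTimestamp(long timestampMillis, String pinningEntity);
    }

    void createSnapshot(String snapshotUuid, Repository repo, TimestampPinningService pinning) {
        // The snapshot checkpoint is simply the current time on the cluster manager.
        long checkpointMillis = System.currentTimeMillis();

        // 1. Upload snapshot-level metadata that records the checkpoint.
        repo.uploadSnapshotMetadata(snapshotUuid);

        // 2. Pin the checkpoint so remote store GC keeps the segment and translog
        //    files that were live at this timestamp (see Timestamp Pinning below).
        pinning.pinTimestamp(checkpointMillis, snapshotUuid);
    }
}
```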
Timestamp Pinning - Implement Timestamp pinning to replace the explicit locking mechanism. With Timestamp Pinning, there is no need to create a lock file for each shard. Instead, timestamps are used to implicitly lock segment files referenced by the snapshot and prevent deletion during GC. This approach reduces the number of remote store calls for locks during snapshot creation and deletion.
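The sketch below, with an assumed metadata model (nothing here reflects the real remote store GC code), shows how pinned timestamps could act as implicit locks: a remote segment metadata generation is kept as long as any pinned timestamp falls inside the window in which that generation was the live one.

```java
import java.util.Set;

// Hypothetical GC-side check: no per-shard lock files, only a comparison of pinned
// timestamps against the lifetime window of each metadata generation.
public class PinnedTimestampGc {

    // A metadata generation is "live" from the time it was written until the time
    // it was superseded by the next generation.
    record MetadataGeneration(String fileName, long createdAtMillis, long supersededAtMillis) {}

    // The generation is implicitly locked if any pinned timestamp lies in its window.
    static boolean isImplicitlyLocked(MetadataGeneration gen, Set<Long> pinnedTimestamps) {
        return pinnedTimestamps.stream()
            .anyMatch(ts -> ts >= gen.createdAtMillis() && ts < gen.supersededAtMillis());
    }

    static boolean safeToDelete(MetadataGeneration gen, Set<Long> pinnedTimestamps) {
        return !isImplicitlyLocked(gen, pinnedTimestamps);
    }
}
```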
High Level Design Details
Create Snapshot Operation
For all operations, create snapshot is considered successful only if a corresponding pinned timestamp is found for the snapshot. GET operations require a filter that checks for the presence of a pinned timestamp for shallow snapshots v2 before marking the snapshot status as SUCCESSFUL.
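A small sketch of that filter, with hypothetical types (the real status resolution would live in the snapshot GET/status handling, not in a standalone class like this):

```java
// Illustrative status filter: a shallow v2 snapshot whose metadata exists but whose
// pinned timestamp is missing must not be reported as SUCCESS.
public class SnapshotStatusFilter {

    enum SnapshotState { SUCCESS, FAILED }

    interface PinnedTimestampStore {
        // Assumed lookup, e.g. keyed by the snapshot UUID used as the pinning entity.
        boolean hasPinnedTimestamp(String pinningEntity);
    }

    SnapshotState resolveState(String snapshotUuid, boolean isShallowV2, PinnedTimestampStore pins) {
        if (isShallowV2 && !pins.hasPinnedTimestamp(snapshotUuid)) {
            // The create never completed (pinning is the last step), so fail it.
            return SnapshotState.FAILED;
        }
        return SnapshotState.SUCCESS;
    }
}
```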
Benefits of the proposed approach
Additional Considerations
Timestamp pinning to be done after uploading snapshot metadata files
If a cluster manager switch happens after timestamp pinning but before uploading the snapshot metadata files, we would need a mechanism to let the new cluster manager know about this state. To avoid the overhead of additional state management, we plan to update the pinned timestamp after snapshot finalization. In case the pinned timestamp update fails, we can delete the metadata files uploaded to the snapshot repo and fail the snapshot.
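In code, that ordering and rollback could look roughly like the following; the interfaces are hypothetical placeholders for whatever the repository and pinning services end up being:

```java
// Sketch of "pin last, roll back on failure": metadata is uploaded first, the
// timestamp is pinned only after finalization, and a pinning failure deletes the
// uploaded metadata and fails the snapshot.
public class SnapshotFinalization {

    interface Repository {
        void uploadSnapshotMetadata(String snapshotUuid);
        void deleteSnapshotMetadata(String snapshotUuid);
    }

    interface TimestampPinningService {
        void pinTimestamp(long timestampMillis, String pinningEntity) throws Exception;
    }

    void finalizeSnapshot(String snapshotUuid, long checkpointMillis,
                          Repository repo, TimestampPinningService pinning) throws Exception {
        repo.uploadSnapshotMetadata(snapshotUuid);
        try {
            // Pinning happens only after finalization, so a cluster manager switch in
            // between never leaves a pin that the new cluster manager is unaware of.
            pinning.pinTimestamp(checkpointMillis, snapshotUuid);
        } catch (Exception e) {
            // Pinning failed: clean up the uploaded metadata and fail the snapshot.
            repo.deleteSnapshotMetadata(snapshotUuid);
            throw e;
        }
    }
}
```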
We don’t need to write snapshot shard level metadata file
Handling Cluster Manager failure
The snapshot request is acknowledged only after completion and after verifying that no cluster manager switch occurred during the process. If the cluster manager term and version change during the snapshot operation, the request fails.
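A minimal sketch of that guard, assuming the term and version observed when the operation started are compared against the values at acknowledgement time (names are illustrative):

```java
// Hypothetical acknowledgement guard: if the cluster manager term/version moved
// while the snapshot was being created, the request is failed instead of acked.
public class ClusterManagerSwitchGuard {

    record TermAndVersion(long term, long version) {}

    static void verifyNoSwitch(TermAndVersion atStart, TermAndVersion atEnd) {
        if (atStart.term() != atEnd.term() || atStart.version() != atEnd.version()) {
            throw new IllegalStateException(
                "Cluster manager term/version changed during snapshot creation; failing the request");
        }
    }
}
```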