[RFC] Shallow Snapshots V2 - Scaling Snapshot Operations #15083
Comments
Thanks @anshu1106 for the RFC. I have a few comments/suggestions:
Some of the GET requests today directly read the files stored in the repository. With this change, we will be adding the overhead of checking for the presence of a pinned timestamp as well, which would mean an additional get or list call on the remote store repository.
This would also mean that, in case of failure, we will have to delete whatever has already been uploaded. We can avoid all this by introducing a cluster state entry for the snapshot, which would handle cluster manager swap issues, and that way we can update the repository data and other metadata files at the end. As the operations happen only on the cluster manager node and only at the index level, it should not be that costly.
Yeah, that's correct. The rationale behind this is to make the create operation faster, since it can be executed more frequently and can run in an automated manner.
Correct, we did spend some time on this. It makes sense to go ahead without a cluster state entry to reduce the response time of the create snapshot operation, since cluster state updates may have to compete with other higher priority tasks. But it looks like, without a cluster state entry, we may lose some of the out-of-the-box concurrency checks that snapshots currently support. It makes sense to have a cluster state entry to avoid conflicting concurrent operations, and as you mentioned, it will be helpful during cluster manager switch scenarios.
Will add the details.
Objective
Shallow copy snapshots are currently taken for remote store enabled clusters, capturing references to data stored in a remote store rather than copying all the data. This approach makes the snapshot process faster and more efficient, as it doesn’t require transferring large amounts of data. In a shallow copy snapshot, a reference to the remote store metadata file is stored in the snapshot shard metadata.
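As a rough illustration of what a shallow copy shard snapshot records (the names below are hypothetical and only sketch the idea, not the actual OpenSearch classes), the shard-level entry points at the remote segment metadata file instead of containing copies of segment files:

```java
// Hypothetical sketch of a shallow copy shard snapshot entry: instead of segment
// data, it only carries a pointer to the remote store segment metadata file.
public record ShallowShardSnapshot(
    String snapshotUuid,
    String indexUuid,
    int shardId,
    // Name of the remote segment metadata file that lists the live segment files;
    // the segments themselves stay in the remote store and are never copied.
    String remoteSegmentMetadataFile,
    // Repository holding the remote store data that the snapshot references.
    String remoteStoreRepository
) {}
```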
In this RFC we discuss the shortcomings of the current snapshot flow and propose mechanisms to make snapshot operations scale independently of the number of shards in the cluster.
Issues with the current snapshot flow
For create snapshot, there are two communication channels between the cluster manager and all other nodes.
The following image depicts the interaction between the cluster manager node, the data nodes, and the ClusterState object described above.
This two-way communication overhead increases as the number of shards grows.
Non-deterministic snapshot creation duration - The cluster state update tasks for shard status updates can be delayed by higher priority tasks, making snapshot creation duration unpredictable.
Excessive locking overhead
Requirements
Proposed Solution
Centralize Snapshot Creation with Timestamp pinning
Since a shallow snapshot keeps a reference to the remote store segment metadata in the snapshot metadata and doesn't require data nodes to upload the segment files, we propose to move snapshot checkpointing to the cluster manager. This eliminates all the communication overhead between data nodes and the cluster manager node. It can also remove the need for per-shard snapshot status to be maintained in the cluster state.
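A minimal sketch of how such a cluster-manager-only create flow could look, assuming the ordering discussed under Additional Considerations (metadata upload first, timestamp pinning last). All interfaces and method names here are illustrative placeholders, not actual OpenSearch APIs:

```java
// Illustrative only: the whole create flow runs on the cluster manager, with no
// fan-out to data nodes and no per-shard entries in the cluster state.
public class ShallowSnapshotV2CreateFlow {

    interface Repository {
        void uploadSnapshotMetadata(String snapshotUuid);   // snapshot-level metadata only
    }

    interface TimestampPinningService {
        void pinTimestamp(long timestampMillis, String pinningEntity);
    }

    void createSnapshot(String snapshotUuid, Repository repo, TimestampPinningService pinning) {
        // The snapshot checkpoint is simply the current time on the cluster manager.
        long checkpointMillis = System.currentTimeMillis();

        // 1. Upload snapshot-level metadata that records the checkpoint.
        repo.uploadSnapshotMetadata(snapshotUuid);

        // 2. Pin the checkpoint so remote store GC keeps the segment and translog
        //    files that were live at this timestamp (see Timestamp Pinning below).
        pinning.pinTimestamp(checkpointMillis, snapshotUuid);
    }
}
```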
Timestamp Pinning - Implement Timestamp pinning to replace the explicit locking mechanism. With Timestamp Pinning, there is no need to create a lock file for each shard. Instead, timestamps are used to implicitly lock segment files referenced by the snapshot and prevent deletion during GC. This approach reduces the number of remote store calls for locks during snapshot creation and deletion.
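The sketch below, with an assumed metadata model (nothing here reflects the real remote store GC code), shows how pinned timestamps could act as implicit locks: a remote segment metadata generation is kept as long as any pinned timestamp falls inside the window in which that generation was the live one.

```java
import java.util.Set;

// Hypothetical GC-side check: no per-shard lock files, only a comparison of pinned
// timestamps against the lifetime window of each metadata generation.
public class PinnedTimestampGc {

    // A metadata generation is "live" from the time it was written until the time
    // it was superseded by the next generation.
    record MetadataGeneration(String fileName, long createdAtMillis, long supersededAtMillis) {}

    // The generation is implicitly locked if any pinned timestamp lies in its window.
    static boolean isImplicitlyLocked(MetadataGeneration gen, Set<Long> pinnedTimestamps) {
        return pinnedTimestamps.stream()
            .anyMatch(ts -> ts >= gen.createdAtMillis() && ts < gen.supersededAtMillis());
    }

    static boolean safeToDelete(MetadataGeneration gen, Set<Long> pinnedTimestamps) {
        return !isImplicitlyLocked(gen, pinnedTimestamps);
    }
}
```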
High Level Design Details
Create Snapshot Operation
For all operations, create snapshot is considered successful only if a corresponding pinned timestamp is found for the snapshot. GET operations require a filter that checks for the presence of a pinned timestamp for shallow snapshots v2 before marking the snapshot status as SUCCESSFUL.
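A small sketch of that filter, with hypothetical types (the real status resolution would live in the snapshot GET/status handling, not in a standalone class like this):

```java
// Illustrative status filter: a shallow v2 snapshot whose metadata exists but whose
// pinned timestamp is missing must not be reported as SUCCESS.
public class SnapshotStatusFilter {

    enum SnapshotState { SUCCESS, FAILED }

    interface PinnedTimestampStore {
        // Assumed lookup, e.g. keyed by the snapshot UUID used as the pinning entity.
        boolean hasPinnedTimestamp(String pinningEntity);
    }

    SnapshotState resolveState(String snapshotUuid, boolean isShallowV2, PinnedTimestampStore pins) {
        if (isShallowV2 && !pins.hasPinnedTimestamp(snapshotUuid)) {
            // The create never completed (pinning is the last step), so fail it.
            return SnapshotState.FAILED;
        }
        return SnapshotState.SUCCESS;
    }
}
```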
Benefits of the proposed approach
Additional Considerations
Timestamp pinning to be done after uploading snapshot metadata files
If a cluster manager switch happens after timestamp pinning but before uploading the snapshot metadata files, we would need a mechanism to let the new cluster manager know about this state. To avoid the overhead of additional state management, we plan to update the pinned timestamp after snapshot finalization. In case the pinned timestamp update fails, we can delete the metadata files uploaded to the snapshot repo and fail the snapshot.
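In code, that ordering and rollback could look roughly like the following; the interfaces are hypothetical placeholders for whatever the repository and pinning services end up being:

```java
// Sketch of "pin last, roll back on failure": metadata is uploaded first, the
// timestamp is pinned only after finalization, and a pinning failure deletes the
// uploaded metadata and fails the snapshot.
public class SnapshotFinalization {

    interface Repository {
        void uploadSnapshotMetadata(String snapshotUuid);
        void deleteSnapshotMetadata(String snapshotUuid);
    }

    interface TimestampPinningService {
        void pinTimestamp(long timestampMillis, String pinningEntity) throws Exception;
    }

    void finalizeSnapshot(String snapshotUuid, long checkpointMillis,
                          Repository repo, TimestampPinningService pinning) throws Exception {
        repo.uploadSnapshotMetadata(snapshotUuid);
        try {
            // Pinning happens only after finalization, so a cluster manager switch in
            // between never leaves a pin that the new cluster manager is unaware of.
            pinning.pinTimestamp(checkpointMillis, snapshotUuid);
        } catch (Exception e) {
            // Pinning failed: clean up the uploaded metadata and fail the snapshot.
            repo.deleteSnapshotMetadata(snapshotUuid);
            throw e;
        }
    }
}
```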
We don’t need to write snapshot shard level metadata file
Handling Cluster Manager failure
The snapshot request is acknowledged only after completion and after verifying that no cluster manager switch occurred during the process. If the cluster manager term and version change during the snapshot operation, the request fails.
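A minimal sketch of that guard, assuming the term and version observed when the operation started are compared against the values at acknowledgement time (names are illustrative):

```java
// Hypothetical acknowledgement guard: if the cluster manager term/version moved
// while the snapshot was being created, the request is failed instead of acked.
public class ClusterManagerSwitchGuard {

    record TermAndVersion(long term, long version) {}

    static void verifyNoSwitch(TermAndVersion atStart, TermAndVersion atEnd) {
        if (atStart.term() != atEnd.term() || atStart.version() != atEnd.version()) {
            throw new IllegalStateException(
                "Cluster manager term/version changed during snapshot creation; failing the request");
        }
    }
}
```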