[FEATURE] Use Pluggable translog for fetching the operations from leader #375
Comments
We do plan to move the method newChangesSnapshotFromTranslogFile out of Engine. Since we intend to move to Segment Replication eventually, we decided not to build extension support for translog fetch operations. We'll instead move these methods to Pluggable Translog. Since the pluggable translog has been moved out of 2.1 release to 3.0, we'll address this when the Pluggable Translog is ready. |
Ideally we want to move to Segment Replication by 3.0 with most logic implemented in the core. |
@nknize @dblock Let me know if you have any concerns. We'll move the method newChangesSnapshotFromTranslogFile to the pluggable translog, which will ship along with Pluggable Translog in one of the 2.x releases. |
🎉 |
No concerns here either. With segrep we will require the |
@Bukhtawar I've listed down the combinations of local replication and translog on leader.
For the next release at least, the changes done in the PR should suffice. Eventually we'll need these two additional pieces of handling on the CCR side:
|
Summarizing the problem statement and overall approach for clarity & visibility. Problem statement: Action Items:
Plan for the action items:
Proposal for dependency on Pluggable Translog: One caveat here is that CCR currently load-balances the requests between primary and replicas on leader for fetching the changes. This might not work with Pluggable Translogs. For example:
Proposed changes:
|
@Bukhtawar Jotting down my thoughts based on our discussion today.
Adding folks for visibility. |
Thanks @ankitkala for the thorough details. Going to focus comments on pluggable translog. The final proposal seems fair. Couple questions
|
Thanks @krishna-ggk
The model could be the same as the current flow. The way I envision this happening is by decoupling the control flow and the data flow. The control flow can stay the way it exists today with the existing security model, such as the follower fetching the leader's checkpoints, while the data flow would latch on directly to the remote store with a new data security model.
I would prefer a decoupled control and data flow to fetch data directly from the remote translog store as described above. For additional context |
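To make the decoupling concrete, here is a minimal sketch of the idea, not the actual CCR or OpenSearch API. `LeaderControlPlane`, `RemoteTranslogStore`, and `fetch_changes` are hypothetical names introduced only for illustration. The follower asks the leader for a checkpoint over the existing control path, then reads the operations up to that checkpoint from a separate data path.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Operation:
    seq_no: int
    payload: str


class LeaderControlPlane:
    """Hypothetical stand-in for the existing, secured leader API (control flow)."""

    def __init__(self, global_checkpoint: int) -> None:
        self._global_checkpoint = global_checkpoint

    def fetch_checkpoint(self) -> int:
        return self._global_checkpoint


class RemoteTranslogStore:
    """Hypothetical stand-in for the leader's remote translog store (data flow)."""

    def __init__(self, operations: List[Operation]) -> None:
        self._ops: Dict[int, Operation] = {op.seq_no: op for op in operations}

    def read(self, from_seq_no: int, to_seq_no: int) -> List[Operation]:
        # Return operations in [from_seq_no, to_seq_no], ordered by sequence number.
        return [
            self._ops[s]
            for s in range(from_seq_no, to_seq_no + 1)
            if s in self._ops
        ]


def fetch_changes(
    control: LeaderControlPlane,
    store: RemoteTranslogStore,
    last_applied_seq_no: int,
) -> List[Operation]:
    # Control flow: learn how far the leader has progressed.
    checkpoint = control.fetch_checkpoint()
    if checkpoint <= last_applied_seq_no:
        return []
    # Data flow: read the missing operations directly from the remote store.
    return store.read(last_applied_seq_no + 1, checkpoint)
```

In this sketch the two paths can carry different credentials, which is where the new data security model would plug in.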
Thanks for expanding @Bukhtawar . Yes, agree on the benefits of directly querying the remote store. Like you pointed out, the security model for the data flow would be key. The abstraction needs to support varied stores with different permission models. This also brings back @ankitkala 's question on whether we need to expose the translog type (Remote/None/Local) or whether we can find a model that works across all of them (the latter preferred, of course). |
Yep. This completely aligns with how I'm also thinking about security for fetching directly from leader's remote store (similar approach for Segments as well).
I'm slightly inclined towards having the translog type exposed but want to see inputs from @Bukhtawar first. One benefit this provides is that CCR can be deterministically aware of the translog type and figure out the mechanism needed to fetch the operations. In case we don't want to expose the translog type from the
|
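As a rough illustration of what exposing the translog type would buy CCR, the sketch below dispatches on the three types mentioned in this thread (Remote/None/Local). `TranslogType`, `select_fetch_mechanism`, and the returned labels are all hypothetical; how the `NONE` case should be handled is an assumption, since the thread does not settle it.

```python
from enum import Enum


class TranslogType(Enum):
    # Hypothetical enum mirroring the Remote/None/Local types discussed above.
    LOCAL = "local"
    REMOTE = "remote"
    NONE = "none"


def select_fetch_mechanism(translog_type: TranslogType) -> str:
    """Pick how the follower would fetch operations, given the leader's translog type."""
    if translog_type is TranslogType.LOCAL:
        # Fetch from the translog on the leader shard, as CCR does today.
        return "leader_shard_translog"
    if translog_type is TranslogType.REMOTE:
        # Fetch directly from the leader shard's remote store.
        return "leader_remote_store"
    # Assumption: with no translog there is nothing to fetch operations from.
    return "no_translog_fetch"
```

With the type exposed, the choice is deterministic on the follower side instead of being inferred from failures.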
Can you please confirm whether we need to fetch operations from the translog on the leader cluster if segrep is enabled? Can we totally avoid the call in that case? |
What is the user experience going to be? Currently, CCR leverages logical replication to copy data from the leader to the follower index, for which it fetches the operations on the leader index from the translog. While the proposed feature does not fundamentally change the experience of CCR, it adheres to better design principles and best practices, which will ensure compatibility with future engine changes. Moving the fetch operation to the pluggable translog provides the opportunity to develop active-active replication, replication between incompatible OpenSearch versions, and upgrades of the leader or follower index without breaking ongoing replication. The ability to fetch operations directly from the translog manager allows us to make CCR agnostic of the replication mechanism used inside the cluster, as we also add support for segment replication, and of the location of the translog. |
If CCR is using SegRep, we won't even fetch operations from the translog. But if CCR is on logical replication (and the leader is on SegRep), we'd still fetch the operations from the translog. |
We've gone ahead with the changes with an assertion that we'd not fetch the changes from the pluggable translog if the leader is using SegRep or Remote Store. This should help in reducing the combinations that we'll be supporting. If we later want to allow a particular combination, we can enable it explicitly. |
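The guard described in this comment can be pictured roughly as follows. This is a sketch under stated assumptions, not the code from the PR: `UnsupportedLeaderConfiguration` and `ensure_translog_fetch_allowed` are hypothetical names, and the two boolean flags stand in for however the leader index settings are actually read.

```python
class UnsupportedLeaderConfiguration(Exception):
    """Raised when logical CCR would have to fetch from an unsupported leader setup."""


def ensure_translog_fetch_allowed(
    leader_uses_segrep: bool,
    leader_uses_remote_store: bool,
) -> None:
    # Reject the combinations that are not supported for now; a combination
    # can be enabled later by relaxing this check explicitly.
    if leader_uses_segrep or leader_uses_remote_store:
        raise UnsupportedLeaderConfiguration(
            "fetching operations from the pluggable translog is not supported "
            "when the leader uses SegRep or Remote Store"
        )
```

Failing fast here keeps the set of supported combinations small and explicit.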
What are you proposing?
We're working on utilising Segment Replication for CCR and plan to make it the default choice for replication. However, we aren't planning to deprecate logical CCR as of now.
To support logical replication in the long run, we propose relying on Pluggable Translog for fetching the operations for CCR (logical). More details here
Here are the key points:
Why support logical replication?
How did you come up with this proposal?
Follow up from opensearch-project/OpenSearch#1100.
What is the user experience going to be?
Cross-cluster replication (CCR) simplifies the process of copying data between multiple clusters. Users can use CCR to enable a remote cluster for the purpose of Disaster Recovery or for data proximity.
Currently, CCR leverages logical replication to copy data from the leader to the follower index, for which it fetches the operations on the leader index from the translog.
Benefits of doing this change
While the proposed feature does not fundamentally change the experience of CCR, it adheres to better design principles and best practices, which will ensure compatibility with future engine changes. Moving the fetch operation to the pluggable translog provides the opportunity to develop active-active replication, replication between incompatible OpenSearch versions, and upgrades of the leader or follower index without breaking ongoing replication.
The ability to fetch operations directly from the translog manager allows us to make CCR agnostic of the replication mechanism used inside the cluster as we also add support for segment replication.
We'll also build support on top of it for remote translog.
Why should it be built? Any reason not to?
This needs to be built so that we can keep supporting logical CCR from 3.0 onwards.
The only reason not to support this would be if we want to rely solely on Segment Replication for CCR.
What will it take to execute?
The changes done in the PR should solve the problem for now.
In the future, when we start relying on remote translogs, we'll need to add support for fetching the operations directly from the leader shard's remote store.
What are remaining open questions?
N/A
Is your feature request related to a problem?
Provide extension point for Tlog fetch operations under OS Engine: