[improve][pip] PIP-327: Support force topic loading for unrecoverable…

… errors (#21752) Co-authored-by: 道君 <[email protected]>
apache · May 14, 2024 · 7468928 · 7468928
1 parent 9f1325a
commit 7468928
Showing 1 changed file with 42 additions and 0 deletions.
diff --git a/pip/pip-327.md b/pip/pip-327.md
@@ -0,0 +1,42 @@
+# PIP-327: Support force topic loading for unrecoverable errors
+
+# Motivation
+
+As discussed in Issue: https://github.com/apache/pulsar/issues/21751
+
+We have introduced a configuration called `autoSkipNonRecoverableData` before open-sourcing Pulsar as we have come across with various situations when it was not possible to recover ledgers belonging to managed-ledger or managed-cursors and the broker was not able to load the topics. In such situations,`autoSkipNonRecoverableData` flag helps to skip non-recoverable leger-recovery errors such as ledger_not_found and allows the broker to load topics by skipping such ledgers in disaster recovery.
+
+Brokers can recognize such non-recoverable errors using bookkeeper error codes but in some cases, it’s very tricky and not possible to conclude non-recoverable errors. For example, the broker can not differentiate between all the ensemble bookies of the ledgers that are temporarily unavailable or are permanently removed from the cluster without graceful recovery, and because of that broker doesn’t consider all the bookies deleted as a non-recoverable error though we can not recover ledgers in such situations where all the bookies are removed due to various reasons such as Dev cluster clean up or system faced data disaster with multiple bookie loss. In such situations, the system admin has to manually identify such non-recoverable topics and update those topics’ managed-ledger and managed-cursor’s metadata and reload topics again which requires a lot of manual effort and sometimes it might not be feasible to handle such situations with a large number of topics that require this manual procedure to fix those topics.
+
+Therefore, the system admin should have a dynamic configuration called `managedLedgerForceRecovery` to use in such situations to allow brokers to forcefully load topics by skipping ledger failures to avoid topic unavailability and perform auto repairs of the topics. This will allow the admin to handle disaster recovery situations in a controlled and automated manner and maintain the topic availability by mitigating such failures. 
+
+
+
+# Goals
+
+Support force topic loading and recovery for unrecoverable situation where broker can skip unrecoverable with uncertain bookkeeper error codes.
+
+
+## Design & Implementation Details
+
+### (1) Broker Changes
+
+Broker will have new configuration `managedLedgerForceRecovery` and if this flag is enabled then managed ledger will ignore any kind of failure if broker see's while recovering managed-ledger or managed-cursor.
+
+# Security Considerations
+<!--
+A detailed description of the security details that ought to be considered for the PIP. This is most relevant for any new HTTP endpoints, new Pulsar Protocol Commands, and new security features. The goal is to describe details like which role will have permission to perform an action.
+
+An important aspect to consider is also multi-tenancy: Does the feature I'm adding have the permissions / roles set in such a way that prevent one tenant accessing another tenant's data/configuration? For example, the Admin API to read a specific message for a topic only allows a client to read messages for the target topic. However, that was not always the case. CVE-2021-41571 (https://github.com/apache/pulsar/wiki/CVE-2021-41571) resulted because the API was incorrectly written and did not properly prevent a client from reading another topic's messages even though authorization was in place. The problem was missing input validation that verified the requested message was actually a message for that topic. The fix to CVE-2021-41571 was input validation. 
+
+If there is uncertainty for this section, please submit the PIP and request for feedback on the mailing list.
+-->
+
+
+# General Notes
+
+# Links
+
+Issue: https://github.com/apache/pulsar/issues/21751
+Discuss thread: https://lists.apache.org/thread/w7w91xztdyy07otw0dh71nl2rn3yy45p
+Vote thread: https://lists.apache.org/thread/hh9t6nz0pqjo7tbfn12nbwtylrvq4f43