dr: adds shadowing docs #1381
base: beta
✅ Deploy Preview for redpanda-docs-preview ready!
📝 Walkthrough
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Admin as Operator
participant Prim as Primary Cluster
participant Shadow as Shadow Cluster
participant Ctrl as Admin API / rpk
participant Sec as Auth/TLS
participant Obs as Monitoring
rect rgb(235, 245, 255)
note over Admin,Ctrl: Configure Shadowing
Admin->>Ctrl: Create shadow link (templates, filters)
Ctrl->>Sec: Authenticate / TLS handshake
Ctrl->>Prim: Apply link config
Prim-->>Shadow: Establish replication channel
end
rect rgb(245, 255, 235)
note over Prim,Shadow: Ongoing Replication (normal ops)
Prim-->>Shadow: Replicate topics/configs/ACLs/schema
Prim-->>Shadow: Preserve offsets/timestamps (where applicable)
Admin->>Ctrl: rpk/admin queries (status/metrics)
Ctrl-->>Obs: Emit metrics/alerts
end
rect rgb(255, 245, 235)
note right of Admin: Planned ops are handled in Shadowing guide
end
sequenceDiagram
autonumber
actor Admin as Operator
participant Prim as Primary Cluster
participant Shadow as Shadow Cluster
participant Ctrl as Admin API / rpk
participant Apps as Applications/Clients
participant Obs as Monitoring
rect rgb(255, 245, 235)
note over Admin,Prim: Emergency Failover Runbook
Admin->>Prim: Assess incident, document state
Admin->>Shadow: Verify readiness/health
Admin->>Ctrl: Initiate failover (full or selective)
Ctrl->>Shadow: Transition shadow links (FAILING_OVER→ACTIVE)
Shadow-->>Obs: Report progress/status
end
rect rgb(245, 255, 235)
note over Apps,Shadow: Post-failover
Admin->>Apps: Update bootstrap/endpoints, TLS/ACLs
Apps->>Shadow: Reconnect and resume traffic
Admin->>Ctrl: Verify topics/consumer groups/offsets
end
alt Issues detected
Obs-->>Admin: Alerts (PAUSED, stuck states, auth failures)
Admin->>Ctrl: Troubleshoot per runbook steps
else Stable
Admin->>Prim: Plan recovery/back-sync later
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
modules/ROOT/nav.adoc (1)
88-91: Add the emergency failover doc to navigation
Shadowing entry looks good. Add a sibling nav item for the emergency runbook so users can find it.
Example:
**** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
**** xref:deploy:redpanda/manual/resilience/shadowing.adoc[Shadowing]
+**** xref:deploy:redpanda/manual/resilience/emergency-shadowing.adoc[Emergency Shadowing Failover]
**** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (2)
290-299: Avoid promoting plaintext secrets in examples
Add a callout suggesting env vars or file-based secrets for credentials (and mTLS certs/keys), not inline plaintext.
Example:
- Prefer env vars (RPK_SASL_PASSWORD) or reference secret files
- Link to security guidance on managing secrets
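To illustrate the env-var and file-based approach suggested in this nitpick, here is a minimal Python sketch of how a wrapper script might resolve a credential without inlining it. `RPK_SASL_PASSWORD` is the variable named in the comment; the `load_secret` helper and its file fallback are hypothetical and not part of rpk.

```python
import os
from pathlib import Path
from typing import Optional


def load_secret(env_var: str, secret_file: Optional[str] = None) -> str:
    """Resolve a credential from the environment, else from a secret file."""
    value = os.environ.get(env_var)
    if value:
        return value
    if secret_file is not None:
        # Strip the trailing newline that editors and secret mounts often add.
        return Path(secret_file).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"{env_var} is not set and no secret file was given")
```

The point of the sketch is only that the password never appears in the documented command or config example; how the secret reaches the environment is left to the deployment.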
38-38: Diagram TODO
If you need help, I can draft a diagram (draw.io/mermaid) showing active→shadow replication, preserved offsets/timestamps, and replicated artifacts.
modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (1)
74-83: Call out irreversibility before executing failover
Add an [IMPORTANT] note that failover promotion is irreversible; no automatic fallback. Place immediately before the commands.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- modules/ROOT/nav.adoc (1 hunks)
- modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (1 hunks)
- modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Redirect rules - redpanda-docs-preview
- GitHub Check: Header rules - redpanda-docs-preview
- GitHub Check: Pages changed - redpanda-docs-preview
🔇 Additional comments (6)
modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (2)
6-10: Enterprise license note is consistent; LGTM
Keep this partial include at the top across both docs for consistency.
48-56: Verify `rpk shadow` subcommands and flags
Confirm that `rpk shadow list`, `status`, `failover`, `delete`, `resume` and their flags (`--all`, `--topic`, `--no-confirm`) used in `emergency-shadowing.adoc` (and the corresponding sections in `shadowing.adoc`) match the current output of `rpk shadow --help`.
modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (4)
330-425: Verify ShadowLinkConfig schema alignment
Ensure the YAML example’s field names (client_options, authentication_configuration, topic_metadata_sync_options, synced_shadow_topic_properties, consumer_offset_sync_options, security_sync_options) exactly match the ShadowLinkConfig schema in the Admin API or rpk CLI.
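One way to run this check mechanically is sketched below, assuming the YAML example has already been loaded into a Python dict. The field names are the ones quoted in this comment; the helper names are hypothetical, and the sketch only checks that each name appears somewhere in the structure because the actual nesting of ShadowLinkConfig is not stated here.

```python
from typing import Any, Set

# Field names quoted in the review comment; their nesting is not assumed.
EXPECTED_FIELDS = {
    "client_options",
    "authentication_configuration",
    "topic_metadata_sync_options",
    "synced_shadow_topic_properties",
    "consumer_offset_sync_options",
    "security_sync_options",
}


def collect_keys(node: Any) -> Set[str]:
    """Gather every mapping key found anywhere in a nested structure."""
    keys: Set[str] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            keys.add(key)
            keys |= collect_keys(value)
    elif isinstance(node, list):
        for item in node:
            keys |= collect_keys(item)
    return keys


def missing_fields(config: dict) -> Set[str]:
    """Return expected field names that do not appear in the config."""
    return EXPECTED_FIELDS - collect_keys(config)
```

A non-empty result points at a name that was misspelled or dropped in the example; it does not prove the remaining names sit at the right level of the schema.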
54-57: Verify and cite Shadowing’s minimum version requirement
- Confirm that Shadowing was introduced in Redpanda v25.3 and update the prerequisite if needed.
- Add a link to the official v25.3 release notes or product specification where this requirement is defined.
557-576: Confirm shadow-link metrics are documented and standardize type/units
Verify that each `redpanda_shadow_link_*` metric appears in modules/reference/pages/public-metrics-reference.adoc and update every description to explicitly specify the Prometheus type (counter vs gauge) and units (bytes, records, offsets).
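As a rough aid for the type half of this check, the following Python sketch scans Prometheus text-format output for samples with the `redpanda_shadow_link_` prefix that have no matching `# TYPE` line. The function name is hypothetical, and it assumes the classic text exposition format, where the `# TYPE` line carries the same name as the sample.

```python
from typing import Set


def untyped_shadow_metrics(
    exposition: str, prefix: str = "redpanda_shadow_link_"
) -> Set[str]:
    """Return prefixed metric names that have samples but no # TYPE line."""
    typed: Set[str] = set()
    seen: Set[str] = set()
    for raw in exposition.splitlines():
        line = raw.strip()
        if line.startswith("# TYPE "):
            parts = line.split()
            if len(parts) >= 4:
                typed.add(parts[2])  # "# TYPE <name> <type>"
        elif line and not line.startswith("#"):
            # Sample lines look like: name{labels} value  or  name value
            name = line.split("{", 1)[0].split()[0]
            if name.startswith(prefix):
                seen.add(name)
    return seen - typed
```

Units still need a manual read of each description, since the exposition format does not encode them.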
231-237: Verify `rpk shadow config generate` exists and `--output` flag
Confirm this subcommand and its `--output` flag are implemented in the CLI; update the docs if they’re missing.
Redpanda system topics have specific filtering restrictions: | ||
|
||
* Literal filters for `__consumer_offsets` and `_redpanda.audit_log` are rejected | ||
* Prefix filters for topics starting with `_redpanda` or `__redpanda` are rejected | ||
* Wildcard `*` filters will not match topics that start with `_redpanda` or `__redpanda` | ||
* To shadow specific system topics, you must provide explicit literal filters for those individual topics | ||
|
Contradictory system-topic filter guidance
You state literal filters for __consumer_offsets and _redpanda.audit_log are rejected, then say to shadow specific system topics use explicit literal filters. These conflict.
Please clarify which are allowed. If intent is “most _redpanda* topics cannot be matched by wildcard/prefix; only specific allowed system topics may be explicitly literal-included,” adjust bullets accordingly. Proposed text:
- Wildcard (*) never matches topics starting with _redpanda or __redpanda
- Prefix filters for _redpanda*/__redpanda* are rejected
- Only the following system topics may be shadowed via explicit literal filters: ; others are rejected
- __consumer_offsets: [allowed|rejected] with literal filter (pick one and be consistent)
🤖 Prompt for AI Agents
In modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc around lines
191–197, the bullets about system-topic filter behavior are contradictory (they
say literal filters for __consumer_offsets and _redpanda.audit_log are rejected,
then say to shadow specific system topics use explicit literal filters); update
the text to a single, consistent set of bullets that matches the reviewer’s
proposed wording: state that wildcard (*) never matches topics starting with
_redpanda or __redpanda, that prefix filters for _redpanda*/__redpanda* are
rejected, list exactly which system topics (if any) may be shadowed via explicit
literal filters (or state “none” if none allowed), and explicitly indicate
whether __consumer_offsets (and _redpanda.audit_log) are allowed or rejected
when using a literal filter so the guidance is unambiguous.
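To make the ambiguity concrete, here is a minimal Python sketch of one possible reading of the quoted bullets. It is not Redpanda's implementation: the function names, the `kind` values (`literal`, `prefix`, `wildcard`), and the topic name `_redpanda.example` are all hypothetical, and the sketch assumes that only the two named literals are rejected.

```python
SYSTEM_PREFIXES = ("_redpanda", "__redpanda")
REJECTED_LITERALS = {"__consumer_offsets", "_redpanda.audit_log"}


def filter_is_accepted(kind: str, name: str) -> bool:
    """Return True if a filter would be accepted under this reading."""
    if kind == "wildcard":
        return True
    if kind == "literal":
        return name not in REJECTED_LITERALS
    if kind == "prefix":
        # Prefix filters that target system-topic namespaces are rejected.
        return not name.startswith(SYSTEM_PREFIXES)
    raise ValueError(f"unknown filter kind: {kind}")


def filter_matches(kind: str, name: str, topic: str) -> bool:
    """Return True if an accepted filter selects the given topic."""
    if not filter_is_accepted(kind, name):
        return False
    if kind == "wildcard":
        # The wildcard never selects system topics.
        return not topic.startswith(SYSTEM_PREFIXES)
    if kind == "literal":
        return topic == name
    return topic.startswith(name)
```

Under this reading, a literal filter for some other system topic is accepted while the two named literals are not, which is the distinction the doc text should state explicitly.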
@paulohtb6 I have a hard time finding these changes in https://deploy-preview-1381--redpanda-docs-preview.netlify.app/current/get-started/intro-to-events/ (can you please point me to the exact URL?).
@bharathv Hey Bharath. Changes are in the page previews section on the PR description. Copying them here too.
Co-authored-by: Michele Cyran <[email protected]>
@@ -0,0 +1,212 @@ | |||
= Shadowing Runbook |
@paulohtb6 I think the page should be renamed too, so "emergency" is not in the URL. Also, the term runbook feels internal to me. What do you think about Failover for Disaster Recovery or Disaster Recovery Guide? cc @Feediver1
Changed to "Shadowing Guide". Let me know if it's ok or if I should change more.
initial round of comments.
include::shared:partial$enterprise-license.adoc[] | ||
==== | ||
|
||
Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The replication stream ensures that the shadow cluster receives exact copies of the source cluster data, including offsets, timestamps, and cluster metadata. This creates a read-only shadow cluster that you can quickly promote to handle production traffic during a disaster. |
It works with Kafka too (which is one of the goals of the project), not sure if the wording is intentional.
edit: after reading further it appears that we are only documenting the requirements for 25.3, perhaps the wording should indicate they are of temporary nature somehow and will be lifted in the subsequent versions. In this case we envision this to be a migration tool too, so soon-ish users will be able to run this against Kafka.
Similar to #1381 (comment) we want to document what's out there today.
Shadowing is designed for active-passive disaster recovery scenarios. Each shadow cluster can maintain only one shadow link. | ||
|
||
Shadowing operates exclusively in asynchronous mode and doesn't support active-active configurations. This means there will always be some replication lag. You cannot write to both clusters simultaneously. | ||
|
||
xref:develop:data-transforms/index.adoc[Data transforms] are disabled on shadow clusters while Shadowing is active. During a disaster, xref:manage:audit-logging.adoc[audit log] history from the source cluster is lost, though the shadow cluster begins generating new audit logs immediately after promotion. | ||
|
||
After you promote shadow topics with a failover, automatic fallback to the original source cluster is not supported. |
Some of the limitations are temporary, for example "only one shadow link" and "automatic fallback to the original source cluster is not supported." Perhaps the wording should convey the temporary nature of these limitations.
In docs we mostly document what's available today. We don't make commitments for future releases. Whenever those limitations are lifted from core, we can revisit that and remove them from that list.
The shadow link itself has a simple state model: | ||
|
||
* **`ACTIVE`**: Shadow link is operating normally, replicating data | ||
* **`PAUSED`**: Shadow link is temporarily stopped but not failed over |
PAUSED is not implemented (yet), wonder if we should remove this for now.
=== Shadow link in PAUSED state | ||
|
||
**Problem**: Shadow link shows `PAUSED` instead of `ACTIVE` | ||
|
PAUSED is not implemented yet (it's impossible for the link to be in this state for now).
|
||
**Problem**: Topics remain in `FAILING_OVER` state for extended periods | ||
|
||
**Solution**: Check shadow cluster logs for errors and ensure sufficient resources (CPU, memory, disk) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes. |
Also ensure there are no leaderless shadow topic partitions and that the controller is not leaderless or under-replicated.
|
||
Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The replication stream ensures that the shadow cluster receives exact copies of the source cluster data, including offsets, timestamps, and cluster metadata. This creates a read-only shadow cluster that you can quickly promote to handle production traffic during a disaster. | ||
|
||
[IMPORTANT] |
Might look better to keep the admonition but change IMPORTANT to Experiencing an active disaster as the heading; e.g., see here
Unlike traditional replication tools that re-produce messages, Shadowing copies data at the byte level, ensuring shadow topics contain identical copies of source topics with preserved offsets and timestamps. | ||
|
||
Shadowing replicates: | ||
|
I'd like to limit links to only what is necessary or helpful; i.e., if users don't know anything about consumer offsets or schema registry, then I think they should search for info, because a link here just distracts them away from the reason they're on this page
[INFO] | ||
==== | ||
Redpanda Data recommends that you don't modify synced topic properties on shadow topics. These properties revert to the source topic values. |
[INFO] | |
==== | |
Redpanda Data recommends that you don't modify synced topic properties on shadow topics. These properties revert to the source topic values. | |
[CAUTION] | |
==== | |
Do not modify synced topic properties on shadow topics. These properties revert to source topic values. |
@paulohtb6 I think you need a single entry in the navtree: Shadowing. |
|
||
[IMPORTANT] | ||
==== | ||
This is an emergency procedure. For planned failover testing or regular shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs. |
This is an emergency procedure. For planned failover testing or regular shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs. | |
This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs. |
|
||
== Emergency failover procedure | ||
|
||
Follow these steps in order during an active disaster: |
Follow these steps in order during an active disaster: | |
Follow these steps during an active disaster: |
The fact that they are numbered steps shows they are sequential.
rpk cluster info --brokers <shadow-cluster-brokers> | ||
---- | ||
|
||
**Decision point**: If the primary cluster is responsive, consider whether failover is actually needed. Partial outages may not require full disaster recovery. |
Can you offer additional guidance re what type of partial outages require full DR and which would not? Perhaps list a couple of examples that illustrate each?
# List all shadow links | ||
rpk shadow list | ||
# Check status of your disaster recovery link |
# Check status of your disaster recovery link | |
# Check the status of your disaster recovery link |
rpk shadow status <disaster-recovery-link-name> | ||
---- | ||
|
||
Verify these conditions before proceeding with failover: |
Verify these conditions before proceeding with failover: | |
Verify that the following conditions exist before proceeding with failover: |
Also, can you show an example of this output?
rpk shadow status <disaster-recovery-link-name> > failover-status-$(date +%Y%m%d-%H%M%S).log | ||
---- | ||
|
||
IMPORTANT: Note the replication lag to estimate potential data loss. |
How much lag is acceptable/unacceptable? I know you say it depends on your RPO requirements, but can you offer users a specific example?
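One way to give users a concrete example is to show the arithmetic that turns observed lag into an RPO comparison. The Python sketch below assumes lag can be expressed as per-partition offset differences between source and shadow, and that an approximate produce rate is known. Both assumptions, the function names, and every number are hypothetical, not values reported by `rpk shadow status`.

```python
from typing import Dict, Tuple


def replication_lag(offsets: Dict[int, Tuple[int, int]]) -> int:
    """Sum per-partition lag from (source_offset, shadow_offset) pairs."""
    return sum(max(0, source - shadow) for source, shadow in offsets.values())


def seconds_behind(lag_records: int, produce_rate_per_sec: float) -> float:
    """Convert a record lag into approximate seconds of unreplicated data."""
    if produce_rate_per_sec <= 0:
        raise ValueError("produce rate must be positive")
    return lag_records / produce_rate_per_sec


# Hypothetical numbers: partition 0 is 10 records behind, partition 1 is caught up.
lag = replication_lag({0: (1000, 990), 1: (500, 500)})
behind = seconds_behind(lag, produce_rate_per_sec=5.0)
within_rpo = behind <= 5.0  # Compare against a hypothetical 5-second RPO
```

With these inputs the shadow is 10 records, or about 2 seconds, behind, which would sit inside a 5-second RPO. The doc could present a similar worked case without committing to a universal threshold.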
[[initiate-failover]] | ||
=== Initiate failover | ||
|
||
For complete cluster failover (recommended during disasters): |
For complete cluster failover (recommended during disasters): | |
If your data shows that you require a complete cluster failover (recommended during disasters): |
|
||
[,bash] | ||
---- | ||
# Fail over all topics in the shadow link |
Output?
rpk shadow failover <disaster-recovery-link-name> --all --no-confirm | ||
---- | ||
|
||
For selective topic failover (if only specific services are affected): |
For selective topic failover (if only specific services are affected): | |
For selective topic failover (when only specific services are affected): |
---- | ||
# Monitor status until all topics show PROMOTED | ||
watch -n 5 "rpk shadow status <disaster-recovery-link-name>" | ||
---- |
Won't mark in each step, but setting up a specific example throughout this workflow and then showing all the data that matches that example would be a nice user experience with docs--give users the validation they are likely seeking during a potentially stressful situation.
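As a sketch of what such a running example could validate, the Python snippet below decides whether a failover is complete from a mapping of topic name to state. The `PROMOTED` and `FAILING_OVER` state names come from the quoted doc text; the function, the mapping shape, and the topic names are hypothetical, and parsing real `rpk shadow status` output is deliberately left out because its format is not shown in this thread.

```python
from typing import Dict, List, Tuple


def summarize_failover(topic_states: Dict[str, str]) -> Tuple[bool, List[str]]:
    """Return (done, pending), where pending lists topics not yet PROMOTED."""
    pending = sorted(t for t, s in topic_states.items() if s != "PROMOTED")
    # An empty mapping means nothing was observed, so do not report success.
    done = bool(topic_states) and not pending
    return done, pending
```

A worked example in the docs could show the same two topics at each step, first with one still in `FAILING_OVER` and then with both `PROMOTED`, so readers can compare their own output against it.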
[[cleanup-stabilize]] | ||
=== Clean up and stabilize | ||
|
||
Once applications are running normally: |
Once applications are running normally: | |
After all applications are running normally: |
|
||
**Solution**: Check shadow cluster logs for errors and ensure sufficient resources (CPU, memory, disk) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes. | ||
|
||
If topics remain stuck and you need to fail over everything immediately, you can force delete the shadow link, which will promote all topics: |
If topics remain stuck and you need to fail over everything immediately, you can force delete the shadow link, which will promote all topics: | |
If topics remain stuck and you need to failover everything immediately, you can force delete the shadow link, which will promote all topics: |
Failover: This is a technical term used in IT and disaster recovery contexts. It describes the process of automatically or manually switching to a standby system, server, or network upon the failure of the primary system. Failover ensures continuity of service and minimizes downtime during outages. For example, if a server goes down, the system can failover to a backup server to maintain operations.
Fail Over: This term is often used informally to describe the action of switching from one system to another. However, it is not as widely recognized or used in formal documentation as "failover." In many cases, "fail over" may be seen as a less precise way to describe the same process.
I understand why you put "fail over" in certain contexts throughout, but every time I see it followed by "failover" I find it distracting, or just an inconsistency (which can undermine a reader's confidence). Pretty minor/trivial, but thinking maybe just use failover throughout. At least that is Copilot's guidance. Mine too.
|
||
**Problem**: Applications cannot connect to shadow cluster after failover | ||
|
||
**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for shadow cluster and test network connectivity from application hosts. |
**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for shadow cluster and test network connectivity from application hosts. | |
**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for the shadow cluster and test network connectivity from application hosts. |
|
||
**Problem**: Consumers start from beginning or wrong positions | ||
|
||
**Solution**: Verify consumer group offsets were replicated (check your filters) and use `rpk group describe <group-name>` to check offset positions. If necessary, manually reset offsets to appropriate positions. |
Can you add a link to the doc they can follow for manually resetting consumer group offsets?
4. **Reverse replication**: Consider setting up shadowing in the opposite direction | ||
|
||
== Post-incident actions | ||
|
After identifying the cause and resolving the cluster failure, resume your regular disaster recovery planning tasks, which should include:
|
||
See also: link:/api/doc/admin/[Admin API v2 reference^] | ||
|
||
== Failover |
Add your note from the beginning of the doc here--users don't always read from start to finish, and we wouldn't want them to miss it:
Experiencing an active disaster? See Emergency Shadowing Failover for immediate step-by-step emergency procedures.
== How Shadowing fits into disaster recovery | ||
|
||
Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery. |
Please add either a link to definitions, or glossary definition for RPO and RTO
|
||
Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery. | ||
|
||
The architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster. |
The architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster. | |
Redpanda's disaster recovery architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster. |
Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters. | ||
|
||
While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss. |
Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters. | |
While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss. | |
Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters. While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss. |
|
||
After you promote shadow topics with a failover, automatic fallback to the original source cluster is not supported. | ||
|
||
[INFO] |
I've not come across this Info note format in our docs. It lacks specificity: is it a tip? Important reminder? Caution? Or just a run-of-the-mill note (pay particular attention to this point)? Consider another note type so users clearly understand context/importance.
|
||
- You must have xref:get-started:licensing/overview.adoc[Enterprise Edition] licenses on both clusters. | ||
|
||
- Both clusters must run Redpanda version 25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs version 25.3, the shadow cluster can run 25.3 or 26.1, but not 27.1. |
- Both clusters must run Redpanda version 25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs version 25.3, the shadow cluster can run 25.3 or 26.1, but not 27.1. | |
- Both clusters must be running Redpanda v25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs v25.3, the shadow cluster can run v25.3 or v26.1, but not v27.1. |
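The version rule in this prerequisite can be checked mechanically. The Python sketch below is one reading of it, under two labeled assumptions: that there are three feature releases per year (inferred from the example, where 26.1 directly follows 25.3), and that a shadow cluster older than its source is not allowed. The function names are hypothetical.

```python
RELEASES_PER_YEAR = 3  # Assumption inferred from the 25.3 -> 26.1 example
MIN_VERSION = "25.3"   # Minimum stated in the prerequisite


def feature_index(version: str) -> int:
    """Map 'v25.3' or '25.3.1' to a sequential feature-release number."""
    year, feature = (int(part) for part in version.lstrip("v").split(".")[:2])
    return year * RELEASES_PER_YEAR + (feature - 1)


def shadow_version_ok(source: str, shadow: str) -> bool:
    """Return True if the shadow version is allowed for the source version."""
    minimum = feature_index(MIN_VERSION)
    src, shd = feature_index(source), feature_index(shadow)
    if src < minimum or shd < minimum:
        return False
    # Same feature release, or exactly one feature release ahead.
    return 0 <= shd - src <= 1
```

With a v25.3 source, this accepts v25.3 and v26.1 and rejects v27.1, matching the example in the suggested text.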
|
||
To set up Shadowing: | ||
|
||
* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration |
* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration | |
* **Understand replication behavior**: Before you set up Shadowing, be sure to review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration |
To set up Shadowing: | ||
|
||
* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration | ||
* **Configure filters**: Define which topics, consumer groups, and ACLs should replicate by creating include/exclude patterns that match your disaster recovery requirements |
Link to "Set filters" below.
|
||
* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration | ||
* **Configure filters**: Define which topics, consumer groups, and ACLs should replicate by creating include/exclude patterns that match your disaster recovery requirements | ||
* **Create a shadow link**: Establish the connection between clusters using `rpk`, the Admin API, or Redpanda Console with authentication and network settings |
Link to this task below
Description
Adds Shadowing docs.
Adds emergency runbook.
Resolves https://redpandadata.atlassian.net/browse/DOC-1665
Review deadline: Oct 17th
Page previews
Shadowing
Shadowing guide
Checks