Collaborator

@paulohtb6 paulohtb6 commented Oct 8, 2025

Description

Adds Shadowing docs.
Adds emergency runbook.

Resolves https://redpandadata.atlassian.net/browse/DOC-1665
Review deadline: Oct 17th

Page previews

Shadowing
Shadowing guide

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)

netlify bot commented Oct 8, 2025

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 19a79d1
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/68f1a9d925a2600009a0456c
😎 Deploy Preview https://deploy-preview-1381--redpanda-docs-preview.netlify.app

Contributor

coderabbitai bot commented Oct 8, 2025

📝 Walkthrough

Walkthrough

  • Added a navigation entry for a new Shadowing guide under Redpanda deployment manual.
  • Introduced a comprehensive Shadowing documentation page covering architecture, scope, prerequisites, setup, configuration, filtering, monitoring, failover behavior, and best practices, with CLI/Admin API examples.
  • Added an emergency runbook page for disaster failover of Shadow Links, including assessment, verification, failover execution (cluster-wide or selective), monitoring, app reconfiguration, troubleshooting, recovery, and post-incident steps.
  • Included enterprise licensing cross-reference in the emergency guide.
  • No changes to exported/public code entities.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Admin as Operator
  participant Prim as Primary Cluster
  participant Shadow as Shadow Cluster
  participant Ctrl as Admin API / rpk
  participant Sec as Auth/TLS
  participant Obs as Monitoring

  rect rgb(235, 245, 255)
  note over Admin,Ctrl: Configure Shadowing
  Admin->>Ctrl: Create shadow link (templates, filters)
  Ctrl->>Sec: Authenticate / TLS handshake
  Ctrl->>Prim: Apply link config
  Prim-->>Shadow: Establish replication channel
  end

  rect rgb(245, 255, 235)
  note over Prim,Shadow: Ongoing Replication (normal ops)
  Prim-->>Shadow: Replicate topics/configs/ACLs/schema
  Prim-->>Shadow: Preserve offsets/timestamps (where applicable)
  Admin->>Ctrl: rpk/admin queries (status/metrics)
  Ctrl-->>Obs: Emit metrics/alerts
  end

  rect rgb(255, 245, 235)
  note right of Admin: Planned ops are handled in Shadowing guide
  end
sequenceDiagram
  autonumber
  actor Admin as Operator
  participant Prim as Primary Cluster
  participant Shadow as Shadow Cluster
  participant Ctrl as Admin API / rpk
  participant Apps as Applications/Clients
  participant Obs as Monitoring

  rect rgb(255, 245, 235)
  note over Admin,Prim: Emergency Failover Runbook
  Admin->>Prim: Assess incident, document state
  Admin->>Shadow: Verify readiness/health
  Admin->>Ctrl: Initiate failover (full or selective)
  Ctrl->>Shadow: Transition shadow links (FAILING_OVER→ACTIVE)
  Shadow-->>Obs: Report progress/status
  end

  rect rgb(245, 255, 235)
  note over Apps,Shadow: Post-failover
  Admin->>Apps: Update bootstrap/endpoints, TLS/ACLs
  Apps->>Shadow: Reconnect and resume traffic
  Admin->>Ctrl: Verify topics/consumer groups/offsets
  end

  alt Issues detected
    Obs-->>Admin: Alerts (PAUSED, stuck states, auth failures)
    Admin->>Ctrl: Troubleshoot per runbook steps
  else Stable
    Admin->>Prim: Plan recovery/back-sync later
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly conveys the main change by stating that shadowing documentation is being added and does not include extraneous details, making it clear and focused on the core update. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
| Description Check | ✅ Passed | The pull request description follows the required template structure with all necessary sections completed. The Description section includes the JIRA ticket reference (DOC-1665) and review deadline (Oct 17th), clearly stating what is being added (Shadowing docs and emergency runbook). The Page previews section provides two properly formatted preview links following the specified pattern. The Checks section includes the required checkboxes with "New feature" appropriately selected. The description is concise and complete, providing sufficient context for reviewers to understand the scope of the changes. |

@paulohtb6 paulohtb6 marked this pull request as ready for review October 15, 2025 02:44
@paulohtb6 paulohtb6 requested a review from a team as a code owner October 15, 2025 02:44
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
modules/ROOT/nav.adoc (1)

88-91: Add the emergency failover doc to navigation

Shadowing entry looks good. Add a sibling nav item for the emergency runbook so users can find it.

Example:

 **** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
 **** xref:deploy:redpanda/manual/resilience/shadowing.adoc[Shadowing]
+**** xref:deploy:redpanda/manual/resilience/emergency-shadowing.adoc[Emergency Shadowing Failover]
 **** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (2)

290-299: Avoid promoting plaintext secrets in examples

Add a callout suggesting env vars or file-based secrets for credentials (and mTLS certs/keys), not inline plaintext.

Example:

  • Prefer env vars (RPK_SASL_PASSWORD) or reference secret files
  • Link to security guidance on managing secrets
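
A minimal sketch of the env-var-or-file pattern suggested above, in Python. `RPK_SASL_PASSWORD` is the variable named in this comment; the `load_sasl_password` helper and its `secret_file` argument are hypothetical and are not part of rpk or the Redpanda docs.

```python
import os
from pathlib import Path
from typing import Optional


def load_sasl_password(secret_file: Optional[str] = None) -> str:
    """Return the SASL password from the environment or from a secret file."""
    # Prefer the environment variable so the value never appears in a config file.
    password = os.environ.get("RPK_SASL_PASSWORD")
    if password:
        return password
    if secret_file is not None:
        # Secret mounts and editors usually append a trailing newline; strip it.
        return Path(secret_file).read_text(encoding="utf-8").strip()
    raise RuntimeError("no SASL password in RPK_SASL_PASSWORD or a secret file")
```

A docs callout could show the same idea without any inline credential: the example reads the secret at run time instead of embedding it.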

38-38: Diagram TODO

If you need help, I can draft a diagram (draw.io/mermaid) showing active→shadow replication, preserved offsets/timestamps, and replicated artifacts.

modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (1)

74-83: Call out irreversibility before executing failover

Add an [IMPORTANT] note that failover promotion is irreversible; no automatic fallback. Place immediately before the commands.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 464513b and dafed89.

📒 Files selected for processing (3)
  • modules/ROOT/nav.adoc (1 hunks)
  • modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (1 hunks)
  • modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Redirect rules - redpanda-docs-preview
  • GitHub Check: Header rules - redpanda-docs-preview
  • GitHub Check: Pages changed - redpanda-docs-preview
🔇 Additional comments (6)
modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (2)

6-10: Enterprise license note is consistent; LGTM

Keep this partial include at the top across both docs for consistency.


48-56: Verify rpk shadow subcommands and flags: Confirm that rpk shadow list, status, failover, delete, resume and their flags (--all, --topic, --no-confirm) used in emergency-shadowing.adoc (and the corresponding sections in shadowing.adoc) match the current output of rpk shadow --help.
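
One way to make this check repeatable is to extract every `rpk shadow` subcommand and long flag that the docs mention and compare them with a list taken from the real `rpk shadow --help` output. The sketch below is a hedged illustration: the function names are hypothetical, and the `supported` set must be filled in by hand from the actual CLI, since its contents are not confirmed anywhere in this thread.

```python
from __future__ import annotations

import re

# Matches "rpk shadow <subcommand>" and captures the subcommand word.
SUBCOMMAND_RE = re.compile(r"\brpk shadow ([a-z][a-z-]*)")
# Matches long flags such as --all; ignores AsciiDoc "----" block delimiters.
FLAG_RE = re.compile(r"(?<![\w-])--([a-z][a-z-]*)")


def extract_usage(doc_text: str) -> tuple[set[str], set[str]]:
    """Return the subcommands and long flags mentioned in a docs page."""
    subcommands = set(SUBCOMMAND_RE.findall(doc_text))
    flags = {f"--{name}" for name in FLAG_RE.findall(doc_text)}
    return subcommands, flags


def undocumented(used: set[str], supported: set[str]) -> set[str]:
    """Return names the docs use that the CLI help does not list."""
    return used - supported
```

The flag scan covers the whole page, so flags that belong to other commands are reported too and need a manual look.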

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (4)

330-425: Verify ShadowLinkConfig schema alignment
Ensure the YAML example’s field names (client_options, authentication_configuration, topic_metadata_sync_options, synced_shadow_topic_properties, consumer_offset_sync_options, security_sync_options) exactly match the ShadowLinkConfig schema in the Admin API or rpk CLI.


54-57: Verify and cite Shadowing’s minimum version requirement

  • Confirm that Shadowing was introduced in Redpanda v25.3 and update the prerequisite if needed.
  • Add a link to the official v25.3 release notes or product specification where this requirement is defined.

557-576: Confirm shadow-link metrics are documented and standardize type/units
Verify that each redpanda_shadow_link_* metric appears in modules/reference/pages/public-metrics-reference.adoc and update every description to explicitly specify the Prometheus type (counter vs gauge) and units (bytes, records, offsets).


231-237: Verify rpk shadow config generate exists and --output flag

Confirm this subcommand and its --output flag are implemented in the CLI; update the docs if they’re missing.

Comment on lines 191 to 197
Redpanda system topics have specific filtering restrictions:

* Literal filters for `__consumer_offsets` and `_redpanda.audit_log` are rejected
* Prefix filters for topics starting with `_redpanda` or `__redpanda` are rejected
* Wildcard `*` filters will not match topics that start with `_redpanda` or `__redpanda`
* To shadow specific system topics, you must provide explicit literal filters for those individual topics

Contributor

⚠️ Potential issue | 🔴 Critical

Contradictory system-topic filter guidance

You state literal filters for __consumer_offsets and _redpanda.audit_log are rejected, then say to shadow specific system topics use explicit literal filters. These conflict.

Please clarify which are allowed. If intent is “most _redpanda* topics cannot be matched by wildcard/prefix; only specific allowed system topics may be explicitly literal-included,” adjust bullets accordingly. Proposed text:

  • Wildcard (*) never matches topics starting with _redpanda or __redpanda
  • Prefix filters for _redpanda*/__redpanda* are rejected
  • Only the following system topics may be shadowed via explicit literal filters: ; others are rejected
  • __consumer_offsets: [allowed|rejected] with literal filter (pick one and be consistent)
🤖 Prompt for AI Agents
In modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc around lines
191–197, the bullets about system-topic filter behavior are contradictory (they
say literal filters for __consumer_offsets and _redpanda.audit_log are rejected,
then say to shadow specific system topics use explicit literal filters); update
the text to a single, consistent set of bullets that matches the reviewer’s
proposed wording: state that wildcard (*) never matches topics starting with
_redpanda or __redpanda, that prefix filters for _redpanda*/__redpanda* are
rejected, list exactly which system topics (if any) may be shadowed via explicit
literal filters (or state “none” if none allowed), and explicitly indicate
whether __consumer_offsets (and _redpanda.audit_log) are allowed or rejected
when using a literal filter so the guidance is unambiguous.
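
To make the ambiguity concrete, here is a hedged Python sketch of one possible reading of the four quoted bullets. It is not Redpanda's implementation: the function names are hypothetical, and the rule that prefix filters never match system topics is an assumption drawn from the last bullet. The bullets do not say whether `*` matches `__consumer_offsets`, and the sketch does not settle that.

```python
SYSTEM_PREFIXES = ("_redpanda", "__redpanda")
REJECTED_LITERALS = frozenset({"__consumer_offsets", "_redpanda.audit_log"})


def validate_filter(pattern_type: str, name: str) -> None:
    """Raise ValueError if the filter is rejected under the quoted bullets."""
    if pattern_type not in ("literal", "prefix"):
        raise ValueError(f"unknown pattern type: {pattern_type}")
    if pattern_type == "literal" and name in REJECTED_LITERALS:
        raise ValueError(f"literal filter for {name} is rejected")
    if pattern_type == "prefix" and name.startswith(SYSTEM_PREFIXES):
        raise ValueError(f"prefix filter {name} targets system topics")


def filter_matches(pattern_type: str, name: str, topic: str) -> bool:
    """Return True if a valid filter selects the topic."""
    validate_filter(pattern_type, name)
    is_system = topic.startswith(SYSTEM_PREFIXES)
    if pattern_type == "literal":
        if name == "*":
            # The wildcard skips topics that start with a system prefix.
            return not is_system
        return topic == name
    # Assumption: system topics are only reachable through explicit literals.
    return topic.startswith(name) and not is_system
```

Under this reading a literal filter for `_redpanda.audit_log` is rejected even though the last bullet seems to invite it, which is the contradiction the docs need to resolve.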

@paulohtb6 paulohtb6 changed the base branch from main to beta October 16, 2025 15:16
@bharathv

@paulohtb6 I have a hard time finding these changes in https://deploy-preview-1381--redpanda-docs-preview.netlify.app/current/get-started/intro-to-events/ (can you please point me to the exact URL).

@paulohtb6
Collaborator Author

@bharathv Hey Bharath. Changes are in the page previews section on the PR description.

Copying them here too
Page previews
Shadowing
Shadowing runbook

Co-authored-by: Michele Cyran <[email protected]>
@@ -0,0 +1,212 @@
= Shadowing Runbook
Contributor

@paulohtb6 I think the page should be renamed too, so "emergency" is not in the URL. Also, the term runbook feels internal to me. What do you think about Failover for Disaster Recovery or Disaster Recovery Guide? cc @Feediver1

Collaborator Author

Changed to "Shadowing Guide". Let me know if it's ok or if I should change more.

@bharathv bharathv left a comment

initial round of comments.

include::shared:partial$enterprise-license.adoc[]
====

Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The replication stream ensures that the shadow cluster receives exact copies of the source cluster data, including offsets, timestamps, and cluster metadata. This creates a read-only shadow cluster that you can quickly promote to handle production traffic during a disaster.

It works with Kafka too (which is one of the goals of the project), not sure if the wording is intentional.

edit: after reading further it appears that we are only documenting the requirements for 25.3, perhaps the wording should indicate they are of temporary nature somehow and will be lifted in the subsequent versions. In this case we envision this to be a migration tool too, so soon-ish users will be able to run this against Kafka.

Collaborator Author

Similar to #1381 (comment) we want to document what's out there today.

Comment on lines +42 to +48
Shadowing is designed for active-passive disaster recovery scenarios. Each shadow cluster can maintain only one shadow link.

Shadowing operates exclusively in asynchronous mode and doesn't support active-active configurations. This means there will always be some replication lag. You cannot write to both clusters simultaneously.

xref:develop:data-transforms/index.adoc[Data transforms] are disabled on shadow clusters while Shadowing is active. During a disaster, xref:manage:audit-logging.adoc[audit log] history from the source cluster is lost, though the shadow cluster begins generating new audit logs immediately after promotion.

After you promote shadow topics with a failover, automatic fallback to the original source cluster is not supported.

Some of the limitations are temporary. For example "only one shadow link", "automatic fallback to the original source cluster is not supported. ", perhaps the wording should convey the temporary nature of these limitations.

Collaborator Author

In docs we mostly document what's available today. We don't make commitments for future releases. Whenever those limitations are lifted from core, we can revisit that and remove them from that list.

The shadow link itself has a simple state model:

* **`ACTIVE`**: Shadow link is operating normally, replicating data
* **`PAUSED`**: Shadow link is temporarily stopped but not failed over

PAUSED is not implemented (yet), wonder if we should remove this for now.

Comment on lines +149 to +152
=== Shadow link in PAUSED state

**Problem**: Shadow link shows `PAUSED` instead of `ACTIVE`

PAUSED is not implemented yet (it's impossible for the link to be in this state for now).


**Problem**: Topics remain in `FAILING_OVER` state for extended periods

**Solution**: Check shadow cluster logs for errors and ensure sufficient resources (CPU, memory, disk) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes.

Also ensure there are no leaderless shadow topic partitions and that the controller is neither leaderless nor under-replicated.


Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The replication stream ensures that the shadow cluster receives exact copies of the source cluster data, including offsets, timestamps, and cluster metadata. This creates a read-only shadow cluster that you can quickly promote to handle production traffic during a disaster.

[IMPORTANT]
Contributor

Might look better to keep the admonition but change IMPORTANT to Experiencing an active disaster as the heading; e.g., see here

Unlike traditional replication tools that re-produce messages, Shadowing copies data at the byte level, ensuring shadow topics contain identical copies of source topics with preserved offsets and timestamps.

Shadowing replicates:

Contributor

I'd like to limit links to only what is necessary or helpful; i.e., if users don't know anything about consumer offsets or schema registry, then I think they should search for info, because a link here just distracts them away from the reason they're on this page

Comment on lines +50 to +52
[INFO]
====
Redpanda Data recommends that you don't modify synced topic properties on shadow topics. These properties revert to the source topic values.
Contributor

Suggested change
-[INFO]
-====
-Redpanda Data recommends that you don't modify synced topic properties on shadow topics. These properties revert to the source topic values.
+[CAUTION]
+====
+Do not modify synced topic properties on shadow topics. These properties revert to source topic values.

@Feediver1
Contributor

@paulohtb6 I think you need a single entry in the navtree: Shadowing.
Then an index page that has the content from Shadowing, and another for the Shadowing Guide.
Initially, I asked myself, what is the difference here? Let's discuss whether or not "guide" is the best description to use in the navtree.


[IMPORTANT]
====
This is an emergency procedure. For planned failover testing or regular shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
Contributor

Suggested change
-This is an emergency procedure. For planned failover testing or regular shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
+This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.


== Emergency failover procedure

Follow these steps in order during an active disaster:
Contributor

Suggested change
-Follow these steps in order during an active disaster:
+Follow these steps during an active disaster:

Contributor

The fact that they are numbered steps shows they are sequential.

rpk cluster info --brokers <shadow-cluster-brokers>
----

**Decision point**: If the primary cluster is responsive, consider whether failover is actually needed. Partial outages may not require full disaster recovery.
Contributor

Can you offer additional guidance re what type of partial outages require full DR and which would not? Perhaps list a couple of examples that illustrate each?

# List all shadow links
rpk shadow list
# Check status of your disaster recovery link
Contributor

Suggested change
-# Check status of your disaster recovery link
+# Check the status of your disaster recovery link

rpk shadow status <disaster-recovery-link-name>
----

Verify these conditions before proceeding with failover:
Contributor

Suggested change
-Verify these conditions before proceeding with failover:
+Verify that the following conditions exist before proceeding with failover:

Contributor

Also, can you show an example of this output?

rpk shadow status <disaster-recovery-link-name> > failover-status-$(date +%Y%m%d-%H%M%S).log
----

IMPORTANT: Note the replication lag to estimate potential data loss.
Contributor

How much lag is acceptable/unacceptable? I know you say it depends on your RPO requirements, but can you offer users a specific example?
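
One way to give users a specific example is a short worked calculation that turns observed lag into time and bytes, then compares it with the RPO. The sketch below is illustrative only: the function names and all numbers are assumptions, not Redpanda guidance, and how `rpk shadow status` reports lag is not shown in this thread.

```python
def lag_seconds(lag_messages: int, produce_rate_per_sec: float) -> float:
    """Convert a message lag into seconds of unreplicated production."""
    if produce_rate_per_sec <= 0:
        raise ValueError("produce rate must be positive")
    return lag_messages / produce_rate_per_sec


def estimated_loss_bytes(lag_messages: int, avg_message_bytes: int) -> int:
    """Rough upper bound on bytes lost if the source is unrecoverable."""
    return lag_messages * avg_message_bytes


def within_rpo(lag_messages: int, produce_rate_per_sec: float, rpo_seconds: float) -> bool:
    """Return True if the lag fits inside the recovery point objective."""
    return lag_seconds(lag_messages, produce_rate_per_sec) <= rpo_seconds


# Illustrative numbers: 12,000 messages behind at 400 messages per second.
print(lag_seconds(12_000, 400.0))           # 30.0 seconds of data at risk
print(estimated_loss_bytes(12_000, 1_024))  # 12288000 bytes, about 12 MB
print(within_rpo(12_000, 400.0, 60.0))      # True for a 60-second RPO
```

With these numbers, a 60-second RPO is met with 30 seconds to spare, while a 15-second RPO would already be exceeded.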

[[initiate-failover]]
=== Initiate failover

For complete cluster failover (recommended during disasters):
Contributor

Suggested change
-For complete cluster failover (recommended during disasters):
+If your data shows that you require a complete cluster failover (recommended during disasters):


[,bash]
----
# Fail over all topics in the shadow link
Contributor

Output?

rpk shadow failover <disaster-recovery-link-name> --all --no-confirm
----

For selective topic failover (if only specific services are affected):
Contributor

Suggested change
-For selective topic failover (if only specific services are affected):
+For selective topic failover (when only specific services are affected):

----
# Monitor status until all topics show PROMOTED
watch -n 5 "rpk shadow status <disaster-recovery-link-name>"
----
Contributor

Won't mark in each step, but setting up a specific example throughout this workflow and then showing all the data that matches that example would be a nice user experience with docs--give users the validation they are likely seeking during a potentially stressful situation.
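
As one possible shape for such a worked example, the `watch` loop in the monitoring step could be paired with a script that exits once every topic reports `PROMOTED`. The Python sketch below is an assumption-laden illustration: `wait_until_promoted` and the `fetch_states` callable are hypothetical, and how topic states are read from `rpk shadow status` or the Admin API is deliberately left to the caller because that output format is not shown here.

```python
import time
from typing import Callable, Dict


def wait_until_promoted(
    fetch_states: Callable[[], Dict[str, str]],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Poll topic states until all are PROMOTED or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        # fetch_states is a caller-supplied hook returning {topic: state}.
        states = fetch_states()
        pending = {t: s for t, s in states.items() if s != "PROMOTED"}
        # An empty snapshot is treated as "not ready yet", not as success.
        if states and not pending:
            return states
        if clock() >= deadline:
            raise TimeoutError(f"topics not promoted: {sorted(pending)}")
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without waiting; in normal use the defaults apply.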

[[cleanup-stabilize]]
=== Clean up and stabilize

Once applications are running normally:
Contributor

Suggested change
-Once applications are running normally:
+After all applications are running normally:


**Solution**: Check shadow cluster logs for errors and ensure sufficient resources (CPU, memory, disk) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes.

If topics remain stuck and you need to fail over everything immediately, you can force delete the shadow link, which will promote all topics:
Contributor

Suggested change
-If topics remain stuck and you need to fail over everything immediately, you can force delete the shadow link, which will promote all topics:
+If topics remain stuck and you need to failover everything immediately, you can force delete the shadow link, which will promote all topics:

Contributor

Failover: This is a technical term used in IT and disaster recovery contexts. It describes the process of automatically or manually switching to a standby system, server, or network upon the failure of the primary system. Failover ensures continuity of service and minimizes downtime during outages. For example, if a server goes down, the system can failover to a backup server to maintain operations.

Fail Over: This term is often used informally to describe the action of switching from one system to another. However, it is not as widely recognized or used in formal documentation as "failover." In many cases, "fail over" may be seen as a less precise way to describe the same process.

Contributor

@Feediver1 Feediver1 Oct 17, 2025

I understand why you put "fail over" in certain contexts throughout, but every time I see it followed by "failover" I find it distracting, or just an inconsistency (which can undermine a reader's confidence). Pretty minor/trivial, but thinking maybe just use failover throughout. At least that is Copilot's guidance. Mine too.


**Problem**: Applications cannot connect to shadow cluster after failover

**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for shadow cluster and test network connectivity from application hosts.
Contributor

Suggested change
-**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for shadow cluster and test network connectivity from application hosts.
+**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for the shadow cluster and test network connectivity from application hosts.


**Problem**: Consumers start from beginning or wrong positions

**Solution**: Verify consumer group offsets were replicated (check your filters) and use `rpk group describe <group-name>` to check offset positions. If necessary, manually reset offsets to appropriate positions.
Contributor

Can you add a link to the doc they can follow for manually resetting consumer group offsets?

https://support.redpanda.com/hc/en-us/articles/23499121317399-How-to-manage-consumer-group-offsets-in-Redpanda

4. **Reverse replication**: Consider setting up shadowing in the opposite direction

== Post-incident actions

Contributor

After identifying the cause and resolving the cluster failure, resume your regular disaster recovery planning tasks, which should include:


See also: link:/api/doc/admin/[Admin API v2 reference^]

== Failover
Contributor

@Feediver1 Feediver1 Oct 17, 2025

Add your note from the beginning of the doc here--users don't always read from start to finish, and we wouldn't want them to miss it:

Experiencing an active disaster? See Emergency Shadowing Failover for immediate step-by-step emergency procedures.

== How Shadowing fits into disaster recovery

Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery.
Contributor

Please add either a link to definitions, or glossary definition for RPO and RTO


Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery.

The architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.
Contributor

Suggested change
-The architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.
+Redpanda's disaster recovery architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.

Comment on lines +34 to +36
Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters.

While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.
Contributor

Suggested change
Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters.
While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.
Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters. While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.


After you promote shadow topics with a failover, automatic fallback to the original source cluster is not supported.

[INFO]
Contributor

I've not come across this Info note format in our docs. It lacks specificity: is it a tip, an important reminder, a caution, or just a run-of-the-mill note (pay particular attention to this point)? Consider another note type so users clearly understand the context and importance.


- You must have xref:get-started:licensing/overview.adoc[Enterprise Edition] licenses on both clusters.

- Both clusters must run Redpanda version 25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs version 25.3, the shadow cluster can run 25.3 or 26.1, but not 27.1.
Contributor

Suggested change
- Both clusters must run Redpanda version 25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs version 25.3, the shadow cluster can run 25.3 or 26.1, but not 27.1.
- Both clusters must be running Redpanda v25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs v25.3, the shadow cluster can run v25.3 or v26.1, but not v27.1.
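
To make the version rule above concrete, here is a minimal sketch of a compatibility check. It is illustrative only and not part of Redpanda or `rpk`. It assumes feature releases are numbered `YY.N` with three per year (so 25.3 is followed directly by 26.1, as the example in the text implies), and it assumes a shadow cluster that is behind the source is not allowed, which the text does not state. The function names are hypothetical.

```python
# Assumption (not stated in the docs): feature releases are numbered YY.N
# with N running 1..3 each year, so 25.3 is followed directly by 26.1.
RELEASES_PER_YEAR = 3
MIN_VERSION = (25, 3)  # Both clusters must run 25.3 or later.


def parse(version: str) -> tuple[int, int]:
    """Parse 'v25.3', '25.3', or '25.3.1' into (year, feature)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def release_index(version: tuple[int, int]) -> int:
    """Map (year, feature) to a position in the assumed release sequence."""
    year, feature = version
    return year * RELEASES_PER_YEAR + (feature - 1)


def shadow_version_ok(source: str, shadow: str) -> bool:
    """Return True if the shadow is on the same feature release as the
    source or exactly one feature release ahead of it."""
    src, shd = parse(source), parse(shadow)
    if src < MIN_VERSION or shd < MIN_VERSION:
        return False
    # Assumption: a shadow cluster older than the source is rejected.
    gap = release_index(shd) - release_index(src)
    return gap in (0, 1)


if __name__ == "__main__":
    for shadow in ("25.3", "26.1", "27.1"):
        print(shadow, shadow_version_ok("25.3", shadow))
```

With a source cluster on 25.3, the check accepts 25.3 and 26.1 and rejects 27.1, matching the example in the prerequisite.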


To set up Shadowing:

* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration
Contributor

Suggested change
* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration
* **Understand replication behavior**: Before you set up Shadowing, be sure to review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration

* **Configure filters**: Define which topics, consumer groups, and ACLs should replicate by creating include/exclude patterns that match your disaster recovery requirements
Contributor

Link to "Set filters" below.


* **Create a shadow link**: Establish the connection between clusters using `rpk`, the Admin API, or Redpanda Console with authentication and network settings
Contributor

Link to this task below

5 participants