dr: adds shadowing docs #1381

paulohtb6 · 2025-10-08T23:56:48Z

Description

Adds Shadowing docs.
Adds emergency runbook.

Resolves https://redpandadata.atlassian.net/browse/DOC-1665
Review deadline: Oct 17th

Page previews

Shadowing
Shadowing guide

Checks

New feature
Content gap
Support Follow-up
Small fix (typos, links, copyedits, etc)

netlify · 2025-10-08T23:56:54Z

✅ Deploy Preview for redpanda-docs-preview ready!

Name	Link
🔨 Latest commit	`19a79d1`
🔍 Latest deploy log	https://app.netlify.com/projects/redpanda-docs-preview/deploys/68f1a9d925a2600009a0456c
😎 Deploy Preview	https://deploy-preview-1381--redpanda-docs-preview.netlify.app

coderabbitai · 2025-10-08T23:57:09Z

📝 Walkthrough

Added a navigation entry for a new Shadowing guide under Redpanda deployment manual.
Introduced a comprehensive Shadowing documentation page covering architecture, scope, prerequisites, setup, configuration, filtering, monitoring, failover behavior, and best practices, with CLI/Admin API examples.
Added an emergency runbook page for disaster failover of Shadow Links, including assessment, verification, failover execution (cluster-wide or selective), monitoring, app reconfiguration, troubleshooting, recovery, and post-incident steps.
Included enterprise licensing cross-reference in the emergency guide.
No changes to exported/public code entities.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Admin as Operator
  participant Prim as Primary Cluster
  participant Shadow as Shadow Cluster
  participant Ctrl as Admin API / rpk
  participant Sec as Auth/TLS
  participant Obs as Monitoring

  rect rgb(235, 245, 255)
  note over Admin,Ctrl: Configure Shadowing
  Admin->>Ctrl: Create shadow link (templates, filters)
  Ctrl->>Sec: Authenticate / TLS handshake
  Ctrl->>Prim: Apply link config
  Prim-->>Shadow: Establish replication channel
  end

  rect rgb(245, 255, 235)
  note over Prim,Shadow: Ongoing Replication (normal ops)
  Prim-->>Shadow: Replicate topics/configs/ACLs/schema
  Prim-->>Shadow: Preserve offsets/timestamps (where applicable)
  Admin->>Ctrl: rpk/admin queries (status/metrics)
  Ctrl-->>Obs: Emit metrics/alerts
  end

  rect rgb(255, 245, 235)
  note right of Admin: Planned ops are handled in Shadowing guide
  end

sequenceDiagram
  autonumber
  actor Admin as Operator
  participant Prim as Primary Cluster
  participant Shadow as Shadow Cluster
  participant Ctrl as Admin API / rpk
  participant Apps as Applications/Clients
  participant Obs as Monitoring

  rect rgb(255, 245, 235)
  note over Admin,Prim: Emergency Failover Runbook
  Admin->>Prim: Assess incident, document state
  Admin->>Shadow: Verify readiness/health
  Admin->>Ctrl: Initiate failover (full or selective)
  Ctrl->>Shadow: Transition shadow links (FAILING_OVER→ACTIVE)
  Shadow-->>Obs: Report progress/status
  end

  rect rgb(245, 255, 235)
  note over Apps,Shadow: Post-failover
  Admin->>Apps: Update bootstrap/endpoints, TLS/ACLs
  Apps->>Shadow: Reconnect and resume traffic
  Admin->>Ctrl: Verify topics/consumer groups/offsets
  end

  alt Issues detected
    Obs-->>Admin: Alerts (PAUSED, stuck states, auth failures)
    Admin->>Ctrl: Troubleshoot per runbook steps
  else Stable
    Admin->>Prim: Plan recovery/back-sync later
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title succinctly conveys the main change by stating that shadowing documentation is being added and does not include extraneous details, making it clear and focused on the core update.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.
Description Check	✅ Passed	The pull request description follows the required template structure with all necessary sections completed. The Description section includes the JIRA ticket reference (DOC-1665) and review deadline (Oct 17th), clearly stating what is being added (Shadowing docs and emergency runbook). The Page previews section provides two properly formatted preview links following the specified pattern. The Checks section includes the required checkboxes with "New feature" appropriately selected. The description is concise and complete, providing sufficient context for reviewers to understand the scope of the changes.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

modules/ROOT/nav.adoc (1)
88-91: Add the emergency failover doc to navigation

Shadowing entry looks good. Add a sibling nav item for the emergency runbook so users can find it.

Example:
 **** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
 **** xref:deploy:redpanda/manual/resilience/shadowing.adoc[Shadowing]
+**** xref:deploy:redpanda/manual/resilience/emergency-shadowing.adoc[Emergency Shadowing Failover]
 **** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (2)

290-299: Avoid promoting plaintext secrets in examples

Add a callout suggesting env vars or file-based secrets for credentials (and mTLS certs/keys), not inline plaintext.

Example:

Prefer env vars (RPK_SASL_PASSWORD) or reference secret files

Link to security guidance on managing secrets
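
A minimal sketch of the pattern suggested above, assuming the credential is supplied either through an environment variable or a secret file. The variable name `RPK_SASL_PASSWORD` is taken from this comment and has not been verified against rpk; `load_secret` is a hypothetical helper, not part of any Redpanda tooling.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_secret(env_name: str, file_path: Optional[str] = None,
                env: Mapping[str, str] = os.environ) -> str:
    """Return a secret from an environment variable, else from a secret file."""
    value = env.get(env_name)
    if value:
        return value
    if file_path is not None:
        # Strip the trailing newline that editors and `echo` usually add.
        return Path(file_path).read_text(encoding="utf-8").strip()
    raise LookupError(f"secret {env_name!r} is not set and no secret file was given")
```

With this shape, doc examples can reference the variable or file path and never show the password inline.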

38-38: Diagram TODO

If you need help, I can draft a diagram (draw.io/mermaid) showing active→shadow replication, preserved offsets/timestamps, and replicated artifacts.

modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (1)

74-83: Call out irreversibility before executing failover

Add an [IMPORTANT] note that failover promotion is irreversible; no automatic fallback. Place immediately before the commands.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 464513b and dafed89.

📒 Files selected for processing (3)

modules/ROOT/nav.adoc (1 hunks)
modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (1 hunks)
modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (1 hunks)

🔇 Additional comments (6)

modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (2)

6-10: Enterprise license note is consistent; LGTM

Keep this partial include at the top across both docs for consistency.

48-56: Verify rpk shadow subcommands and flags

Confirm that rpk shadow list, status, failover, delete, resume and their flags (--all, --topic, --no-confirm) used in emergency-shadowing.adoc (and the corresponding sections in shadowing.adoc) match the current output of rpk shadow --help.

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (4)

330-425: Verify ShadowLinkConfig schema alignment
Ensure the YAML example’s field names (client_options, authentication_configuration, topic_metadata_sync_options, synced_shadow_topic_properties, consumer_offset_sync_options, security_sync_options) exactly match the ShadowLinkConfig schema in the Admin API or rpk CLI.

54-57: Verify and cite Shadowing’s minimum version requirement

Confirm that Shadowing was introduced in Redpanda v25.3 and update the prerequisite if needed.

Add a link to the official v25.3 release notes or product specification where this requirement is defined.

557-576: Confirm shadow-link metrics are documented and standardize type/units
Verify that each redpanda_shadow_link_* metric appears in modules/reference/pages/public-metrics-reference.adoc and update every description to explicitly specify the Prometheus type (counter vs gauge) and units (bytes, records, offsets).
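
One way to run this check is sketched below, assuming both pages are available as plain text. The `redpanda_shadow_link_` prefix comes from this comment; the function name and the regular expression are illustrative assumptions.

```python
import re

# Matches concrete metric names; the literal "redpanda_shadow_link_*" is not matched.
METRIC_PATTERN = re.compile(r"\bredpanda_shadow_link_[a-z0-9_]+")


def undocumented_metrics(guide_text: str, reference_text: str) -> list[str]:
    """Return metric names used in the guide but absent from the reference page."""
    used = set(METRIC_PATTERN.findall(guide_text))
    documented = set(METRIC_PATTERN.findall(reference_text))
    return sorted(used - documented)
```

Passing the contents of shadowing.adoc and public-metrics-reference.adoc would list the names that still need a reference entry.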

231-237: Verify rpk shadow config generate exists and --output flag

Confirm this subcommand and its --output flag are implemented in the CLI; update the docs if they’re missing.

coderabbitai · 2025-10-15T02:50:34Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+Redpanda system topics have specific filtering restrictions:
+
+* Literal filters for `__consumer_offsets` and `_redpanda.audit_log` are rejected
+* Prefix filters for topics starting with `_redpanda` or `__redpanda` are rejected  
+* Wildcard `*` filters will not match topics that start with `_redpanda` or `__redpanda`
+* To shadow specific system topics, you must provide explicit literal filters for those individual topics
+


⚠️ Potential issue | 🔴 Critical

Contradictory system-topic filter guidance

You state literal filters for __consumer_offsets and _redpanda.audit_log are rejected, then say to shadow specific system topics use explicit literal filters. These conflict.

Please clarify which are allowed. If intent is “most _redpanda* topics cannot be matched by wildcard/prefix; only specific allowed system topics may be explicitly literal-included,” adjust bullets accordingly. Proposed text:

Wildcard (*) never matches topics starting with _redpanda or __redpanda

Prefix filters for _redpanda*/__redpanda* are rejected

Only the following system topics may be shadowed via explicit literal filters: ; others are rejected

__consumer_offsets: [allowed|rejected] with literal filter (pick one and be consistent)

🤖 Prompt for AI Agents

In modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc around lines 191–197, the bullets about system-topic filter behavior are contradictory (they say literal filters for __consumer_offsets and _redpanda.audit_log are rejected, then say to shadow specific system topics use explicit literal filters); update the text to a single, consistent set of bullets that matches the reviewer’s proposed wording: state that wildcard (*) never matches topics starting with _redpanda or __redpanda, that prefix filters for _redpanda*/__redpanda* are rejected, list exactly which system topics (if any) may be shadowed via explicit literal filters (or state “none” if none allowed), and explicitly indicate whether __consumer_offsets (and _redpanda.audit_log) are allowed or rejected when using a literal filter so the guidance is unambiguous.
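
To make the ambiguity concrete, here is a minimal sketch of one possible reading of the quoted bullets: the two named topics are rejected as literals, prefix filters that start with a system prefix are rejected, and `*` skips system-prefixed topics. The function names and the `pattern_type` values are hypothetical, and the real validation rules remain for the author to confirm.

```python
SYSTEM_PREFIXES = ("_redpanda", "__redpanda")
REJECTED_LITERALS = {"__consumer_offsets", "_redpanda.audit_log"}


def validate_filter(pattern_type: str, name: str) -> None:
    """Raise ValueError for filters the quoted bullets describe as rejected."""
    if pattern_type == "literal" and name in REJECTED_LITERALS:
        raise ValueError(f"literal filter for {name!r} is rejected")
    if pattern_type == "prefix" and name.startswith(SYSTEM_PREFIXES):
        raise ValueError(f"prefix filter {name!r} targets system topics")


def wildcard_matches(topic: str) -> bool:
    """Return True if a `*` filter would match the topic under the third bullet."""
    return not topic.startswith(SYSTEM_PREFIXES)
```

Under this reading, a literal filter for any other system topic passes validation, which is the case the fourth bullet seems to intend.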

modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc

bharathv · 2025-10-16T18:43:17Z

@paulohtb6 I have a hard time finding these changes in https://deploy-preview-1381--redpanda-docs-preview.netlify.app/current/get-started/intro-to-events/ (can you please point me to the exact URL).

paulohtb6 · 2025-10-16T19:53:01Z

@bharathv Hey Bharath. Changes are in the page previews section on the PR description.

Copying them here too
Page previews
Shadowing
Shadowing runbook

modules/ROOT/nav.adoc

Co-authored-by: Michele Cyran <[email protected]>

micheleRP · 2025-10-16T20:59:40Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

@@ -0,0 +1,212 @@
+= Shadowing Runbook


@paulohtb6 I think the page should be renamed too, so "emergency" is not in the URL. Also, the term runbook feels internal to me. What do you think about Failover for Disaster Recovery or Disaster Recovery Guide? cc @Feediver1

Changed to "Shadowing Guide". Let me know if it's ok or if I should change more.

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

bharathv

initial round of comments.

bharathv · 2025-10-17T02:59:01Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+include::shared:partial$enterprise-license.adoc[]
+====
+
+Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The replication stream ensures that the shadow cluster receives exact copies of the source cluster data, including offsets, timestamps, and cluster metadata. This creates a read-only shadow cluster that you can quickly promote to handle production traffic during a disaster.


It works with Kafka too (which is one of the goals of the project), not sure if the wording is intentional.

edit: after reading further it appears that we are only documenting the requirements for 25.3, perhaps the wording should indicate they are of temporary nature somehow and will be lifted in the subsequent versions. In this case we envision this to be a migration tool too, so soon-ish users will be able to run this against Kafka.

Similar to #1381 (comment) we want to document what's out there today.

bharathv · 2025-10-17T03:02:19Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+Shadowing is designed for active-passive disaster recovery scenarios. Each shadow cluster can maintain only one shadow link.
+
+Shadowing operates exclusively in asynchronous mode and doesn't support active-active configurations. This means there will always be some replication lag. You cannot write to both clusters simultaneously.
+
+xref:develop:data-transforms/index.adoc[Data transforms] are disabled on shadow clusters while Shadowing is active. During a disaster, xref:manage:audit-logging.adoc[audit log] history from the source cluster is lost, though the shadow cluster begins generating new audit logs immediately after promotion.
+
+After you promote shadow topics with a failover, automatic fallback to the original source cluster is not supported. 


Some of the limitations are temporary, for example "only one shadow link" and "automatic fallback to the original source cluster is not supported." Perhaps the wording should convey the temporary nature of these limitations.

In docs we mostly document what's available today. We don't make commitments for future releases. Whenever those limitations are lifted from core, we can revisit this and remove them from the list.

bharathv · 2025-10-17T03:06:40Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+The shadow link itself has a simple state model:
+
+* **`ACTIVE`**: Shadow link is operating normally, replicating data
+* **`PAUSED`**: Shadow link is temporarily stopped but not failed over


PAUSED is not implemented (yet), wonder if we should remove this for now.

bharathv · 2025-10-17T03:47:08Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+=== Shadow link in PAUSED state
+
+**Problem**: Shadow link shows `PAUSED` instead of `ACTIVE`
+


PAUSED is not implemented yet (it's impossible for the link to be in this state for now).

bharathv · 2025-10-17T03:48:49Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+
+**Problem**: Topics remain in `FAILING_OVER` state for extended periods
+
+**Solution**: Check shadow cluster logs for errors and ensure sufficient resources (CPU, memory, disk) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes.


Also ensure there are no leaderless shadow topic partitions and no leaderless or under-replicated controller.

micheleRP · 2025-10-17T03:55:15Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+
+Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The replication stream ensures that the shadow cluster receives exact copies of the source cluster data, including offsets, timestamps, and cluster metadata. This creates a read-only shadow cluster that you can quickly promote to handle production traffic during a disaster.
+
+[IMPORTANT]


Might look better to keep the admonition but change IMPORTANT to Experiencing an active disaster as the heading; e.g., see here

micheleRP · 2025-10-17T03:58:47Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+Unlike traditional replication tools that re-produce messages, Shadowing copies data at the byte level, ensuring shadow topics contain identical copies of source topics with preserved offsets and timestamps.
+
+Shadowing replicates:
+


I'd like to limit links to only what is necessary or helpful; i.e., if users don't know anything about consumer offsets or schema registry, then I think they should search for info, because a link here just distracts them away from the reason they're on this page

micheleRP · 2025-10-17T04:01:14Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+[INFO]
+====
+Redpanda Data recommends that you don't modify synced topic properties on shadow topics. These properties revert to the source topic values.


Suggested change

[INFO]

====

Redpanda Data recommends that you don't modify synced topic properties on shadow topics. These properties revert to the source topic values.

[CAUTION]

====

Do not modify synced topic properties on shadow topics. These properties revert to source topic values.

Feediver1 · 2025-10-17T15:06:59Z

@paulohtb6 I think you need a single entry in the navtree: Shadowing.
Then an index page that has the content from Shadowing, and another for the Shadowing Guide.
Initially, I asked myself, what is the difference here? Let's discuss whether or not "guide" is the best description to use in the navtree.

Feediver1 · 2025-10-17T15:10:38Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+
+[IMPORTANT]
+====
+This is an emergency procedure. For planned failover testing or regular shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.


Suggested change

This is an emergency procedure. For planned failover testing or regular shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.

This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.

Feediver1 · 2025-10-17T15:11:55Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+
+== Emergency failover procedure
+
+Follow these steps in order during an active disaster:


Suggested change

Follow these steps in order during an active disaster:

Follow these steps during an active disaster:

The fact that they are numbered steps shows they are sequential.

Feediver1 · 2025-10-17T15:19:29Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+rpk cluster info --brokers <shadow-cluster-brokers>
+----
+
+**Decision point**: If the primary cluster is responsive, consider whether failover is actually needed. Partial outages may not require full disaster recovery.


Can you offer additional guidance on which types of partial outages require full DR and which do not? Perhaps list a couple of examples that illustrate each?

Feediver1 · 2025-10-17T15:19:57Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+# List all shadow links
+rpk shadow list
+
+# Check status of your disaster recovery link


Suggested change

# Check status of your disaster recovery link

# Check the status of your disaster recovery link

Feediver1 · 2025-10-17T15:24:42Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+rpk shadow status <disaster-recovery-link-name>
+----
+
+Verify these conditions before proceeding with failover:


Suggested change

Verify these conditions before proceeding with failover:

Verify that the following conditions exist before proceeding with failover:

Also, can you show an example of this output?

Feediver1 · 2025-10-17T15:27:02Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+rpk shadow status <disaster-recovery-link-name> > failover-status-$(date +%Y%m%d-%H%M%S).log
+----
+
+IMPORTANT: Note the replication lag to estimate potential data loss.


How much lag is acceptable/unacceptable? I know you say it depends on your RPO requirements, but can you offer users a specific example?
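
One possible shape for such an example is sketched below, assuming lag is reported as a record count per partition and the produce rate is known. Every number and name here is hypothetical; none comes from the docs or from rpk output.

```python
from typing import Mapping, Tuple


def estimate_exposure(lag_records_by_partition: Mapping[int, int],
                      produce_rate_per_second: float) -> Tuple[int, float]:
    """Return (records at risk, approximate seconds of data at risk)."""
    if produce_rate_per_second <= 0:
        raise ValueError("produce rate must be positive")
    records_at_risk = sum(lag_records_by_partition.values())
    return records_at_risk, records_at_risk / produce_rate_per_second


# Hypothetical values: three partitions, 50 records/s, and a 5-second RPO.
records, seconds = estimate_exposure({0: 120, 1: 80, 2: 0}, 50.0)
within_rpo = seconds <= 5.0
```

With these made-up inputs, 200 records (about 4 seconds of traffic) would be at risk, which falls inside a 5-second RPO.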

Feediver1 · 2025-10-17T15:28:45Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+[[initiate-failover]]
+=== Initiate failover
+
+For complete cluster failover (recommended during disasters):


Suggested change

For complete cluster failover (recommended during disasters):

If your data shows that you require a complete cluster failover (recommended during disasters):

Feediver1 · 2025-10-17T15:29:35Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+
+[,bash]
+----
+# Fail over all topics in the shadow link


Feediver1 · 2025-10-17T15:29:57Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+rpk shadow failover <disaster-recovery-link-name> --all --no-confirm
+----
+
+For selective topic failover (if only specific services are affected):


Suggested change

For selective topic failover (if only specific services are affected):

For selective topic failover (when only specific services are affected):

Feediver1 · 2025-10-17T15:32:13Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+----
+# Monitor status until all topics show PROMOTED
+watch -n 5 "rpk shadow status <disaster-recovery-link-name>"
+----


Won't mark in each step, but setting up a specific example throughout this workflow and then showing all the data that matches that example would be a nice user experience with docs--give users the validation they are likely seeking during a potentially stressful situation.
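
As a small illustration of the validation step, the sketch below checks whether every topic has reached `PROMOTED`. It assumes the per-topic states have already been collected into a mapping. Parsing `rpk shadow status` output is deliberately left out because its format is not shown in this thread, and the helper names are hypothetical.

```python
from typing import List, Mapping


def pending_topics(topic_states: Mapping[str, str]) -> List[str]:
    """Return topics that have not reached PROMOTED, sorted by name."""
    return sorted(t for t, state in topic_states.items() if state != "PROMOTED")


def failover_complete(topic_states: Mapping[str, str]) -> bool:
    """True only when at least one topic exists and all are PROMOTED."""
    return bool(topic_states) and not pending_topics(topic_states)
```

An empty mapping counts as incomplete, so a failed status query is not mistaken for a finished failover.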

Feediver1 · 2025-10-17T15:33:55Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+[[cleanup-stabilize]]
+=== Clean up and stabilize
+
+Once applications are running normally:


Suggested change

Once applications are running normally:

After all applications are running normally:

Feediver1 · 2025-10-17T15:35:07Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+
+**Solution**: Check shadow cluster logs for errors and ensure sufficient resources (CPU, memory, disk) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes.
+
+If topics remain stuck and you need to fail over everything immediately, you can force delete the shadow link, which will promote all topics:


Suggested change

If topics remain stuck and you need to fail over everything immediately, you can force delete the shadow link, which will promote all topics:

If topics remain stuck and you need to failover everything immediately, you can force delete the shadow link, which will promote all topics:

Failover: This is a technical term used in IT and disaster recovery contexts. It describes the process of automatically or manually switching to a standby system, server, or network upon the failure of the primary system. Failover ensures continuity of service and minimizes downtime during outages. For example, if a server goes down, the system can failover to a backup server to maintain operations.

Fail Over: This term is often used informally to describe the action of switching from one system to another. However, it is not as widely recognized or used in formal documentation as "failover." In many cases, "fail over" may be seen as a less precise way to describe the same process.

I understand why you put "fail over" in certain contexts throughout, but every time I see it followed by "failover" I find it distracting, or just an inconsistency (which can undermine a reader's confidence). Pretty minor/trivial, but thinking maybe just use failover throughout. At least that is Copilot's guidance. Mine too.

Feediver1 · 2025-10-17T15:39:11Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+
+**Problem**: Applications cannot connect to shadow cluster after failover
+
+**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for shadow cluster and test network connectivity from application hosts.


Suggested change

**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for shadow cluster and test network connectivity from application hosts.

**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for the shadow cluster and test network connectivity from application hosts.

Feediver1 · 2025-10-17T15:41:22Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+
+**Problem**: Consumers start from beginning or wrong positions
+
+**Solution**: Verify consumer group offsets were replicated (check your filters) and use `rpk group describe <group-name>` to check offset positions. If necessary, manually reset offsets to appropriate positions.


Can you add a link to the doc they can follow for manually resetting consumer group offsets?

https://support.redpanda.com/hc/en-us/articles/23499121317399-How-to-manage-consumer-group-offsets-in-Redpanda

Feediver1 · 2025-10-17T15:45:58Z

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc

+4. **Reverse replication**: Consider setting up shadowing in the opposite direction
+
+== Post-incident actions
+


After identifying the cause and resolving the cluster failure, resume your regular disaster recovery planning tasks, which should include:

Feediver1 · 2025-10-17T15:49:33Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+
+See also: link:/api/doc/admin/[Admin API v2 reference^]
+
+== Failover


Add your note from the beginning of the doc here--users don't always read from start to finish, and we wouldn't want them to miss it:

Experiencing an active disaster? See Emergency Shadowing Failover for immediate step-by-step emergency procedures.

Feediver1 · 2025-10-17T15:54:04Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+
+== How Shadowing fits into disaster recovery
+
+Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery.


Please add either a link to definitions, or glossary definition for RPO and RTO

Feediver1 · 2025-10-17T15:55:27Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+
+Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery.
+
+The architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.


Suggested change

The architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.

Redpanda's disaster recovery architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.

Feediver1 · 2025-10-17T15:56:08Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters.
+
+While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.


Suggested change

Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters.

While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.

Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your insurance policy for catastrophic regional disasters. While xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.

Feediver1 · 2025-10-17T16:00:47Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+
+After you promote shadow topics with a failover, automatic fallback to the original source cluster is not supported. 
+
+[INFO]


I've not come across this Info note format in our docs. It lacks specificity: is it a tip? Important reminder? Caution? Or just a run-of-the-mill note (pay particular attention to this point)? Consider another note type so users clearly understand context/importance.

Feediver1 · 2025-10-17T16:02:15Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+
+- You must have xref:get-started:licensing/overview.adoc[Enterprise Edition] licenses on both clusters. 
+
+- Both clusters must run Redpanda version 25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs version 25.3, the shadow cluster can run 25.3 or 26.1, but not 27.1.


Suggested change

- Both clusters must run Redpanda version 25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs version 25.3, the shadow cluster can run 25.3 or 26.1, but not 27.1.

- Both clusters must be running Redpanda v25.3 or later. The shadow cluster can be one feature release ahead of the source cluster, but cannot skip feature releases. For example, if the source cluster runs v25.3, the shadow cluster can run v25.3 or v26.1, but not v27.1.
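
The version rule in this prerequisite can be sketched as a small check. This is an illustration only, not Redpanda code. The function names are hypothetical, and the sketch assumes feature releases are numbered YY.1 through YY.3, which is consistent with the 25.3 to 26.1 step in the example. The text does not say whether the shadow cluster may run an older release than the source, so the sketch conservatively rejects that case.

```python
# Assumption: three feature releases per year (YY.1, YY.2, YY.3).
RELEASES_PER_YEAR = 3


def feature_index(version: str) -> int:
    """Map a version such as 'v25.3' or '25.3.1' to a sequential feature-release index."""
    year, minor = (int(part) for part in version.lstrip("v").split(".")[:2])
    return year * RELEASES_PER_YEAR + (minor - 1)


def shadow_version_allowed(source: str, shadow: str, minimum: str = "25.3") -> bool:
    """Return True if the pair satisfies the rule described in the prerequisite."""
    src, shd, floor = (feature_index(v) for v in (source, shadow, minimum))
    if src < floor or shd < floor:
        return False  # both clusters must run the minimum version or later
    # Same feature release, or the shadow cluster exactly one release ahead.
    return 0 <= shd - src <= 1


print(shadow_version_allowed("25.3", "26.1"))
print(shadow_version_allowed("25.3", "27.1"))
```

With the numbering assumption above, the first call accepts the pair and the second rejects it because 27.1 skips feature releases.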

Feediver1 · 2025-10-17T16:09:01Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+
+To set up Shadowing:
+
+* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration


Suggested change

* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration

* **Understand replication behavior**: Before you set up Shadowing, be sure to review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration

Feediver1 · 2025-10-17T16:10:04Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+To set up Shadowing:
+
+* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration
+* **Configure filters**: Define which topics, consumer groups, and ACLs should replicate by creating include/exclude patterns that match your disaster recovery requirements


Link to "Set filters" below.

Feediver1 · 2025-10-17T16:10:32Z

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

+
+* **Understand replication behavior**: Review what topic properties, consumer groups, and security settings Redpanda automatically replicates versus what requires explicit configuration
+* **Configure filters**: Define which topics, consumer groups, and ACLs should replicate by creating include/exclude patterns that match your disaster recovery requirements
+* **Create a shadow link**: Establish the connection between clusters using `rpk`, the Admin API, or Redpanda Console with authentication and network settings


Link to this task below

dr: adds shadowing docs

8085dc2

paulohtb6 force-pushed the shadowing branch from ac36815 to 8085dc2 on October 9, 2025 00:19

paulohtb6 added 10 commits October 8, 2025 21:34

adding links

46c06ed

enterprise feature notice

1bfb902

modify how internal topics are handled

68f45aa

change tags

82ec9dc

update with the latest from core

d206089

expand on filtering

8ade85d

networking

8bbdbb2

update example

3b47127

add runbook

19b0075

expand on monitoring

dafed89

paulohtb6 marked this pull request as ready for review October 15, 2025 02:44

paulohtb6 requested a review from a team as a code owner October 15, 2025 02:44

add runbook to nav

dfc5105

coderabbitai bot reviewed Oct 15, 2025

fix lists

00441cb

paulohtb6 commented Oct 16, 2025

modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (outdated)

Apply suggestion from @paulohtb6

0534141

paulohtb6 commented Oct 16, 2025

modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (outdated)

Apply suggestion from @paulohtb6

e0c3583

paulohtb6 changed the base branch from main to beta October 16, 2025 15:16

micheleRP reviewed Oct 16, 2025

modules/ROOT/nav.adoc (outdated)

Update modules/ROOT/nav.adoc

0ff44be

Co-authored-by: Michele Cyran <[email protected]>

micheleRP reviewed Oct 16, 2025

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc

enable tls

19a79d1

bharathv reviewed Oct 17, 2025

micheleRP reviewed Oct 17, 2025

Feediver1 reviewed Oct 17, 2025

		=== Shadow link in PAUSED state

		Problem: Shadow link shows `PAUSED` instead of `ACTIVE`


		Problem: Topics remain in `FAILING_OVER` state for extended periods

		Solution: Check shadow cluster logs for errors and ensure sufficient resources (CPU, memory, disk) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes.
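
One way to decide when topics have been in `FAILING_OVER` for too long is to poll their states against a deadline. The sketch below is an assumption-laden illustration: `get_states` is a hypothetical callable that you would implement yourself (for example, by parsing `rpk` or Admin API output), and the only state name taken from the text is `FAILING_OVER`.

```python
import time
from typing import Callable, Dict, List


def stuck_topics(
    get_states: Callable[[], Dict[str, str]],  # hypothetical: returns {topic: state}
    threshold_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until no topic is FAILING_OVER or the threshold elapses.

    Returns the sorted list of topics still in FAILING_OVER when polling stops.
    An empty list means every topic left that state in time.
    """
    deadline = clock() + threshold_s
    while True:
        states = get_states()
        pending = sorted(t for t, s in states.items() if s == "FAILING_OVER")
        if not pending or clock() >= deadline:
            return pending
        sleep(poll_s)
```

If the returned list is not empty, that is the point at which to check the shadow cluster logs, resources, and network connectivity as the solution describes.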


		Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The replication stream ensures that the shadow cluster receives exact copies of the source cluster data, including offsets, timestamps, and cluster metadata. This creates a read-only shadow cluster that you can quickly promote to handle production traffic during a disaster.

		[IMPORTANT]

		Unlike traditional replication tools that re-produce messages, Shadowing copies data at the byte level, ensuring shadow topics contain identical copies of source topics with preserved offsets and timestamps.

		Shadowing replicates:

-[INFO]
-====
-Redpanda Data recommends that you don't modify synced topic properties on shadow topics. These properties revert to the source topic values.
+[CAUTION]
+====
+Do not modify synced topic properties on shadow topics. These properties revert to source topic values.

	This is an emergency procedure. For planned failover testing or regular shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
	This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.


		== Emergency failover procedure

		Follow these steps in order during an active disaster:

	Follow these steps in order during an active disaster:
	Follow these steps during an active disaster:

	# Check status of your disaster recovery link
	# Check the status of your disaster recovery link

	Verify these conditions before proceeding with failover:
	Verify that the following conditions exist before proceeding with failover:

	For complete cluster failover (recommended during disasters):
	If your data shows that you require a complete cluster failover (recommended during disasters):

	For selective topic failover (if only specific services are affected):
	For selective topic failover (when only specific services are affected):

	Once applications are running normally:
	After all applications are running normally:


		Solution: Check shadow cluster logs for errors and ensure sufficient resources (CPU, memory, disk) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes.

		If topics remain stuck and you need to fail over everything immediately, you can force delete the shadow link, which will promote all topics:


		Problem: Applications cannot connect to shadow cluster after failover

		Solution: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for shadow cluster and test network connectivity from application hosts.


		Problem: Consumers start from beginning or wrong positions

		Solution: Verify consumer group offsets were replicated (check your filters) and use `rpk group describe <group-name>` to check offset positions. If necessary, manually reset offsets to appropriate positions.

		4. Reverse replication: Consider setting up shadowing in the opposite direction

		== Post-incident actions


		See also: link:/api/doc/admin/[Admin API v2 reference^]

		== Failover


		== How Shadowing fits into disaster recovery

		Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery.


		The architecture follows an active-standby pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can promote the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.
