Conversation

@kbatuigas kbatuigas commented Jul 31, 2025

Description

This pull request enhances the documentation for Redpanda's transaction capabilities, focusing on improving clarity, adding examples, and introducing best practices for managing transactional workloads and disk usage. The changes include updates to the transactional API, examples of use cases, and guidance for optimizing performance and handling transaction failures.

Enhancements to Transaction Documentation:

  • Expanded Transaction Capabilities:

    • Added details about transactions spanning multiple partitions and handling topic deletions during active transactions.
    • Clarified the necessity of the transactional.id property for producer identity and reliability across sessions.
  • Examples and Use Cases:

    • Introduced a collapsible example for multi-ledger transactions to demonstrate atomic updates across multiple partitions.
    • Added an example for exactly-once processing to highlight Redpanda's guarantees against message loss or duplication.

Best Practices and Configuration Guidance:

  • Transaction Management:

    • Reorganized and expanded best practices for configuring transactional workloads, including tuning timeouts and managing memory usage.
    • Provided approaches for handling transaction failures based on specific use cases, such as publishing multiple messages or exactly-once streaming.
  • Disk Usage Optimization:

    • Added a new section on managing disk usage for the kafka_internal/tx topic, including tuning transaction_coordinator_delete_retention_ms and transactional_id_expiration_ms properties.

Resolves https://redpandadata.atlassian.net/browse/
Review deadline:

Page previews

Manage Disk Space
Transactions
transaction_coordinator_disk_usage

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)

@kbatuigas kbatuigas requested a review from a team as a code owner July 31, 2025 06:16
netlify bot commented Jul 31, 2025

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 3797b6f
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/689248b5403521000828e43f
😎 Deploy Preview https://deploy-preview-1257--redpanda-docs-preview.netlify.app

coderabbitai bot commented Jul 31, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.


📝 Walkthrough

Walkthrough

The documentation for Redpanda transactions and disk utilization was reorganized and expanded. The transactions documentation now includes clearer explanations of multi-partition transactions, atomicity, exactly-once semantics, and transaction failure handling. New best practices and configuration guidance were added, along with practical code examples. Additionally, a new section was added to the disk utilization documentation, explaining how transaction metadata in the kafka_internal/tx topic can affect disk usage, and how to tune retention properties to manage this. No changes were made to exported or public entities; all modifications are to documentation content only.

Sequence Diagram(s)

Not applicable: changes are limited to documentation content and do not introduce or modify control flow or feature logic.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~7 minutes

Assessment against linked issues

  • Document kafka_internal usage and its impact on disk space, including transaction metadata and retention (DOC-1412)
  • Explain the effect of transaction_coordinator_delete_retention_ms and transactional_id_expiration_ms (DOC-1412)
  • Provide guidance on tuning transaction metadata retention for disk management (DOC-1412)
  • Mention that kafka_internal includes WASM usage as well as transactions (DOC-1412) — not addressed: the documentation does not mention WASM usage in kafka_internal, only transaction metadata.

Assessment against linked issues: Out-of-scope changes

  • modules/develop/pages/transactions.adoc: Expanded and reorganized Redpanda transactions documentation. The changes go beyond disk space management and include general improvements and examples for transactions, which are not explicitly requested in the linked issue.

Suggested reviewers

  • micheleRP
  • joe-redpanda

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
modules/develop/pages/transactions.adoc (1)

150-158: Minor Java sample issues – missing semicolon and non-string UUID

  1. var target = "target-topic" is missing its terminating semicolon.
  2. UUID.newUUID() should be UUID.randomUUID().toString() to produce the required String.
-var target = "target-topic"
+var target = "target-topic";

-pprops.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.newUUID());
+pprops.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.randomUUID().toString());
🧹 Nitpick comments (2)
modules/manage/pages/cluster-maintenance/disk-utilization.adoc (1)

178-182: rpk cluster config set example sets two keys in one call – double-check CLI behaviour

rpk cluster config set historically accepts a single <property> <value> pair per invocation.
If the current version still follows that pattern, the example may mislead users:

rpk cluster config set transaction_coordinator_delete_retention_ms <milliseconds>
rpk cluster config set transactional_id_expiration_ms            <milliseconds>

Confirm whether multi-key support exists; if not, split the example accordingly.

modules/develop/pages/transactions.adoc (1)

267-268: Order requirement wording may confuse readers

The bullet says:

`transaction_coordinator_delete_retention_ms` is not lower than `transactional_id_expiration_ms`

If the intent is “delete retention ≥ id expiration” consider stating it positively (“must be greater than or equal to”) to avoid misconfiguration.
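If the intended rule is indeed "delete retention ≥ id expiration", it can be written as a one-line check. This is a sketch; the helper name is made up, and the 1-hour misconfiguration value is illustrative:

```python
# Sketch: the ordering constraint the reviewer infers, as a simple check.
DEFAULT_MS = 604_800_000  # 7 days, the default for both properties

def retention_config_ok(delete_retention_ms: int, id_expiration_ms: int) -> bool:
    """transaction_coordinator_delete_retention_ms should be >= transactional_id_expiration_ms."""
    return delete_retention_ms >= id_expiration_ms

print(retention_config_ok(DEFAULT_MS, DEFAULT_MS))  # True: the equal defaults satisfy it
print(retention_config_ok(3_600_000, DEFAULT_MS))   # False: retention shorter than expiration
```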

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f417e45 and 01086ac.

📒 Files selected for processing (2)
  • modules/develop/pages/transactions.adoc (6 hunks)
  • modules/manage/pages/cluster-maintenance/disk-utilization.adoc (1 hunks)
🔇 Additional comments (2)
modules/manage/pages/cluster-maintenance/disk-utilization.adoc (1)

172-174: Possible broken config_ref links due to path segment properties/cluster-properties

The config_ref: macro typically points to reference:cluster-properties.adoc#<property>.
Using properties/cluster-properties may generate a 404 in the rendered docs:

* config_ref:transaction_coordinator_delete_retention_ms,true,reference:cluster-properties[`transaction_coordinator_delete_retention_ms`]
* config_ref:transactional_id_expiration_ms,true,reference:cluster-properties[`transactional_id_expiration_ms`]

Please verify the path and update to avoid broken cross-references.

modules/develop/pages/transactions.adoc (1)

274-282: Cross-reference anchor check for max_transactions_per_coordinator

xref:reference:cluster-properties#max_transactions_per_coordinator assumes the property anchor exists.
If the anchor in cluster-properties.adoc is autogenerated it will be max_transactions_per_coordinator only if the heading exactly matches the property name. Please verify to prevent a dead link.

@Feediver1
lgtm--who is the SME reviewer for this one?

@Feediver1 Feediver1 left a comment


still needs sme approval

* When upgrading a self-managed deployment, make sure to use maintenance mode with a glossterm:rolling upgrade[].

endif::[]
The required `transactional.id` property acts as a producer identity. It enables reliability semantics that span multiple producer sessions by allowing the client to guarantee that all transactions issued by the client with the same ID have completed prior to starting any new transactions.
Is transactional.id a property set by a client, so not a Redpanda config property?

correct.

@bharathv bharathv left a comment

lgtm


@kbatuigas kbatuigas force-pushed the DOC-1412-include-kafka_internal-usage-in-manage-disk-space branch from d9d51dd to 62c6d78 Compare August 4, 2025 21:29
The required `transactional.id` property acts as a producer identity. It enables reliability semantics that span multiple producer sessions by allowing the client to guarantee that all transactions issued by the client with the same ID have completed prior to starting any new transactions.

The two primary use cases for transactions are:
By default, the `enable_transactions` cluster configuration property is set to true. However, in the following use cases, clients must explicitly use the Transactions API to perform operations within a transaction:
consider linking to the property

Added

=== Atomic publishing of multiple messages

With its event sourcing microservice architecture, a banking IT system illustrates the necessity for transactions well. A bank has multiple branches, and each branch is an independent microservice that manages its own non-intersecting set of accounts. Each branch keeps its own ledger, which is represented as a Redpanda partition. When a branch representing a microservice starts, it replays its ledger to reconstruct the actual state.
A banking IT system with an event-sourcing microservice architecture illustrates the necessity for transactions. A bank has multiple branches, and each branch is an independent microservice that manages its own non-intersecting set of accounts. Each branch keeps its own ledger, represented as a Redpanda partition. When a branch (microservice) starts, it replays its ledger to reconstruct the actual state.
this could be simplified. Some terms here might not be widely recognized, specially by non-native speakers. One thing that makes me question this is "ledger". It's unclear if it's a tech-term or bank-term.

Check if the following makes sense in this context:

Suggested change
A banking IT system with an event-sourcing microservice architecture illustrates the necessity for transactions. A bank has multiple branches, and each branch is an independent microservice that manages its own non-intersecting set of accounts. Each branch keeps its own ledger, represented as a Redpanda partition. When a branch (microservice) starts, it replays its ledger to reconstruct the actual state.
A banking system with an event-sourcing microservice architecture illustrates the necessity for transactions. A bank has multiple branches, and each branch is an independent microservice that manages its own accounts. Each branch stores its transaction history in a Redpanda partition. When a branch starts, it replays this history to reconstruct the current account balances.

Updated


To help avoid common pitfalls and optimize performance, consider the following when configuring transactional workloads in Redpanda:

* Ongoing transactions can prevent consumers from advancing. To avoid this, don't set transaction timeout (`transaction.timeout.ms` in Java client) to high values: the longer the timeout, the longer consumers may be blocked. By default, it's about a minute, but it's a client setting that depends on the client.

Ongoing transactions can prevent consumers from advancing.

Advancing what? offsets?


also, why do we mention the java client specifically? This is a kafka setting, so it's likely that all supported clients have this value for their producers.


it's about a minute,

This is vague. We could be precise here.


it's a client setting that depends on the client.

Could be simplified.


Suggested change
* Ongoing transactions can prevent consumers from advancing. To avoid this, don't set transaction timeout (`transaction.timeout.ms` in Java client) to high values: the longer the timeout, the longer consumers may be blocked. By default, it's about a minute, but it's a client setting that depends on the client.
* Ongoing transactions can prevent consumers from advancing their offsets. To avoid this, don't set transaction timeout (`transaction.timeout.ms`) to high values: the longer the timeout, the longer consumers may be blocked. The default is 60,000 ms (60 seconds).


@bharathv would you mind confirming if the suggested changes are correct?


Advancing what? offsets?

When a Kafka consumer is configured to use the read_committed isolation level, it will only process messages that have been part of successfully committed transactions. This mechanism ensures data consistency, preventing consumers from acting on incomplete or potentially aborted transactions. However, a stuck transaction can prevent this process from moving forward, impacting consumer progress and potentially causing a backlog.

A stuck transaction with a large timeout can block other committed transactions; this is by design, so large timeouts combined with stuck transactions result in seemingly stuck consumers.
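The blocking behavior described here can be modeled with a tiny pure-Python sketch. No client library is involved; the record shape and the `txn_state` field are invented for illustration, standing in for the broker's last-stable-offset bookkeeping:

```python
# Sketch: what read_committed means for a consumer, modeled on plain data.
# Each record carries the state of the transaction it belongs to.

records = [
    {"offset": 0, "value": "deposit",  "txn_state": "committed"},
    {"offset": 1, "value": "withdraw", "txn_state": "aborted"},
    {"offset": 2, "value": "transfer", "txn_state": "open"},  # stuck/ongoing txn
]

def read_committed(batch):
    """Only records from committed transactions are visible; an open
    transaction also blocks everything after it, so the consumer cannot
    advance past that point until the txn commits, aborts, or times out."""
    visible = []
    for rec in batch:
        if rec["txn_state"] == "open":
            break  # consumer is blocked here
        if rec["txn_state"] == "committed":
            visible.append(rec["value"])
    return visible

print(read_committed(records))  # ['deposit']
```

The aborted record is skipped silently, but the open one halts consumption entirely, which is why long transaction timeouts translate directly into long consumer stalls.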


also, why do we mention the java client specifically? This is a kafka setting, so it's likely that all supported clients have this value for their producers.

Just to be clear, it is a Kafka client setting. It is named `transaction.timeout.ms` in the Kafka Java client implementation but could be named something else in a different implementation. The docs mention the Java client because it is the most popular one.


it's about a minute,

Exactly a minute: 60,000 ms (the configuration is specified in milliseconds).
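The client-side setting being discussed can be sketched as a plain config mapping. The keys follow the Kafka client's dotted-property convention; the broker address and transactional id values are made up for illustration:

```python
# Sketch: transactional producer settings as a plain config mapping
# (keys follow the Kafka client convention; values are illustrative).

producer_config = {
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "transactional.id": "branch-ledger-1",   # hypothetical id; must be stable per producer
    "transaction.timeout.ms": 60_000,        # the default: exactly one minute
}

# Keeping the timeout low bounds how long read_committed consumers can be
# blocked behind a stuck transaction.
print(producer_config["transaction.timeout.ms"])  # 60000
```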

ifndef::env-cloud[]
* When running transactional workloads from clients, tune xref:reference:cluster-properties#max_transactions_per_coordinator[`max_transactions_per_coordinator`] to the number of active transactions that you expect your clients to run at any given time (if your client transaction IDs are not reused).
+
The total number of transactions in the cluster at any one time is `max_transactions_per_coordinator * transaction_coordinator_partitions` (`transaction_coordinator_partitions` default is 50). When the threshold is exceeded, Redpanda terminates old sessions. If an idle producer corresponding to the terminated session wakes up and produces, its batches are rejected with the message `invalid producer epoch` or `invalid_producer_id_mapping`, depending on where it is in the transaction execution phase.

Suggested change
The total number of transactions in the cluster at any one time is `max_transactions_per_coordinator * transaction_coordinator_partitions` (`transaction_coordinator_partitions` default is 50). When the threshold is exceeded, Redpanda terminates old sessions. If an idle producer corresponding to the terminated session wakes up and produces, its batches are rejected with the message `invalid producer epoch` or `invalid_producer_id_mapping`, depending on where it is in the transaction execution phase.
The cluster's total transaction limit is `max_transactions_per_coordinator * transaction_coordinator_partitions` (default is 50 partitions). When this limit is exceeded, Redpanda terminates old sessions. If a terminated producer later tries to produce data, Redpanda rejects its batches with `invalid producer epoch` or `invalid_producer_id_mapping` errors.

This phrase flow is kinda awkward. Here's a suggestion. Feel free to pick whatever works best.


Updated
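The arithmetic in the quoted paragraph can be sanity-checked with a quick sketch. The partition count of 50 is the default stated above; the per-coordinator value of 1,000 is a hypothetical workload figure, not a documented default:

```python
# Sketch: cluster-wide transaction capacity from the quoted formula.
# transaction_coordinator_partitions defaults to 50 (per the text above);
# max_transactions_per_coordinator is tuned per workload.

def cluster_transaction_limit(max_per_coordinator: int,
                              coordinator_partitions: int = 50) -> int:
    """Total concurrent transactions the cluster tracks before evicting old sessions."""
    return max_per_coordinator * coordinator_partitions

# Example: expecting up to 1,000 active transactions per coordinator partition.
print(cluster_transaction_limit(1_000))  # 50000
```

Exceeding this product is what triggers the session termination (and the subsequent `invalid producer epoch` / `invalid_producer_id_mapping` rejections) described above.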

@paulohtb6 paulohtb6 left a comment

a few comments to improve wording. But overall LGTM

@jason-da-redpanda
jason-da-redpanda commented Aug 5, 2025

in

https://deploy-preview-1257--redpanda-docs-preview.netlify.app/current/manage/cluster-maintenance/disk-utilization/#manage-transaction-coordinator-disk-usage

In the customer ticket that inspired this doc request, the customer's transactions never ran for over 1 hour (more like 15 minutes). I'm not sure their processing and 15-minute transactions were a particularly niche case.

They did not understand why the tx directory was getting so big. That was down to the fact that the kafka_internal tx files contain ALL historical transactions, that is, completed/committed ones as well as currently open ones. I would like this to be called out in the doc.

They also had the defaults for

transaction_coordinator_delete_retention_ms. Default: 604800000 (7 days).
transactional_id_expiration_ms. Default: 604800000 (7 days).

which is completely over the top given their transactions were only open for 15 minutes to 1 hour max.

Maybe we could have an example: if your typical transactions run for 1 hour, then you can consider setting transaction_coordinator_delete_retention_ms and transactional_id_expiration_ms to 90 minutes (or something similar).
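The sizing suggestion above can be sketched as a quick calculation. The 1.5× headroom factor and the helper name are illustrative assumptions, not documented rules:

```python
# Sketch: derive retention settings from typical transaction duration.
# Assumption (illustrative): keep retention at ~1.5x the longest expected
# transaction rather than the 7-day defaults.

DEFAULT_RETENTION_MS = 604_800_000  # 7 days, the default for both properties

def suggested_retention_ms(max_transaction_minutes: int, headroom: float = 1.5) -> int:
    """Return a retention value in ms with some headroom over the longest transaction."""
    return int(max_transaction_minutes * 60 * 1000 * headroom)

# Transactions that run at most 1 hour -> ~90 minutes of retention.
value = suggested_retention_ms(60)
print(value)                          # 5400000 ms (90 minutes)
print(DEFAULT_RETENTION_MS // value)  # the default is 112x larger
```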

@kbatuigas kbatuigas merged commit 1be2c0c into main Aug 5, 2025
7 checks passed
@kbatuigas kbatuigas deleted the DOC-1412-include-kafka_internal-usage-in-manage-disk-space branch August 5, 2025 18:51