Skip to content

Pulsar: fix KeyNotFoundException acking batch messages on partitioned topics (supersedes #2883)#2950

Merged
jeremydmiller merged 1 commit into
mainfrom
fix-2883-pulsar-batched-partition-ack
May 28, 2026
Merged

Pulsar: fix KeyNotFoundException acking batch messages on partitioned topics (supersedes #2883)#2950
jeremydmiller merged 1 commit into
mainfrom
fix-2883-pulsar-batched-partition-ack

Conversation

@jeremydmiller

Copy link
Copy Markdown
Member

Supersedes #2883. Co-authored with @patrick-cloke-simplisafe, who diagnosed the upstream DotPulsar bug and wrote the original fix.

The bug

DotPulsar's BatchHandler.Add (apache/pulsar-dotpulsar#287) constructs the inner MessageId for each message of a batch via the four-arg ctor and never sets MessageId.Topic. Consumer.Acknowledge then does _subConsumers[messageId.Topic] with the empty string and throws KeyNotFoundException on partitioned topics, so batched messages on partitioned topics are never acked → infinite redelivery.

Non-batched messages are unaffected: ConsumerChannel uses ToMessageId(topic) for those, which correctly populates Topic.

Fix

PulsarListener.FixMessageId reconstructs the MessageId with the partition sub-topic URI ({topic}-partition-{N}) when Topic is empty and Partition >= 0. Short-circuits and returns the original MessageId instance otherwise (forward-compat for when the upstream DotPulsar fix lands; non-partitioned messages pass through unchanged). Applied at the three Acknowledge call sites in PulsarListener: CompleteAsync, DeferAsync, and moveToQueueAsync (the DLQ / retry-letter path).

Difference from #2883

One substantive change: each ack call site uses the source consumer's own IConsumer.Topic (consumer.Topic / sourceConsumer.Topic) rather than _endpoint.PulsarTopic(). DotPulsar exposes Topic on the IConsumer abstraction (verified in DotPulsar/Abstractions/IConsumer.cs; stable since 3.x), so the same helper works correctly for the retry consumer without any branching. The retry consumer is subscribed to {baseTopic}-RETRY (or a user-configured retry topic), so the original _endpoint.PulsarTopic()-based reconstruction would have produced wrong sub-topic names ({baseTopic}-partition-N instead of {baseTopic}-RETRY-partition-N) and re-triggered the same KeyNotFoundException on the retry path for users with partitioned retry topics. Latent in the original PR; addressed here.

Carries Patrick's three unit tests verbatim, plus one new test (FixMessageId_BatchMessageOnPartitionedRetryTopic_UsesRetryTopicForReconstruction) that pins the retry-topic reconstruction contract.

On Envelope.TopicName

@jeremydmiller asked on #2883 whether Envelope.TopicName could carry the topic and avoid URI reconstruction entirely. It can't. DotPulsar's IMessage<T> has no Topic property — the concrete Message<T> ctor takes no topic parameter, and MessageFactory<T> doesn't even know the topic it's constructing for. The only sources for the partition sub-topic at the ack site are (a) MessageId.Topic when set (works for non-batched) and (b) reconstruction from the consumer's base topic + MessageId.Partition. Populating Envelope.TopicName from that same reconstruction would just centralize the formatting in PulsarEnvelopeMapper; it doesn't eliminate it.

A separate enhancement to surface topic info to user handlers via Envelope.TopicName on incoming Pulsar messages is worth pursuing in its own PR — but it's out of scope for the bug fix here.

Verification

  • FixMessageIdTests: 4/4 pass locally (net9.0).
  • Full dotnet build wolverine.slnx -c Release clean (0 warnings, 0 errors).
  • The wider Pulsar compliance suite (PulsarTransportComplianceTests, InlinePulsarTransportComplianceTests, with_cloud_events) has 11 pre-existing local failures around retry/DLQ — verified that the same tests fail identically on baseline origin/main with my fix stashed (BadImageFormatException during handler load → Wolverine runtime-codegen issue with Apple-Silicon macOS, unrelated to this change). CI on Linux x64 validates them.

🤖 Generated with Claude Code

… topics (supersedes #2883)

Workaround apache/pulsar-dotpulsar#287:
DotPulsar's BatchHandler.Add constructs an inner MessageId for each message
of a batch via the four-arg ctor and never sets MessageId.Topic, so
Consumer.Acknowledge hits _subConsumers[messageId.Topic] with the empty
string and throws KeyNotFoundException on partitioned topics. Non-batch
messages are unaffected: ConsumerChannel uses ToMessageId(topic) for those,
which correctly populates Topic.

PulsarListener.FixMessageId reconstructs the MessageId with the partition
sub-topic URI when Topic is empty and Partition >= 0; the helper short-
circuits and returns the original instance when Topic is already populated
(forward-compat for the upstream fix) or the message is non-partitioned.
Applied at the three Acknowledge call sites in PulsarListener:
CompleteAsync, DeferAsync, and moveToQueueAsync (the DLQ / retry-letter
path).

This supersedes @patrick-cloke-simplisafe's PR #2883 with one substantive
change: each call site uses the source consumer's own IConsumer.Topic
property (`consumer.Topic` / `sourceConsumer.Topic`) rather than
_endpoint.PulsarTopic(). DotPulsar exposes Topic on the IConsumer
abstraction (Abstractions/IConsumer.cs), so the same FixMessageId helper
works for the retry consumer without branching - the retry consumer is
subscribed to {baseTopic}-RETRY (or a user-configured retry topic), and
using _endpoint.PulsarTopic() would have reconstructed wrong sub-topic
names ("{baseTopic}-partition-N" instead of
"{baseTopic}-RETRY-partition-N") and re-triggered the same
KeyNotFoundException on partitioned retry topics. Carries Patrick's three
unit tests verbatim plus one additional case covering the retry-topic
reconstruction.

Investigated whether Envelope.TopicName could replace the reconstruction
entirely (per @jeremydmiller's review comment on #2883). It cannot: DotPulsar's
IMessage<T> has no Topic property — the only sources for the partition
sub-topic at the ack site are MessageId.Topic when set (works for
non-batched) and reconstruction from the consumer's base topic +
MessageId.Partition (necessary for the batched-message bug). Populating
Envelope.TopicName from the same reconstruction would just centralize the
formatting in PulsarEnvelopeMapper; it doesn't eliminate it. A separate
Envelope.TopicName improvement for user-facing topic visibility on
incoming Pulsar messages is worth pursuing in its own PR but out of scope
for this fix.

Co-Authored-By: Patrick Cloke <202853352+patrick-cloke-simplisafe@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremydmiller jeremydmiller merged commit 94d213b into main May 28, 2026
23 of 26 checks passed
jeremydmiller added a commit that referenced this pull request May 28, 2026
Bug-fix + feature release on top of 6.1.0 — 13 PRs.

Notable additions:
- Custom Result<T> handler-return-value support (Phases 0+1+2+3, #2952, refs #2221)
- DbContext abstractions for EF Core transaction middleware (#2919 + docs/tests #2954)
- Outgoing Envelope pooling at MessageRouter.RouteForPublish (#2956, closes #2955)
  — ~-504 B/op on transport-bound sends per the CritterStackScalability
  WolverineTransportBenchmarks harness

Bug fixes: scheduled-cascade loss from [ReadAggregate]/[DocumentExists]
handlers (#2941), ancillary-store inbox routing (#2944), Postgres queue-name
length (#2942), MySQL node-record quoting (#2940), Pulsar batched-partition
ack KeyNotFoundException (#2883/#2950), remote-node agent reply timeout
(#2949), and additional resource-disposal cleanup (#2894 from
dmytro-pryvedeniuk).

Polecat bumped 4.1.1 -> 4.2.1 (#2947); Marten + JasperFx families unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant