[Bug] parseMessageMetadata error when broker entry metadata is enabled under high load #22601
Comments
@semistone Just wondering if this could be related to apache/bookkeeper#4196? |
We are still trying to compare what's different between our producer and the perf tool. |
@lhotari
With our producer settings (small batch messages and a big max-bytes limit)
and that payload, it shows this error and publish throughput isn't good. But if we change the settings
and filter out all data bigger than 15K bytes, the error goes away, which is why we decided to create it. We still don't know why, and I don't know which batch producer configuration could fix these errors. We also publish from multi-threaded programs. We also tried to reproduce it with the perf tool, but it didn't always happen. Thanks |
I tried to upgrade to bookkeeper 4.17.0
|
@semistone Please share a way to reproduce it. It's not a problem if it's not always consistent; fixing this issue will be a lot easier if there's at least some way to reproduce it. |
@semistone Thanks for testing this. |
I will try to reproduce it with the perf tool. |
@semistone since you have some way to reproduce this in your own tests, would you be able to test if this can be reproduced with Lines 435 to 436 in 80d4675
It impacts this code: Lines 659 to 681 in 188355b
|
I could almost reproduce it with the perf tool. I guess in batch mode the payload size has some restriction; let me confirm again tomorrow to make sure I didn't make any stupid mistake during my test. |
Hi @lhotari, it only includes one commit, which modifies PerformanceProducer.java. My consumer command is
and producer command is
That error happens when the batch builder is KEY_BASED in my test. When it happens, there is a WARN message in pulsar-broker.log
Unfortunately I can't reproduce it in Docker; I guess Docker standalone is different from my Pulsar cluster. Please help to check it, and let me know if you have any problem reproducing this issue in your environment. Thanks |
I tested, still the same |
@semistone Thanks for testing. That tells us that it's not related to switching the thread in Lines 659 to 681 in 188355b
|
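(A generic illustration of the hand-off being ruled out here, not the actual OpAddEntry code: when a ByteBuf is passed from one thread to another, it normally has to be retained before the hand-off and released by the receiving thread, otherwise the pool can recycle the memory while the other thread is still reading it. The helper name below is made up for the sketch.)

```java
import io.netty.buffer.ByteBuf;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BufferHandOffSketch {
    private static final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Hypothetical helper: hand a buffer to another thread safely.
    static void handOff(ByteBuf data) {
        data.retain(); // keep the buffer alive across the thread switch
        executor.execute(() -> {
            try {
                // read-only use of the buffer on the other thread
                byte[] copy = new byte[data.readableBytes()];
                data.getBytes(data.readerIndex(), copy);
            } finally {
                data.release(); // balance the retain above
            }
        });
    }
}
```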
@lhotari and during run(), I tried to compare and print the ByteBuf.
It shows that
the ByteBuf object hasn't changed, but the data in the ByteBuf has changed. The types are PooledSlicedByteBuf and PooledUnsafeDirectByteBuf. Do you have any idea how to find out who changes the data inside the ByteBuf? |
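One way to narrow down who mutates the buffer (a debugging sketch, not something taken from the thread): record a checksum of the readable bytes when the entry is first created and compare it again right before the bytes are used; if the checksum differs while the object identity stays the same, the memory was overwritten in between. The class and method names below are illustrative.

```java
import io.netty.buffer.ByteBuf;
import java.util.zip.CRC32;

public final class ByteBufMutationCheck {
    // Compute a checksum of the readable bytes without moving reader/writer indexes.
    public static long checksum(ByteBuf buf) {
        byte[] bytes = new byte[buf.readableBytes()];
        buf.getBytes(buf.readerIndex(), bytes);
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }

    // Usage sketch: record the checksum at creation time, verify later.
    // long expected = ByteBufMutationCheck.checksum(data);   // e.g. where the entry is created
    // ...
    // if (ByteBufMutationCheck.checksum(data) != expected) { // e.g. where the entry is written
    //     System.err.println("buffer content changed between creation and use");
    // }
}
```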
I also tested again and checked the data again; it's not related to batch mode. |
There's some thought about this in #22110. That's a WIP PR and it might not make sense in the end. Since then, I have found a pattern used in Netty to detect issues. I'll look that up. |
@semistone do you also use Pulsar Proxy? |
@lhotari no, I didn't |
This is a useful detail. When messageKeyGenerationMode is random and BatcherBuilder.KEY_BASED is used, each batch will be of size 1. This could hint that the problem is related to the #16605 changes; the main difference is here: pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BatchMessageContainerImpl.java Lines 252 to 286 in 6b29382
|
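To make the batch-size-of-1 observation concrete, here is a minimal producer sketch, assuming a local standalone broker and a placeholder topic: with BatcherBuilder.KEY_BASED and a random key per message, every key bucket in the batch container holds exactly one message, so each flushed batch effectively has size 1.

```java
import java.util.UUID;
import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class KeyBasedBatchingSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")                 // placeholder URL
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/issue22601-test")  // placeholder topic
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)              // group batched messages by key
                .blockIfQueueFull(true)
                .create();

        // With a random key per message, every key bucket contains one message,
        // so each flushed batch effectively has size 1.
        for (int i = 0; i < 1000; i++) {
            producer.newMessage()
                    .key(UUID.randomUUID().toString())
                    .value(new byte[1024])
                    .sendAsync();
        }
        producer.flush();
        producer.close();
        client.close();
    }
}
```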
@semistone would it be possible to share your broker.conf customizations? That could help reproduce the issue. I noticed |
Yes, I enabled it and after debugging, I found
Then that issue seems to disappear, but I'm not sure whether there is any side effect or not. Here is our broker.conf; I removed all of the passwords. |
I also debugged it; I don't see the same ByteBuf object being reused between OpAddEntry.createNoRetainBuffer and OpAddEntry.run. |
Interesting detail. Does the problem also go away with
I tested it and it also seems to work. I will repeat the success/failure test later to confirm it again. |
I've been trying to reproduce the issue with local microk8s cluster by deploying Pulsar with Apache Pulsar Helm chart using this values file: https://github.com/lhotari/pulsar-playground/blob/master/test-env/issue22601.yaml |
How many brokers and bookies do you have in the cluster where it reproduces? |
I have 6 bookies in 3 different data centers, and I left only one broker running for debugging. It runs on a physical server with CentOS Linux release 7.9.2009.
I added the big payload options
I was using What gets logged when the issue reproduces? |
Do you happen to run with debug logging level when the issue reproduces? pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java Lines 1799 to 1801 in a66ff17
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java Lines 1859 to 1872 in a66ff17
|
I finally found the root cause in the BookKeeper client. The client had several buffer leaks: the ByteBufList lifecycle was incorrectly handled, and write promises were dropped and ignored when authentication was in progress. The fix is in apache/bookkeeper#4293. |
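As a generic illustration of the failure mode described above (not the actual BookKeeper client code): a reference-counted payload has to be released exactly once on every path, and a deferred or rejected write must still complete its promise; silently dropping it leaks the buffer and hides the outcome from the caller. The helper below is hypothetical.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;

public class SafeWriteSketch {
    // Hypothetical helper: write a buffer and guarantee the caller sees the outcome.
    static ChannelFuture writeOrFail(Channel channel, ByteBuf payload) {
        if (!channel.isActive()) {
            // Do not silently drop the write: release the buffer and fail explicitly,
            // so the caller's completion logic still runs.
            payload.release();
            return channel.newFailedFuture(new IllegalStateException("channel not active"));
        }
        // writeAndFlush takes over the buffer's reference count on success or failure.
        return channel.writeAndFlush(payload);
    }
}
```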
I wrote reproduce steps and some investigation history. I tested from the install step and confirmed it only happens when BookKeeper TLS is enabled. I will test that patch next week. |
@lhotari I tried branch-4.16,
but it seems to still have the issue. I tried to test on the BookKeeper master branch (4.18.0-SNAPSHOT), but had many compatibility issues because Pulsar 3.2.2 is still using 1.6.4. I have updated conf.tgz and explained how to generate the keystore and truststore. Please let me know if you have any problem reproducing this in your local environment. |
@semistone Some initial comments about the reproduce steps. Btw. regarding "op.data = data.asReadOnly(); <=== then that error disappear but I still don't known why." in https://github.com/semistone/personal_notes/blob/main/pulsar_issue_22601/Test.md There are also other Netty bugs impacting the use case. There are 4 PRs that will be in 4.1.111. Before that, it might be possible to find workarounds. Since the workarounds would be temporary, it could be useful to do a local build using the 4.1.111-SNAPSHOT version. In addition, the fixes pending for Bookkeeper 4.16.6 will be needed to fix the problem. It's possible that the issue remains after this, since multiple issues in different libraries are contributing to the problem. |
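Regarding the op.data = data.asReadOnly() observation quoted above, a minimal sketch of why a read-only view can hide the symptom: the view shares the same memory but rejects all writes, so downstream code cannot reuse it as a writable cumulation buffer and is forced to allocate its own.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class ReadOnlyBufferSketch {
    public static void main(String[] args) {
        ByteBuf original = Unpooled.buffer(64).writeBytes(new byte[]{1, 2, 3});

        // The read-only view shares the same memory but rejects all write access,
        // so downstream code cannot use it as a writable cumulation buffer.
        ByteBuf readOnly = original.asReadOnly();
        System.out.println(readOnly.isReadOnly());     // true
        System.out.println(readOnly.isWritable());     // false
        System.out.println(readOnly.readableBytes());  // 3, same data as the original

        original.release();
    }
}
```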
Experiment in branch lh-issue22601-experiment in my fork to build with Bookkeeper 4.16.6-SNAPSHOT and Netty 4.1.111.Final-SNAPSHOT. Compiling Netty 4.1.111.Final-SNAPSHOT locally
Compiling Bookkeeper 4.16.6-SNAPSHOT locally (from lh-backport-pr4293 branch, that includes apache/bookkeeper#4293)
Compiling local Pulsar with Bookkeeper 4.16.6-SNAPSHOT and Netty 4.1.111.Final-SNAPSHOT
|
@semistone Thanks, very useful. I revisited your repro instructions so that it's possible to semi-automate the steps. I have the repro based on your repro in https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601/standalone_env . I'll now test if it reproduces also with the patched version referenced in #22601 (comment). |
I can confirm that this issue reproduces with https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601/standalone_env / https://github.com/semistone/personal_notes/blob/main/pulsar_issue_22601/Test.md also with Netty 4.1.111.Final-SNAPSHOT and Bookkeeper 4.16.6-SNAPSHOT (with PR 4293 included). This is a different issue than the one I reproduced in my microk8s-based repro. |
@semistone Your problem is caused by invalid TLS configuration. When TLS is enabled between Broker and Bookkeeper, you must set |
I found apache/bookkeeper#2071 and apache/bookkeeper#2085 which seem to indicate that running with |
Previous comment about bookkeeperUseV2WireProtocol #21421 (comment) |
It looks like there's a feature that enables V3 protocol using a separate client when V2 is configured: apache/bookkeeper#2085 . As mentioned in the previous comments, I don't think that this has ever been proven to be stable when TLS is enabled. |
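For readers looking for the setting being discussed, a hedged broker.conf fragment; only bookkeeperUseV2WireProtocol is taken from this thread, and whether to flip it is a trade-off (see the comments above), not a general recommendation:

```properties
# broker.conf: make the broker's BookKeeper client use the V3 wire protocol.
# The thread reports the corruption only with V2 + TLS; V3 avoids it at the
# cost of extra allocation/GC overhead.
bookkeeperUseV2WireProtocol=false
```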
@lhotari To also talk about our release history: we actually noticed issue #21421 during our pre-release load test. That error's logs contain a ledger id and entry id, but for this error the data in BookKeeper is correct, which is why at the beginning we weren't aware it was a BookKeeper client issue. So I will use those settings in our broker configuration, and I think it still needs to be reported to the BookKeeper team. I have already been debugging this issue for one month. |
@semistone It seems that V2 should be supported also when using TLS. The downside of using V3 is that it's not very efficient and will add a lot of GC overhead. However some bug gets triggered and that's the reason why V3 is more stable when TLS is enabled between brokers and bookies. I was reviewing code for reference count bugs in Pulsar and I did find an issue in RangeCache (the broker cache implementation). I'll follow up with a PR to fix the issue soon. It's not related to the bookkeeper protocol version at all. |
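A generic illustration of the class of reference-count race mentioned here (not the RangeCache code itself): if one thread releases a cached entry while another is about to retain it, the retain can land on an already-recycled object, so the defensive pattern is to retain first and treat a failed retain as a cache miss. The names below are made up.

```java
import io.netty.util.IllegalReferenceCountException;
import io.netty.util.ReferenceCounted;

public final class CacheRetainSketch {
    // Hypothetical cache lookup: only hand out the value if we managed to retain it
    // before a concurrent release dropped the refcount to zero.
    static <T extends ReferenceCounted> T retainOrMiss(T cached) {
        if (cached == null) {
            return null;
        }
        try {
            cached.retain();         // may race with a concurrent release
            return cached;
        } catch (IllegalReferenceCountException alreadyReleased) {
            return null;             // treat as a cache miss instead of reading freed memory
        }
    }
}
```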
RangeCache race condition fix: #22789 |
I have made a fix to bookkeeper which fixes the issue: apache/bookkeeper#4404 . I tested with Pulsar 3.2.3 with bookkeeper.version set to 4.16.6-SNAPSHOT. Before compiling that locally, I compiled Bookkeeper branch-4.16 with PR 4404 changes included (branch-4.16 backport available at https://github.com/lhotari/bookkeeper/commits/lh-fix-tls-stability-with-V2-branch-4.16/). Compiling Bookkeeper 4.16.6-SNAPSHOT locally with apache/bookkeeper#4404 changes.
Compiling local Pulsar v3.2.3 with Bookkeeper 4.16.6-SNAPSHOT (passing
|
@lhotari I will deploy it to our cluster and test again in a few days. |
@lhotari And I think this issue has been fixed. |
…buffers (#14086)

Motivation: SslHandler doesn't support sharing the input buffers since the input buffers get mutated. This issue has been reported as #14069. Fixes have already been made to support the use case using read-only buffers as input. However, this is not useful for existing Netty applications. I checked the Netty code once again and noticed that it's possible to optimize the current implementation in SslHandler and avoid the buffer mutation. The default behavior in SslHandler causes very hard to debug problems. In Pulsar, there have been issues open for multiple years that have been caused by the SslHandler's default behavior of mutating the input buffers (apache/pulsar#22601 is one starting point to the issues). The current implementation of SslHandler will mutate the input buffers by using the input buffer as a cumulation buffer when there's available capacity. This is not optimal at all. This PR contains an improvement where the cumulation buffer is allocated to the final size immediately so that cumulation is more optimal. That's why I think that this will be more optimal in the end and won't cause performance regressions. After these changes, it should be possible to pass a slice or duplicate of a buffer and the input buffer doesn't get mutated. The input buffer's readIndex will be modified so multiple threads cannot share the same input buffer without a slice or duplicate. I created a failing test case that shows that a duplicate buffer doesn't prevent input buffer mutation.  It's a very surprising behavior of SslHandler that passing a `.retainedDuplicate()` buffer could result in mutations to the original buffer. That's another reason why I think that SslHandler shouldn't ever mutate the inputs.

Modification:
- add `protected ByteBuf composeFirst(ByteBufAllocator allocator, ByteBuf first, int bufferSize)` to AbstractCoalescingBufferQueue to support allocating an optimal cumulation buffer instead of using the input buffer as a cumulation buffer
- remove dead code from `compose` where there was handling for CompositeByteBuf; `composeFirst` will be called before `compose`, so this case isn't needed.

Result: Fixes #14069 without the need to set the input buffer as read-only

Co-authored-by: Norman Maurer <[email protected]>
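As a practical takeaway from the commit above, a hedged sketch for Netty versions without this fix: a retainedDuplicate() or slice still shares memory with the original buffer, so a full copy is the defensive option when the same bytes must stay untouched after passing through a TLS pipeline. The helper name is illustrative.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;

public class TlsWriteWorkaroundSketch {
    // Hypothetical helper for Netty versions without the SslHandler fix:
    // write a deep copy so the TLS pipeline can never scribble over the
    // caller's buffer, at the cost of one extra allocation and memcpy.
    static ChannelFuture writeThroughTls(Channel channel, ByteBuf shared) {
        ByteBuf defensiveCopy = shared.copy(); // independent memory, unlike duplicate()/slice()
        return channel.writeAndFlush(defensiveCopy);
    }
}
```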
Search before asking
Read release policy
Version
Minimal reproduce step
Publish events at about 6k QPS and 100 Mbit/s
with metadata,
in BatcherBuilder.KEY_BASED mode,
and send messages from highly concurrent/parallel producer processes (a rough sketch of this kind of setup is shown below).
It happens only with an almost-real-time consumer (almost zero backlog).
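A rough sketch of this kind of load; this is a minimal, assumption-laden example where the service URL, topic, thread count, payload size, and rate are placeholders, not the reporter's actual code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class ConcurrentPublishSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")          // placeholder URL
                .build();
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/load-test") // placeholder topic
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)
                .blockIfQueueFull(true)                         // throttle instead of failing sends
                .create();

        ExecutorService pool = Executors.newFixedThreadPool(8); // many publisher threads
        for (int t = 0; t < 8; t++) {
            pool.execute(() -> {
                byte[] payload = new byte[2 * 1024];            // ~2 KB payload, adjust as needed
                while (!Thread.currentThread().isInterrupted()) {
                    producer.newMessage()
                            .key(Long.toString(ThreadLocalRandom.current().nextLong()))
                            .value(payload)
                            .sendAsync();                       // high-rate async publish
                }
            });
        }
        // Runs until interrupted; intended only as a load sketch.
    }
}
```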
What did you expect to see?
No lost events.
What did you see instead?
I could see error logs in the broker showing
Failed to peek sticky key from the message metadata
It looks like a thread-safety issue, because it happens randomly.
In 1M events it only happens a few times, but the consumer loses a few events.
Anything else?
The error is similar to
#10967, but I think it's a different issue.
The data in BookKeeper is correct:
I can download the event from BookKeeper and parse it successfully,
or consume the same event a few minutes later and it is consumed successfully.
But all subscriptions get the same error on the same event with a real-time consumer (zero backlog).
I have traced the source code.
It happens in
PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer
-> Commands.peekAndCopyMessageMetadata
I also printed the ByteBuf contents,
and I could clearly see the data isn't the same as what is stored in BookKeeper.
In a normal event, the hex dump usually starts with 010e (magicCrc32c).
In one of our error events, the ByteBuf had about 48 bytes of strange data, then continued with normal data.
This is just one example; sometimes the first few bytes are correct and something is wrong a few bytes later.
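A small sketch of the kind of check described above; the expected magic value is a placeholder that should be taken from the magicCrc32c constant in Pulsar's Commands class for the version in use.

```java
import io.netty.buffer.ByteBuf;

public final class EntryMagicCheck {
    // Set this to the magicCrc32c constant from Pulsar's Commands class for your version;
    // the thread above quotes normal dumps as starting with 010e.
    private static final int EXPECTED_MAGIC = 0x010e; // placeholder value

    // Peek (without consuming) the first two bytes of the entry payload and report
    // whether they match the expected checksum magic.
    static boolean looksLikeValidEntry(ByteBuf entryPayload) {
        if (entryPayload.readableBytes() < 2) {
            return false;
        }
        int magic = entryPayload.getUnsignedShort(entryPayload.readerIndex());
        return magic == EXPECTED_MAGIC;
    }
}
```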
I am still trying to debug when and how the ByteBuf returns incorrect data, and why it only happens during stress testing. It is still not easy to reproduce using the perf tool, but we can 100% reproduce it in our producer code.
Does anyone have any idea what could be causing this issue, and any suggestions on which library or class may have potential issues? Additionally, any suggestions on how to debug this issue or if I need to print any specific information to help identify the root cause would be appreciated. Thank you.
Are you willing to submit a PR?