New error messages seen in faust app with aiokafka 0.7.1 #166
Comments
Thanks for fixing this. 🎉 |
We're still seeing this issue with v0.6.9 |
@bitdivision what are you seeing? Is this on a single worker or on multiple? |
We've now ignored all of the following logs in our Sentry configuration; however, we are still seeing them in 0.6.9:
We're running with 6 workers. I can give more details on specific logs if that's helpful. Edit: to add to this, we've seen these messages for topics which are clearly being processed correctly. We would have seen lag increases if a topic had stopped processing, but that's not been the case |
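For readers who want to filter these warnings out of Sentry while investigating, a minimal sketch using the sentry_sdk logging integration is shown below. The logger name faust.transport.drivers.aiokafka is an assumption about where the warning is emitted, and the DSN is a placeholder.

```python
import sentry_sdk
from sentry_sdk.integrations.logging import ignore_logger

# Placeholder DSN; replace with your project's DSN.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

# Assumption: the "Has not committed ..." warnings come from the aiokafka
# driver's logger inside Faust. Ignoring that logger keeps the warnings out
# of Sentry while they still appear in normal application logs.
ignore_logger("faust.transport.drivers.aiokafka")
```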
Can you share the logs you are seeing in 0.6.9 with more than 1 worker? It should not happen with 1 worker, I hope :) |
Any update on this? It still happens and it's hard to tell if it's a real issue or not. I'm using 0.6.10. Edit: |
* fix race condition when buffers are full
* Fix error messages in faust app #166
If you run a worker with 1 topic and N partitions, it only consumes from a single partition. After 300 seconds the monitor starts complaining that N-1 partitions are idle. I've had to set |
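The exact setting the commenter used is not preserved above. One documented Faust setting that governs this 300-second warning window is broker_commit_livelock_soft_timeout, so a workaround might look like the sketch below; the choice of setting and the value are assumptions.

```python
from datetime import timedelta

import faust

# Assumption: raising the commit "livelock" soft timeout (default 5 minutes)
# delays the "has not committed" warnings for partitions that are legitimately idle.
app = faust.App(
    "my-app",
    broker="kafka://localhost:9092",
    broker_commit_livelock_soft_timeout=timedelta(minutes=30),
)
```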
Hi all, I am having what seems to be the same problem with faust-streaming==0.8.5. Here is my setup: I have a source topic with 6 partitions. Our Kafka cluster is on Confluent Cloud. (Don't think it matters?) I am also wondering if there is a bad record that's causing something like an infinite loop? If that is the case, the fix is on us, of course.
[...] to skip to a certain offset. So, I just wanted to see if the problem in this issue is still alive, just to eliminate the possibility that our issue is based on a Faust bug. @richardhundt: You mentioned about [...]. Update: I just noticed my issue happens even with a single worker. This time, Partition 0 is stuck at offset=1. Other partitions are moving forward nicely. |
As far as I can tell, the [...] I expected that [...] I also found that in order to [...]:

```python
from aiokafka.consumer import AIOKafkaConsumer
from aiokafka import TopicPartition
from faust import TopicT


async def seek_topic_partition(topic: TopicT, partition: int, offset: int):
    app = topic.app
    consumer = AIOKafkaConsumer(loop=app.loop, group_id=app.conf.id)
    tp = TopicPartition(topic.get_topic_name(), partition)
    await consumer.start()
    consumer.assign([tp])
    await consumer.seek(tp, offset)
    await consumer.stop()
``` |
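For context, a hypothetical way to run the helper above from within a worker; the app id, topic name, partition, and offset are assumptions, not values from this thread.

```python
import faust

# Hypothetical wiring: the app must be configured with the same id as the
# production app, so that app.conf.id (used as group_id above) matches.
app = faust.App("my-app", broker="kafka://localhost:9092")
source_topic = app.topic("test_topic", partitions=6)


@app.task
async def skip_stuck_record() -> None:
    # Move partition 0 past the record that appears to block processing.
    await seek_topic_partition(source_topic, partition=0, offset=2)
```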
@richardhundt Thank you for the details! I'll try AIOKafkaConsumer :) |
This was introduced in 0.6.5 when we actually started calling the verification: 7a45b2b#diff-5704609ad5592d977f497ac5defed2c54606a1bf7e42f0677ddf88f59f47938bR278. The code doesn't care whether commits go through; offsets are set in a dictionary, and this is all we look at:
This probably never worked. I didn't have time to look into this in detail, but my guess is that the global variable is read and updated from different threads and isn't really global. In #153 people also complained about a significant performance regression when this additional check was enabled. Until we find the issue, you can go back to 0.6.4 or patch this check out. |
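For anyone who wants to "patch this check out" without forking, a hypothetical monkeypatch is sketched below. It assumes the warnings are emitted from AIOKafkaConsumerThread.verify_event_path; verify the class and method name against your installed Faust version before relying on it.

```python
from faust.transport.drivers import aiokafka as aiokafka_driver


def _noop_verify_event_path(self, now: float, tp) -> None:
    # Skip the commit/event-path verification that logs the idle-partition warnings.
    return None


# Assumption: verify_event_path is where the check lives in your release.
aiokafka_driver.AIOKafkaConsumerThread.verify_event_path = _noop_verify_event_path
```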
A follow-up to this comment: I stopped seeing messages like "Has not committed TP(topic='test_topic', partition=27) at all since worker start (started 5.52 minutes ago)" and all the partitions started to process as expected as multiple workers join/leave, after finding the following misconfiguration on our end: the number of replicas configured on the app did not match the number configured on the Topic object. Properly aligning them via the env var TOPIC_REPLICATION_FACTOR resolved our issue. This might be a novice mistake, but just leaving a note here anyway in case it's useful. Thanks @richardhundt and @joekohlsdorf for providing the pointers! Reading those helped to narrow down the issue :)
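A minimal sketch of what that alignment can look like, assuming TOPIC_REPLICATION_FACTOR is an environment variable you define yourself and that the topic is declared via app.topic(..., replicas=...); the app id, topic name, and partition count are placeholders.

```python
import os

import faust

# Read one source of truth for the replication factor (assumed env var).
replication = int(os.environ.get("TOPIC_REPLICATION_FACTOR", "3"))

app = faust.App(
    "my-app",
    broker="kafka://localhost:9092",
    topic_replication_factor=replication,
)

# Declare the topic with the same replication factor as the app config,
# so the two settings can no longer drift apart.
source_topic = app.topic("test_topic", partitions=30, replicas=replication)
```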
|
@wbarnha Could you please explain why you closed this issue? I don't see any recent changes to the problematic verification code I showed in #166 (comment). I can still reproduce the problem, and the solution posted by @daigotanaka does not work for me. |
Thanks for getting back to me, I thought this was fixed by @daigotanaka but I'll go ahead and re-investigate. |
this should be fixed with #380 - can you please test it @joekohlsdorf? |
Anyone still seeing this with the latest release? |
most if not all of our
|
Still seeing the issue in
|
I've also seen this error come up while Faust is actually running normally, so it's a bit hard to troubleshoot. I think the solution lies in reviewing our |
For the record, we are facing the exact same issue using faust-streaming 0.10.13 and Python 3.11.3.
Something to consider: if you have a large [...]. What happens is that you fetch a chunk of [...]. Try setting [...]:
Here I'm trying to increase the polling frequency by limiting max poll records and max fetch size, while increasing intervals and timeouts. |
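The settings block referenced above is not preserved in this thread. A sketch of that kind of tuning, using Faust's documented configuration names with illustrative values (the values are assumptions, not the commenter's actual configuration), might look like this:

```python
import faust

app = faust.App(
    "my-app",
    broker="kafka://localhost:9092",
    broker_max_poll_records=100,        # fetch fewer records per poll
    consumer_max_fetch_size=1_048_576,  # cap the amount of data fetched per partition
    broker_max_poll_interval=1200.0,    # allow more time between polls
    broker_session_timeout=120.0,       # more generous session timeout
    broker_request_timeout=300.0,       # keep this larger than the session timeout
)
```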
This is true but it also happens in environments which process millions of small messages. |
I also face this error message in my project.
We have been seeing this error for a while:
There are multiple possible explanations for this:
We will get hundreds or thousands of these messages during a large run.
fastapi==0.90.1
Just noting that we have the same thing; however, for us it is a case where we send 7K messages on initialising the app but thereafter may go several minutes without sending a single message. Then when one does send (after the 300s timer has elapsed), it triggers this. Is it simply triggering because it thinks there should have been a commit in the last 300s, even though the topic hasn't changed the end offset? |
This is an issue when the stream is using transactions. There are two distinct methods used for committing here: the checks at https://github.com/faust-streaming/faust/blob/master/faust/transport/drivers/aiokafka.py#L800 depend on the last committed offsets being updated at https://github.com/faust-streaming/faust/blob/master/faust/transport/drivers/aiokafka.py#L720. However, if you are using transactions, that method is not used when committing offsets; instead, https://github.com/faust-streaming/faust/blob/master/faust/transport/consumer.py#L324 is used, where the last committed offsets are not updated. This results in false warnings being logged, because the check sees no offsets being committed. |
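To illustrate the failure mode, here is a simplified model (an assumption-laden sketch, not Faust's actual code): the check only consults an in-memory record of commit times, so a commit path that never updates that record makes every partition look uncommitted.

```python
import time
from typing import Dict, List, Tuple

TP = Tuple[str, int]  # (topic, partition)


class CommitMonitor:
    """Simplified model of the idle-partition check described above."""

    def __init__(self, warn_after: float = 300.0) -> None:
        self.warn_after = warn_after
        self.last_commit: Dict[TP, float] = {}

    def record_commit(self, tp: TP) -> None:
        # Only the non-transactional commit path records commits in this model.
        self.last_commit[tp] = time.monotonic()

    def seemingly_idle(self, assigned: List[TP]) -> List[TP]:
        # Partitions whose last recorded commit is too old (or missing) are
        # reported as idle, even if a transactional producer committed them.
        now = time.monotonic()
        return [
            tp for tp in assigned
            if now - self.last_commit.get(tp, float("-inf")) > self.warn_after
        ]
```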
So we can ignore these warnings? They're not impacting business? |
@fonty422 if you are using transactions, then yes - you can ignore these for now. |
I am facing a similar problem in my app, where all of a sudden, after some time, the consumer stopped:

```
^--AIOKafkaConsumerThread]: Has not committed TP(topic='workerInput', partition=0) (last commit 6.26 minutes ago).
"code_file": "/usr/local/lib/python3.10/dist-packages/faust/transport/drivers/aiokafka.py",
```

I am using
|
Same issue here with faust 0.11.2 |
Checklist
- I have verified that the issue persists when using the master branch of Faust.

Steps to reproduce
Bring up the faust app and let it run for a while. After running for some time, the application starts logging the error messages quoted above.
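The original application is not included in the issue. A minimal sketch of the kind of app that reproduces the warnings (broker URL, topic name, and partition count are assumptions) would be:

```python
import faust

app = faust.App("repro-app", broker="kafka://localhost:9092")
topic = app.topic("test_topic", partitions=6)


@app.agent(topic)
async def process(stream):
    # Any trivial processing is enough; after a few minutes of running with
    # multiple partitions, the "Has not committed TP(...)" warnings appear.
    async for event in stream:
        print(event)


if __name__ == "__main__":
    app.main()
```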