addBulk performance issues #1670
For addBulk we are using an ioredis multi transaction, which is supposed to use pipelining to greatly alleviate round-trip latencies:
I know, I went over the code and it seems like it should work this way, but looking at the traces it submits a great number of EVALSHA commands that check all the relevant keys around a single job, and this causes immense backpressure on Redis.
Are you sure you really are chunking in 1k chunks? Can you send me a piece of code that shows this behaviour?
Btw, the pipeline does not imply that all the commands are sent in one round trip, but that they do not need to wait for a response before sending the next one. Also, how long is that 1k batch taking to be sent? Depending on network conditions, something like 100ms-200ms would be normal.
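To illustrate the distinction, here is a minimal ioredis sketch (not BullMQ internals; key names and connection details are made up): the commands are queued locally and flushed together, and no command waits for the previous reply, but the replies still arrive one by one.

```ts
import Redis from "ioredis";

const redis = new Redis(); // assumes a local Redis on the default port

async function pipelinedWrites(): Promise<void> {
  const pipeline = redis.pipeline();
  for (let i = 0; i < 1000; i++) {
    pipeline.set(`key:${i}`, i); // queued locally, nothing sent yet
  }
  // All queued commands are written out together; replies are collected
  // asynchronously, so no command blocks waiting for the previous response.
  const results = await pipeline.exec();
  console.log(`executed ${results?.length ?? 0} commands`);
}

pipelinedWrites().catch(console.error);
```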
Hi @manast, thanks for the fast response.
As you can see in the flame graph, it can take 1 second and more.
Can you provide a minimal example that I can run that reproduces the issue? Do you get the same behaviour running this locally?
I will have to create a docker-compose with an entire infrastructure setup.
The snippet above is not enough for me to test. I have my own performance tests and addBulk is performing well, so if you want me to examine your use case you need to provide a simple test case that I can run easily (docker-compose etc. should not be needed, nor external dependencies).
@manast I will do a small test case to check this and post a link to the repo here later this week.
Hello @manast, we have it 👯
Ok, so the problem is not addBulk, it is the workers, or did I miss something?
Still trying to figure out what is choking Redis; not a big expert in Lua scripts, sadly :(
Ok, so at least
We have over 4K k8s pods connected as consumers, and I'm thinking this is what might be pressuring the Redis instance. We are transitioning to a Redis Cluster, but the distribution per shard is by queue at most, so the issue might still happen, as almost 2K pods are listening to the same queue.
Redis Cluster will not bring any benefit to BullMQ performance-wise. I am still investigating your case to see if I can identify some bottlenecks.
Btw, I noticed an issue with your test: it only adds 5k jobs once, as it reuses the jobId for every batch. So the first batch adds 5k jobs, but the subsequent ones are not added again until those jobs are consumed.
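A sketch of how the test could avoid that, assuming BullMQ's addBulk and a hypothetical per-batch id scheme (the names here are illustrative, not the original test code):

```ts
import { Queue } from "bullmq";

const queue = new Queue("test-queue", {
  connection: { host: "localhost", port: 6379 },
});

// Give every job across batches a distinct jobId; when a custom jobId is
// reused, the job is treated as already existing and later batches become
// effectively no-ops until the earlier jobs are consumed.
async function addBatch(batchIndex: number, batchSize = 5000): Promise<void> {
  const jobs = Array.from({ length: batchSize }, (_, i) => ({
    name: "test-job",
    data: { index: i },
    opts: { jobId: `job-${batchIndex * batchSize + i}` },
  }));
  await queue.addBulk(jobs);
}
```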
Secondly, in your consumer code you are outputting the whole job. This consumes a lot of resources; just doing a large console.log that fast and with that much data can push the CPU to 100%.
Without the console.log and a concurrency of 1000, it is processing jobs at around 20%-30% CPU usage... so I cannot identify any performance issues so far.
But ok, so the scenario is 4000 workers, right? What about the concurrency factor per worker? I would like to understand the complete scenario better to see if I can figure out a bottleneck. It should be possible to generate a test case using AWS ElastiCache that more easily shows how the Redis CPU spikes to 100% in some circumstances; you are welcome to contact me privately if you are not allowed to share some details.
Each worker runs a single job at a time; it's a browser process that runs automated testing, Playwright for example.
Ok, so the worst case is 4k jobs simultaneously?
Btw, regarding the AWS instance, I think you will be better off with cache.r6g.large: you get a bit more RAM, but the important thing is that it is much cheaper. The one you are using has 4 cores, but Redis is single-threaded, so there is no use for the remaining 3 at all.
Hmm, it seems that ElastiCache can indeed take some advantage of multiple cores now: https://aws.amazon.com/blogs/database/boosting-application-performance-and-reducing-costs-with-amazon-elasticache-for-redis/
The worst case is way above 4K, reaching up to a few tens of thousands of pods.
Ok. I am not sure how well Redis copes with so many connections; this is something we should investigate. But there are some things BullMQ does per Worker, for example checking for stalled jobs. This check is super fast, but if you have tens of thousands of workers it starts adding up. We can make optimizations to minimize the impact in these cases, though.
I truly believe it's the number of workers connected. Weren't the QueueSchedulers the ones checking for stalled jobs?
@DanielNetzer queue schedulers are not needed anymore. One thing you can try is to increase the interval at which stalled jobs are checked; since you will have thousands of workers, something like every 5 minutes will be more than enough: https://api.docs.bullmq.io/interfaces/WorkerOptions.html#stalledInterval
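A minimal sketch of that suggestion (the 5-minute value is the one mentioned above; the queue name and connection details are placeholders):

```ts
import { Worker } from "bullmq";

const worker = new Worker(
  "automation-queue",
  async (job) => {
    // ... run the browser/automation job here
    return job.data;
  },
  {
    connection: { host: "localhost", port: 6379 },
    // Check for stalled jobs every 5 minutes instead of the default interval,
    // so thousands of workers do not all run the check every few seconds.
    stalledInterval: 5 * 60 * 1000,
  },
);
```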
Hello @manast. We currently process approximately 10K jobs per second, give or take, it depends.
I will fix the API docs. How those settings could affect things is a bit involved to explain. For instance,
@manast I think I managed to isolate the issue, it's the blocking BRPOPLPUSH.
Which version of BullMQ are you using in that test? The current maximum timeout for BRPOPLPUSH in BullMQ is 10 seconds, so I do not know how it can be almost 40 seconds in your screenshot unless it is a different version.
So I rolled out a version with the latest ioredis and BullMQ on Friday to our DEV and Staging environments; if all is well, I will roll it out to PROD as well.
Or if there are only delayed/rate-limited jobs.
So this issue still remains a mystery. We are deploying new Redis Clusters instead of the single instance and will check the performance, and we are also updating the Redis version to 7.* (ElastiCache).
I checked; the performance seems to be the same, so no improvements from the AWS side of things. We are switching to a Redis Cluster; hopefully it will distribute the queues across the different shards.
I really want to help you understand what is going on. If you use Cluster you have to consider this: https://docs.bullmq.io/bull/patterns/redis-cluster
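For reference, a minimal sketch of the hash-tag pattern described in that link (the cluster endpoint and queue name are placeholders): wrapping the prefix in curly braces keeps all of a queue's keys in the same hash slot, which BullMQ's multi-key Lua scripts require on a Redis Cluster.

```ts
import { Cluster } from "ioredis";
import { Queue } from "bullmq";

// Placeholder endpoint; in practice use the ElastiCache configuration endpoint.
const connection = new Cluster([{ host: "redis-cluster.example.internal", port: 6379 }]);

// The {...} hash tag pins every key of this queue to a single hash slot.
const queue = new Queue("automation", { connection, prefix: "{automation}" });
```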
Yep, already using that.
@DanielNetzer it looks like you are using Datadog; could you also take a look at the Node runtime metrics dashboard? If Node's CPU is very high or you are hitting event-loop delays, then you can't entirely trust the timing of the spans, because they can include time where the system was hanging. That's a problem with the service, not Bull/Redis, but the spans will look like they are slower.
I noticed that when we enabled Bull on one of our services it makes ~6k EVALSHA calls a second in total across 3 nodes (~2k EVALSHAs a second per node), which seems very high for our limited use case, but I'll have to dig deeper before I can share anything there. Although this shouldn't be much load on the Redis instance, I think creating ~6k spans a second is adding a perf hit from the Datadog tracer. I might try configuring the tracer to not create spans for these commands.
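Not the exact change made here, but a rough sketch of what disabling those spans could look like with dd-trace, assuming its standard plugin configuration (treat the option as an assumption and check the tracer docs):

```ts
import tracer from "dd-trace";

tracer.init();

// Skip span creation for the ioredis integration entirely; with thousands of
// EVALSHA calls per second, per-command spans alone add measurable CPU cost.
tracer.use("ioredis", { enabled: false });
```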
Following up on my comment. Our issue was with calling removeJob in
@billoneil could you share more information regarding your call to
@manast it was the
The relevant point for this thread is that making thousands of calls rapidly may perform well in Redis, but we saw a performance hit on our application side from the APM. That CPU hit resulted in the spans (generated client side) appearing slower than the actual request took (server side on Redis). The issue was the CPU maxing out on our service's side, not Redis/BullMQ. This is not an issue with BullMQ.
Hi @alexhwoods, we moved away from using BRPOPLPUSH in v5 (https://docs.bullmq.io/changelog); could you try upgrading and let us know how it goes?
Will do! Thank you!
Hello everyone, we have been experiencing performance issues when utilizing the addBulk method (https://docs.bullmq.io/guide/queues/adding-bulks) to add over 1K jobs (we currently split our bulks into smaller chunks of 1K).
As shown in the flame graph above, for each of the jobs added it computes multiple fields, creating an insane latency (which can reach up to a few seconds).
This is an example of one of the EVALSHA commands being submitted individually.
The infrastructure used is AWS ElastiCache with a single node of type cache.m6g.xlarge.
BullMQ should create a single Redis transaction and push that to Redis, not do ping-pong for every job added.
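For context, a sketch of the chunked addBulk pattern described above (the queue name, job name, and chunk size are illustrative):

```ts
import { Queue } from "bullmq";

const queue = new Queue("automation", {
  connection: { host: "localhost", port: 6379 },
});

// Split a large payload into 1K-job chunks; each chunk goes through a single
// addBulk call, which BullMQ wraps in one ioredis multi/pipeline rather than
// issuing a round trip per job.
async function addInChunks<T>(items: T[], chunkSize = 1000): Promise<void> {
  for (let start = 0; start < items.length; start += chunkSize) {
    const chunk = items.slice(start, start + chunkSize);
    await queue.addBulk(chunk.map((data) => ({ name: "run-test", data })));
  }
}
```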