Read Timeouts and high CPU spike after upgrading from 8.0.5 to 8.0.10 #840
Comments
I don't think v8.0.11 exists. Not sure what is going on with versioning in the master branch. Regardless, the performance work was implemented in v8.0.10 (AFAIK), but perhaps there's a bug that didn't surface in our tests or in my production environment, or a breaking change we didn't catch with regard to upgrading. Is there any way you can share a snippet of those slow log entries? If possible, can you write a performance test like this one that fails? You can run performance tests like this: bin/rspec --require spec_helper --tag perf
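For illustration, a minimal sketch of what such a perf-tagged spec could look like, assuming a hypothetical MyUniqueWorker class; the class name, job count, and time budget are placeholders rather than anything from the project:

```ruby
# Minimal sketch; MyUniqueWorker, the job count, and the time budget are
# placeholders. Run with: bin/rspec --require spec_helper --tag perf
require "benchmark"

RSpec.describe MyUniqueWorker, perf: true do
  it "enqueues a large batch of unique jobs within a time budget" do
    elapsed = Benchmark.realtime do
      100_000.times { |i| described_class.perform_async(i) }
    end

    expect(elapsed).to be < 30 # seconds; the budget is arbitrary here
  end
end
```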
You're right about the master branch; I noticed the commit for version 8.0.11, but it comes after the 8.0.10 tag. So v8.0.10 includes the performance fixes, which might actually be causing my issue then... Unfortunately the slow log (as configured by AWS) does not show the arguments, but luckily it does show the arg count, which is 3.
So reading the new code, there is no score available from ZSCORE and it falls back to the old behaviour. Might it indeed be an upgrade issue? Or could it be that the behaviour of the specified code has slightly changed? At first glance the implementations look the same, however (and keep in mind I'm not too familiar with Lua) I noticed that there used to be a break in the old implementation where the new one has a return, inside the nested for loop. I need to dive into what exactly this function's responsibility is, but could it be that there is more than one item that needs to be ZREMed?
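To make that concern concrete, here is a plain-Ruby analogue of the control-flow difference (an illustration only, not the actual Lua script): a break leaves just the innermost loop, so scanning continues with the next batch, whereas a return leaves the whole function after the first match, so at most one item is ever collected.

```ruby
# Illustrative Ruby analogue only; the real code is a Lua script.
# "batches" stands in for successive ZRANGE result pages.

def matches_with_break(batches, target)
  found = []
  batches.each do |batch|        # outer loop over pages
    batch.each do |entry|        # nested loop over entries in a page
      if entry == target
        found << entry
        break                    # exits only the inner loop; scanning continues
      end
    end
  end
  found                          # can contain several matches
end

def matches_with_return(batches, target)
  batches.each do |batch|
    batch.each do |entry|
      return [entry] if entry == target  # exits the whole function immediately
    end
  end
  []
end

# With matching entries spread across pages, the two differ:
# matches_with_break([["a", "x"], ["x"]], "x")  #=> ["x", "x"]
# matches_with_return([["a", "x"], ["x"]], "x") #=> ["x"]
```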
Yes, I do have a replace strategy.
I perform a rolling deploy, so the two versions were running simultaneously for a moment. However, it did not recover when eventually only the new version was running. I rolled back to 8.0.5 and the issue went away.
I'm hesitant to try the upgrade once more in production. The problem does not occur on our staging environment, so my first thought was that under load the new version put a lot more CPU pressure on the Redis server than the old version did. However, @mhenrixon's remark makes sense too. Since I perform a rolling deploy, v8.0.5 and v8.0.10 run simultaneously for a while. What would happen in this case? Is v8.0.5 calling the Lua functions of v8.0.10 on the Redis server? Would that be problematic? Or is there perhaps a migration issue?
@mjooms any update here? Were you able to write a failing performance test that reproduces the behavior? I'm curious what is going on.
I have written a performance test that simulates the behaviour by using the same worker config and adding 100k of them to the queue, but I wasn't able to identify any issues. It's hard to write a test though, since I don't really know what exactly caused the CPU on the Redis server to spike in the first place... I haven't been able to reproduce it on our staging environment yet either. I don't really want to test it out on production once more, but I guess that's the only way to rule out what you suggested... that the Lua scripts were outdated 😬 Do you have any tips on how I can debug this further?
@mjooms I'm not sure running this in production again is necessary. I'm a little confused as to how v8.0.5 didn't have the buggy behavior that v8.0.10 supposedly did, because all v8.0.10 did was add an additional fast-track. In the worst case, it should behave exactly like v8.0.5. So if it failed in v8.0.10 but NOT v8.0.5, then that signals something in v8.0.10 is backwards incompatible or broken for your unique workload. But without a failing test, I'm not sure what… Could you try upgrading to v8.0.9 and see if that exhibits the same behavior?
Chiming in here to maybe help get eyes on this: I can confirm there still seems to be a performance issue here. We experienced it again, which caused another production outage. Ultimately, we decided to remove our dependency on SUJ since we were no longer using it for anything mission-critical. Unfortunately, that also means we could not dig deeper into the performance issues; it was too risky to debug in production. As far as I can tell, for some reason or another, my original performance patch isn't always working. I spent a few hours, but I couldn't determine why that's the case, so I gave up. I wish there were a happier ending here, but alas. Thanks @mhenrixon for all you've done. Sorry we can't be of more help moving forward. I wish the project good luck. I hope somebody can get to the bottom of this!
Describe the bug
Experienced read timeouts > 5 sec after upgrading from 8.0.5 to 8.0.10, with Redis spiking to high CPU, leading to downtime in our prod environment. The slow log showed many ZRANGE commands.
Expected behavior
Since this is a minor upgrade, I expected everything to keep functioning as usual; perhaps with better performance, but not worse :)
Current behavior
Performance degraded, leading to downtime in our setup. The Redis server became busy to the point where new jobs couldn't be added anymore due to connection and read timeouts.
Worker class
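A placeholder sketch of a worker of this shape (the class name, queue, and lock type are assumptions; on_conflict: :replace reflects the replace strategy mentioned earlier in the thread):

```ruby
# Placeholder sketch; class name, queue, and lock type are assumptions.
class MyScheduledWorker
  include Sidekiq::Worker

  sidekiq_options queue: :default,
                  lock: :until_executing,  # lock type assumed for illustration
                  on_conflict: :replace    # the replace conflict strategy

  def perform(resource_id)
    # actual work omitted
  end
end

# Jobs of this kind are enqueued on a schedule, e.g.:
# MyScheduledWorker.perform_in(900, resource_id) # run ~15 minutes from now
```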
Additional context
We have an average of around 121k of these scheduled jobs in 24h. This worked fine in 8.0.5, but after upgrading to 8.0.10 the Redis server became unresponsive due to high load.
I noticed you and @ezekg have worked on a performance PR recently (#835) which might solve the problem, but I wasn't really sure whether it addresses the same issue. Also, I wasn't able to test 8.0.11 because it hasn't been released yet (it is in the main branch, but not yet on RubyGems nor as a tag).