Improved redis performance in large keyspace, added pagination, and auto-expiration #119
Conversation
Hi @pokonski, looking forward to getting your feedback on this set of changes! I realize they have client-facing implications that make them trickier to integrate. I think the performance benefits of indexing workflow and job ids are worth the trade-offs, but they are definitely trade-offs. If you don't think the index approach is workable, I think probably:
Oh WOW! That is a massive improvement! I am 100% fine with changing data structures for a boost this great (nice touch with the migration ❤️). I also experimented with native Redis Graphs for performance reasons (#96) but that turned out waaay slower (and also Graphs are now deprecated :D). Definitely warrants a major bump now :)
Can I ask you to resolve conflicts here? I merged your other PRs. After that I will merge the Rubocop PR (so the style changes are last).
```ruby
def up
  # subclass responsibility
end
```
Should we consider raising a `NotImplementedError` for this? I think it might help guide and clarify implementation.
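Something like this, for instance (a minimal sketch of the suggestion, not code from the PR):

```ruby
def up
  # Force subclasses to provide their own migration logic.
  raise NotImplementedError, "#{self.class.name} must implement #up"
end
```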
```diff
@@ -22,6 +22,10 @@ def self.find(id)
     Gush::Client.new.find_workflow(id)
   end
 
+  def self.page(start=0, stop=99, order: :asc)
+    Gush::Client.new.workflows(start, stop, order: order)
+  end
+
```
Personally, I do prefer to have 'all positional or all kw' args unless I have a really strong reason not to. Mixing them can be confusing as you make more contexts for usage. I think you do a good job handling the differences in the code, just a style thought.
I was mirroring the redis `zrange` argument style here (though one difference is that that method doesn't have default values for `start` and `stop`). I don't have a strong preference either way. I'm out on vacation at the moment, but if @pokonski also prefers the all-kwarg style I can make that change in a week or two.
I don't have a preference here, so up to you 👍
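For comparison, the all-keyword variant being discussed would look something like this (a sketch, not the merged signature):

```ruby
def self.page(start: 0, stop: 99, order: :asc)
  Gush::Client.new.workflows(start, stop, order: order)
end

# Callers would then always name their arguments:
Gush::Workflow.page(start: 0, stop: 49, order: :desc)
```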
```ruby
ttl ||= configuration.ttl

if ttl&.positive?
  redis.zadd("gush.idx.workflows.expires_at", Time.now.to_f + ttl, workflow.id)
  # ...
end
```
I worry a bit about using the `#to_f` value for this. That value is explicitly an approximation (of the rational representation), and doing math with floats is notoriously... not fun.

We could try to do something using the integer (i.e. `#to_i`) and sub-second (i.e. `#usec` or `#nsec`) values themselves to avoid any approximation and float messiness down the road. The first thing that comes to mind would be some concatenated integer/BigDecimal representation. This kind of approach would even let us be explicit about the precision of our value here, which I do like. The limit being just how big a number Redis will let us put in there :)

This might be 'not worth it' right now (which is a fine answer) but I do worry we might set up some nasty surprises later with a float here.
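A sketch of that fixed-precision idea (the helper name and microsecond precision here are illustrative, not part of the PR):

```ruby
# Combine Time's integer parts into an exact microsecond count,
# sidestepping Float#to_f's approximation of the rational time value.
# Microseconds since the epoch (~1.7e15) still fit in the 53-bit
# integer range that Redis sorted-set scores can represent exactly.
def expiration_score(time, ttl_seconds)
  (time.to_i + ttl_seconds) * 1_000_000 + time.usec
end

expiration_score(Time.now, 3600) # => e.g. 1700003600123456
```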
Redis uses floating point numbers for sorted set scores. I think the rationale for using them here is:
- We need the sub-second precision for ordering workflows by creation time
- We want the integer value to represent seconds (rather than milliseconds or something) so that clients can easily pass in a timestamp value with or without fractional seconds
- We want minimal friction with the redis implementation
- This score is solely for sorting and filtering on timestamp values, so I wouldn't anticipate the kind of problematic floating point imprecision you see with hand-entered values (like money or hours worked or something)
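To illustrate that sort/filter role (a hypothetical query, not code from this PR), the expiration sweep only needs a score-range read on the index:

```ruby
require "redis"

redis = Redis.new
# Workflow ids whose expires_at score is already in the past:
expired_ids = redis.zrangebyscore("gush.idx.workflows.expires_at", "-inf", Time.now.to_f)
```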
True, the Redis side is a float regardless. The imprecision of the estimate for the `Time#to_f` value should be at the nanosecond (or at worst maybe tens of nanoseconds) level, so it's unlikely that we would get any ordering wonkiness unless our throughput here was extremely high (or extremely distributed, etc.). Even then, the ordering being off may or may not be significant. I can definitely see this being a tradeoff that works well for this code.

The other concern would really be on the implementation side, for anyone doing calculations based off the Gush `expires_at` values in their code, but they should take care to check things before they do math with Time floats anyways :)
Thanks for replying!
Had a couple minor thoughts but I agree with @pokonski, this is awesome! Big ups!
@natemontgomery, thanks again for the careful review. I appreciate your thoughts on these changes. @pokonski thanks for your review as well and openness to these changes. I've resolved the conflicts and pushed the new code. Let me know if you think there should be any other changes. If so, I'll be back from vacation in a week and can look at them after that. If not, I'm happy for the code to be merged as-is.
Awesome! I merged the Rubocop PR first so this one should be easier to resolve before merging (just run Rubocop on this branch 🤞).
So, I couldn't stop myself from noodling on the time stuff some more. I think it may be worth considering a use of `Process.clock_gettime`, which lets you pick the clock source. For example, I think I might like to use the Linux CLOCK_TAI, as this provides guarantees that the clock will not encounter leap seconds and the like that are triggered by clocks that are NTP-adjusted. I could also see myself sometimes needing the absolute fastest option, which seems to be CLOCK_REALTIME_COARSE on my Linux. Some of these clocks are slower than the `#to_f` call, and they are system dependent, so having a config option to choose your preferred clock would probably be best, if anything. Here is a simple benchmark for example:
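(A sketch of the kind of comparison described, assuming the benchmark-ips gem on a Linux Ruby where these clock constants are defined:)

```ruby
require "benchmark/ips"

Benchmark.ips do |x|
  x.report("Time.now.to_f")         { Time.now.to_f }
  x.report("CLOCK_REALTIME")        { Process.clock_gettime(Process::CLOCK_REALTIME) }
  x.report("CLOCK_REALTIME_COARSE") { Process.clock_gettime(Process::CLOCK_REALTIME_COARSE) }
  x.report("CLOCK_TAI")             { Process.clock_gettime(Process::CLOCK_TAI) } # Linux-only constant
  x.compare!
end
```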
I think this is probably something for follow-up work, really, as your PR is ready to go and I don't think this is worth waiting on. If anything, I will probably just open a PR when things are settled to suggest some options. TL;DR: I think we might want some config for the clock source, but we can do it later.
@pokonski, I've resolved conflicts and I think this is ready to merge in. ETA: I just added a
Improved redis performance in large keyspace, added pagination, and auto-expiration
Thank you, merging <3
This commit makes several interrelated changes:

1. Replaced the redis key scan to find job keys in `Client#find_workflow` with job class name persistence in `Workflow#to_hash`. This significantly improves performance when loading many workflows because it avoids n key scans.
2. Added `Client#workflow_ids` with sorting by creation timestamp and pagination as an alternative to `Client#all_workflows`, which has performance issues in large keyspaces and returns unwieldy amounts of data given a large number of workflows.
3. Added workflow and job indexes by `created_at` and `expires_at` timestamps. The former is necessary for paging through sorted workflow ids, and the latter is necessary to remove data on expiration.
4. Replaced use of redis key TTL with explicit expiration via `Client#expire_workflows`, since there's no other way to remove data from the indexes.
5. Added a migration file (and infrastructure) to migrate to the new indexes and expiration format.
6. Added `Client#workflows_count` to get a count of all workflows, which is helpful for pagination.

Given a redis instance with 10,000 workflows, this set of changes allows a page of the most recent 100 workflows to be loaded in 0.1 seconds, whereas previously `all_workflows` would take hours to return data. (Or, for a less extreme example of 1000 workflows, we can load 100 workflows in 0.1 seconds compared to `all_workflows` taking 42 seconds.)

Sample code for performance testing:
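A sketch of the kind of measurement described, using the methods named in this PR (the exact harness and the `order: :desc` usage are illustrative assumptions):

```ruby
require "benchmark"
require "gush"

client = Gush::Client.new

# On master: the only option is to load every workflow.
puts Benchmark.realtime { client.all_workflows }

# On this branch: load only the most recent page of 100.
puts Benchmark.realtime { Gush::Workflow.page(0, 99, order: :desc) }
```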