Improved redis performance in large keyspace, added pagination, and auto-expiration #119

Merged
merged 1 commit into chaps-io:master from redis-indexes on Aug 19, 2024

Conversation

@noahfpf (Contributor) commented Aug 6, 2024

This commit makes several interrelated changes:

  1. Replaced the redis key scan to find job keys in Client#find_workflow with job class name persistence in Workflow#to_hash. This significantly improves performance when loading many workflows because it avoids n key scans.

  2. Added Client#workflow_ids with sorting by creation timestamp and pagination as an alternative to Client#all_workflows, which has performance issues in large keyspaces and returns unwieldy amounts of data given a large number of workflows.

  3. Added workflow and job indexes by created_at and expires_at timestamps. The former is necessary for paging through sorted workflow ids, and the latter is necessary to remove data on expiration.

  4. Replaced use of redis key TTL with explicit expiration via Client#expire_workflows, since there's no other way to remove data from the indexes.

  5. Added a migration file (and infrastructure) to migrate to the new indexes and expiration format.

Given a redis instance with 10,000 workflows, this set of changes allows a page of the most recent 100 workflows to be loaded in 0.1 seconds, whereas previously all_workflows would take hours to return data.

(Or, for a less extreme example of 1000 workflows, we can load 100 workflows in 0.1 seconds compared to all_workflows taking 42 seconds).
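
To make the new surface area concrete, here is a rough usage sketch pieced together from the benchmark snippets and the diff further down; the exact signatures of `workflow_ids` and `expire_workflows` are assumptions here, not confirmed API:

```ruby
client = Gush::Client.new

# First page of 100 workflows vs. everything, zrange-style bounds
page = client.workflows          # workflows 0..99 (defaults, as in the benchmark)
all  = client.workflows(0, -1)   # all workflows

# Ids only, when full workflow objects aren't needed
# (assumed to take the same start/stop arguments as #workflows)
ids = client.workflow_ids(0, 99)

# Explicit expiration sweep that also cleans up the new indexes
# (argument-less call assumed here)
client.expire_workflows
```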

Sample code for performance testing

```ruby
require './lib/gush'
require 'benchmark'

class Prepare < Gush::Job; end
class NormalizeJob < Gush::Job; end

class TestWorkflow < Gush::Workflow
  def configure(foo, bar)
    run Prepare, params: { foo: foo, bar: bar }
    run NormalizeJob, after: Prepare
  end
end

client = Gush::Client.new
client.configuration.ttl = 60 * 60

redis = client.send(:redis)

# !!! This will DELETE all the keys in redis -- only run it in a test environment !!!
# redis.flushall

500.times do
  TestWorkflow.create('foo', 'bar')
end

500.times do
  flow = TestWorkflow.create('foo', 'bar')
  flow.expire!
end

redis.info['db0']
# "keys=3000,expires=1500,avg_ttl=3598013"
```

On master:

```ruby
bm = Benchmark.measure do
  client.all_workflows
end
puts bm
#  4.310334   2.952727   7.263061 ( 41.705005)
```

On this branch:

```ruby
bm = Benchmark.measure do
  client.workflows # first 100
end
puts bm
#  0.016248   0.007818   0.024066 (  0.110479)

bm = Benchmark.measure do
  client.workflows(0, -1) # all 1000
end
puts bm
#  0.088489   0.033171   0.121660 (  0.502829)

bm = Benchmark.measure do
  client.all_workflows
end
puts bm
#  0.116190   0.043420   0.159610 (  0.600842)
```

@noahfpf (Author) commented Aug 6, 2024

Hi @pokonski, looking forward to getting your feedback on this set of changes! I realize they have client-facing implications that make them trickier to integrate. I think the performance benefits of indexing workflow and job ids are worth the trade-offs, but they are definitely trade-offs.

If you don't think the index approach is workable, I think probably:

  1. We should split off the job_klasses change to workflow persistence and loading, because that has a huge performance benefit
  2. I'll probably implement a SQL layer that mirrors the redis persistence so that my organization can have an admin interface to the workflows

@pokonski (Contributor) commented Aug 7, 2024

Oh WOW! That is a massive improvement! I am 100% fine with changing data structures for a boost this great (nice touch with the migration ❤️ ) - I also experimented with native Redis Graphs for performance reasons (#96) but that turned out waaay slower (and also Graphs are now deprecated :D )

Definitely warrants a major bump now :)

@pokonski (Contributor) commented Aug 8, 2024

Can I ask you to resolve conflicts here? I merged your other PRs. After that I will merge the Rubocop PR (so the style changes are last)

```ruby
end

def up
  # subclass responsibility
```


Should we consider raising a NotImplementedError for this? I think it might help clarify that subclasses are expected to implement it.
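
For example, a minimal sketch of the suggestion (illustrative only, not the PR's code):

```ruby
def up
  raise NotImplementedError, "#{self.class.name}#up must be implemented by the migration subclass"
end
```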

```diff
@@ -22,6 +22,10 @@ def self.find(id)
     Gush::Client.new.find_workflow(id)
   end
 
+  def self.page(start=0, stop=99, order: :asc)
+    Gush::Client.new.workflows(start, stop, order: order)
```


Personally, I do prefer to have 'all positional or all kw' args unless I have a really strong reason not to. Mixing them can be confusing as you make more contexts for usage. I think you do a good job handling the differences in the code, just a style thought.

@noahfpf (Author):

I was mirroring the redis zrange argument style here (though one difference is that that method doesn't have default values for start and stop). I don't have a strong preference either way. I'm out on vacation at the moment, but if @pokonski also prefers the all-kwarg style I can make that change in a week or two.
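
For reference, the all-keyword-argument variant under discussion might look roughly like this (purely illustrative, not what the PR implements):

```ruby
def self.page(start: 0, stop: 99, order: :asc)
  Gush::Client.new.workflows(start, stop, order: order)
end
```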

Contributor:

I don't have a preference here, so up to you 👍

```ruby
ttl ||= configuration.ttl

if ttl&.positive?
  redis.zadd("gush.idx.workflows.expires_at", Time.now.to_f + ttl, workflow.id)
```
@natemontgomery commented Aug 8, 2024

I worry a bit about using the #to_f value for this. That value is explicitly an approximation (of the rational representation), and doing math with floats is notoriously ... not fun.

We could try and do something using the integer (ie #to_i) and sub-second (ie #usec or #nsec) values themselves to avoid any approximation and float messiness down the road. The first thing that comes to mind would be some concatenated integer/BigDecimal representation. This kind of approach would even let us be explicit about the precision of our value here, which I do like. The limit being just how big a number Redis will let us put in there :)

This might be 'not worth it' right now (which is a fine answer) but I do worry we might set up some nasty surprises later with a float here.
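
For illustration only, one way to read the integer-based idea, mirroring the zadd call in the excerpt above (hypothetical sketch, not the PR's code):

```ruby
# Hypothetical integer score: seconds and nanoseconds packed into one integer.
# Caveat: Redis sorted-set scores are stored as doubles, so integers above
# 2**53 lose exact precision anyway -- part of why the PR keeps Time#to_f.
now   = Time.now
score = now.to_i * 1_000_000_000 + now.nsec
redis.zadd("gush.idx.workflows.expires_at", score, workflow.id)
```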

@noahfpf (Author):

Redis uses floating point numbers for sorted set scores. I think the rationale for using them here is:

  1. We need the sub-second precision for ordering workflows by creation time
  2. We want the integer value to represent seconds (rather than milliseconds or something) so that clients can easily pass in a timestamp value with or without fractional seconds
  3. We want minimal friction with the redis implementation
  4. This score is solely for sorting and filtering on timestamp values, so I wouldn't anticipate the kind of problematic floating point imprecision you see with hand-entered values (like money or hours worked or something)
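
As a concrete (hedged) example of point 4, the score only ever feeds ordering and range filtering; an expiration sweep needs little more than something along these lines (key name from the diff above, actual implementation in the PR may differ):

```ruby
# Workflow ids whose expires_at score is already in the past
expired_ids = redis.zrangebyscore(
  "gush.idx.workflows.expires_at", "-inf", Time.now.to_f
)

expired_ids.each do |id|
  # delete the workflow's hash/job keys and other index entries here (details omitted)
  redis.zrem("gush.idx.workflows.expires_at", id)
end
```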

@natemontgomery commented Aug 12, 2024

True, the Redis side is a float regardless. The imprecision of the estimate for the Time#to_f value should be at the nanosecond (or at worst maybe 10s of nanoseconds) level so it's unlikely that we would get any ordering wonkiness unless our throughput here was extremely high (or extremely distributed etc). Even then the ordering being off may or may not be significant. Can definitely see this being a tradeoff that works well for this code.

The other concern would really be on the implementation side, for anyone doing calculations based on the Gush expires_at values in their own code, but they should take care before doing math with Time floats anyway :)

Thanks for replying!

@natemontgomery

Had a couple minor thoughts but I agree with @pokonski this is awesome! Big ups!

@noahfpf (Author) commented Aug 11, 2024

@natemontgomery, thanks again for the careful review. I appreciate your thoughts on these changes.

@pokonski thanks for your review as well and your openness to these changes. I've resolved the conflicts and pushed the new code. Let me know if you think there should be any other changes. If so, I'll be back from vacation in a week and can look at them after that. If not, I'm happy for the code to be merged as-is.

@pokonski (Contributor):
Awesome! I merged the Rubocop PR first so this one should be easier to resolve before merging (just run the rubocop on this branch 🤞 )

@natemontgomery

So, I couldn't stop myself from noodling on the time stuff some more. I think it may be worth considering a use of Process.clock_gettime(<clock constant>) at least as a config option.

For example, I think I might like to use the Linux CLOCK_TAI as this provides guarantees that the clock will not encounter leap seconds and the like that are triggered by clocks that are NTP adjusted.

I could also see myself sometimes needing the absolute fastest option which seems to be the CLOCK_REALTIME_COARSE for my Linux.

Some of these clocks are slower than the #to_f call, and they are system-dependent, so if anything a config option providing a mechanism to choose your preferred clock would probably be best.

Here is a simple benchmark for example:

```
Rehearsal ----------------------------------------------------------
#to_r                    0.019512   0.000000   0.019512 (  0.019510)
#to_f                    0.004803   0.000000   0.004803 (  0.004803)
#to_i                    0.003160   0.000000   0.003160 (  0.003161)
CLOCK_REALTIME           0.006418   0.000000   0.006418 (  0.006419)
CLOCK_REALTIME_COARSE    0.004633   0.000000   0.004633 (  0.004634)
CLOCK_TAI                0.006341   0.000000   0.006341 (  0.006341)
------------------------------------------------- total: 0.044867sec

                            user     system      total        real
#to_r                    0.020774   0.000000   0.020774 (  0.020782)
#to_f                    0.005062   0.000000   0.005062 (  0.005061)
#to_i                    0.003170   0.000000   0.003170 (  0.003168)
CLOCK_REALTIME           0.006273   0.000000   0.006273 (  0.006271)
CLOCK_REALTIME_COARSE    0.004497   0.000000   0.004497 (  0.004495)
CLOCK_TAI                0.006351   0.000000   0.006351 (  0.006349)
```
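
The script itself isn't included in the comment, but a Benchmark.bmbm sketch along these lines would produce output in this shape (CLOCK_TAI and CLOCK_REALTIME_COARSE assumed available, i.e. Linux; iteration count is arbitrary):

```ruby
require 'benchmark'

N = 100_000

Benchmark.bmbm do |x|
  x.report('#to_r')                 { N.times { Time.now.to_r } }
  x.report('#to_f')                 { N.times { Time.now.to_f } }
  x.report('#to_i')                 { N.times { Time.now.to_i } }
  x.report('CLOCK_REALTIME')        { N.times { Process.clock_gettime(Process::CLOCK_REALTIME) } }
  x.report('CLOCK_REALTIME_COARSE') { N.times { Process.clock_gettime(Process::CLOCK_REALTIME_COARSE) } }
  x.report('CLOCK_TAI')             { N.times { Process.clock_gettime(Process::CLOCK_TAI) } }
end
```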

I think this is probably something for follow-up work, really, as your PR is ready to go and I don't think it's worth waiting on this.

If anything, I will probably just open a PR when things are settled to suggest some options.

TL;DR: I think we might want some config for clock source but we can do it later.

@noahfpf (Author) commented Aug 19, 2024

@pokonski, I've resolved conflicts and I think this is ready to merge in.

ETA: I just added a Client#workflows_count method which supports client pagination.
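
A small pagination sketch using it (the argument-less signature of workflows_count and the page size are assumptions):

```ruby
client    = Gush::Client.new
page_size = 100

total = client.workflows_count
pages = (total.to_f / page_size).ceil

pages.times do |page|
  start = page * page_size
  stop  = start + page_size - 1
  client.workflows(start, stop).each { |flow| puts flow.id }
end
```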

Commit message: Improved redis performance in large keyspace, added pagination, and auto-expiration

This commit makes several interrelated changes:

1. Replaced the redis key scan to find job keys in `Client#find_workflow` with
   job class name persistence in `Workflow#to_hash`. This significantly improves
   performance when loading many workflows because it avoids n key scans.

2. Added `Client#workflow_ids` with sorting by creation timestamp and pagination
   as an alternative to `Client#all_workflows`, which has performance issues in
   large keyspaces and returns unwieldy amounts of data given a large number of
   workflows.

3. Added workflow and job indexes by `created_at` and `expires_at` timestamps.
   The former is necessary for paging through sorted workflow ids, and the latter
   is necessary to remove data on expiration.

4. Replace use of redis key TTL with explicit expiration via `Client#expire_workflows`,
   since there's no other way to remove data from the indexes.

5. Added a migration file (and infrastructure) to migrate to the new indexes
   and expiration format.

6. Added `Client#workflows_count` to get a count of all workflows, which is
   helpful for pagination.

Given a redis instance with 10,000 workflows, this set of changes allows a page
of the most recent 100 workflows to be loaded in 0.1 seconds, whereas previously
`all_workflows` would take hours to return data.

(Or, for a less extreme example of 1000 workflows, we can load 100 workflows in
0.1 seconds compared to `all_workflows` taking 42 seconds).

@pokonski (Contributor):
Thank you, merging <3

@pokonski merged commit bcaae81 into chaps-io:master on Aug 19, 2024
12 checks passed
@noahfpf deleted the redis-indexes branch on August 19, 2024 at 21:05