Improved redis performance in large keyspace, added pagination, and auto-expiration #119

Merged
merged 1 commit into chaps-io:master from redis-indexes on Aug 19, 2024

Conversation

@noahfpf (Contributor) commented Aug 6, 2024

This commit makes several interrelated changes:

  1. Replaced the redis key scan to find job keys in Client#find_workflow with job class name persistence in Workflow#to_hash. This significantly improves performance when loading many workflows because it avoids n key scans.

  2. Added Client#workflow_ids with sorting by creation timestamp and pagination as an alternative to Client#all_workflows, which has performance issues in large keyspaces and returns unwieldy amounts of data given a large number of workflows.

  3. Added workflow and job indexes by created_at and expires_at timestamps. The former is necessary for paging through sorted workflow ids, and the latter is necessary to remove data on expiration.

  4. Replaced use of redis key TTL with explicit expiration via Client#expire_workflows, since there's no other way to remove data from the indexes.

  5. Added a migration file (and infrastructure) to migrate to the new indexes and expiration format.

Given a redis instance with 10,000 workflows, this set of changes allows a page of the most recent 100 workflows to be loaded in 0.1 seconds, whereas previously all_workflows would take hours to return data.

(Or, for a less extreme example of 1000 workflows, we can load 100 workflows in 0.1 seconds compared to all_workflows taking 42 seconds).
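
To make the new surface area concrete, here is a rough usage sketch pieced together from the benchmark snippets and the diff further down; the exact signatures of `workflow_ids` and `expire_workflows` are assumptions here, not confirmed API:

```ruby
client = Gush::Client.new

# First page of 100 workflows vs. everything, zrange-style bounds
page = client.workflows          # workflows 0..99 (defaults, as in the benchmark)
all  = client.workflows(0, -1)   # all workflows

# Ids only, when full workflow objects aren't needed
# (assumed to take the same start/stop arguments as #workflows)
ids = client.workflow_ids(0, 99)

# Explicit expiration sweep that also cleans up the new indexes
# (argument-less call assumed here)
client.expire_workflows
```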

Sample code for performance testing

```ruby
require './lib/gush'
require 'benchmark'

class Prepare < Gush::Job; end
class NormalizeJob < Gush::Job; end

class TestWorkflow < Gush::Workflow
  def configure(foo, bar)
    run Prepare, params: { foo: foo, bar: bar }
    run NormalizeJob, after: Prepare
  end
end

client = Gush::Client.new
client.configuration.ttl = 60 * 60

redis = client.send(:redis)

# !!! This will DELETE all the keys in redis -- only run it in a test environment !!!
# redis.flushall

500.times do
  TestWorkflow.create('foo', 'bar')
end

500.times do
  flow = TestWorkflow.create('foo', 'bar')
  flow.expire!
end

redis.info['db0']
# "keys=3000,expires=1500,avg_ttl=3598013"
```

On master:

```ruby
bm = Benchmark.measure do
  client.all_workflows
end
puts bm
#  4.310334   2.952727   7.263061 ( 41.705005)
```

On this branch:

```ruby
bm = Benchmark.measure do
  client.workflows # first 100
end
puts bm
#  0.016248   0.007818   0.024066 (  0.110479)

bm = Benchmark.measure do
  client.workflows(0, -1) # all 1000
end
puts bm
#  0.088489   0.033171   0.121660 (  0.502829)

bm = Benchmark.measure do
  client.all_workflows
end
puts bm
#  0.116190   0.043420   0.159610 (  0.600842)
```

@noahfpf (Author) commented Aug 6, 2024

Hi @pokonski, looking forward to getting your feedback on this set of changes! I realize they have client-facing implications that make them trickier to integrate. I think the performance benefits of indexing workflow and job ids are worth the trade-offs, but they are definitely trade-offs.

If you don't think the index approach is workable, I think probably:

  1. We should split off the job_klasses change to workflow persistence and loading, because that has a huge performance benefit
  2. I'll probably implement a SQL layer that mirrors the redis persistence so that my organization can have an admin interface to the workflows

@pokonski (Contributor) commented Aug 7, 2024

Oh WOW! That is a massive improvement! I am 100% fine with changing data structures for a boost this great (nice touch with the migration ❤️ ) - I also experimented with native Redis Graphs for performance reasons (#96) but that turned out waaay slower (and also Graphs are now deprecated :D )

Definitely warrants a major bump now :)

@pokonski (Contributor) commented Aug 8, 2024

Can I ask you to resolve conflicts here? I merged your other PRs. After that I will merge the Rubocop PR (so the style changes are last)

```ruby
end

def up
  # subclass responsibility
```


Should we consider raising a NotImplementedError for this? I think it might help clarify that subclasses are expected to implement it.
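
For example, a minimal sketch of the suggestion (illustrative only, not the PR's code):

```ruby
def up
  raise NotImplementedError, "#{self.class.name}#up must be implemented by the migration subclass"
end
```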

```diff
@@ -22,6 +22,10 @@ def self.find(id)
     Gush::Client.new.find_workflow(id)
   end
 
+  def self.page(start=0, stop=99, order: :asc)
+    Gush::Client.new.workflows(start, stop, order: order)
```


Personally, I do prefer to have 'all positional or all kw' args unless I have a really strong reason not to. Mixing them can be confusing as you make more contexts for usage. I think you do a good job handling the differences in the code, just a style thought.

@noahfpf (Author):

I was mirroring the redis zrange argument style here (though one difference is that that method doesn't have default values for start and stop). I don't have a strong preference either way. I'm out on vacation at the moment, but if @pokonski also prefers the all-kwarg style I can make that change in a week or two.
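
For reference, the all-keyword-argument variant under discussion might look roughly like this (purely illustrative, not what the PR implements):

```ruby
def self.page(start: 0, stop: 99, order: :asc)
  Gush::Client.new.workflows(start, stop, order: order)
end
```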

Contributor:

I don't have a preference here, so up to you 👍

```ruby
ttl ||= configuration.ttl

if ttl&.positive?
  redis.zadd("gush.idx.workflows.expires_at", Time.now.to_f + ttl, workflow.id)
```
@natemontgomery commented Aug 8, 2024

I worry a bit about using the #to_f value for this. That value is explicitly an approximation (of the rational representation), and doing math with floats is notoriously ... not fun.

We could try and do something using the integer (ie #to_i) and sub-second (ie #usec or #nsec) values themselves to avoid any approximation and float messiness down the road. The first thing that comes to mind would be some concatenated integer/BigDecimal representation. This kind of approach would even let us be explicit about the precision of our value here, which I do like. The limit being just how big a number Redis will let us put in there :)

This might be 'not worth it' right now (which is a fine answer) but I do worry we might set up some nasty surprises later with a float here.
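
For illustration only, one way to read the integer-based idea, mirroring the zadd call in the excerpt above (hypothetical sketch, not the PR's code):

```ruby
# Hypothetical integer score: seconds and nanoseconds packed into one integer.
# Caveat: Redis sorted-set scores are stored as doubles, so integers above
# 2**53 lose exact precision anyway -- part of why the PR keeps Time#to_f.
now   = Time.now
score = now.to_i * 1_000_000_000 + now.nsec
redis.zadd("gush.idx.workflows.expires_at", score, workflow.id)
```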

@noahfpf (Author):

Redis uses floating point numbers for sorted set scores. I think the rationale for using them here is:

  1. We need the sub-second precision for ordering workflows by creation time
  2. We want the integer value to represent seconds (rather than milliseconds or something) so that clients can easily pass in a timestamp value with or without fractional seconds
  3. We want minimal friction with the redis implementation
  4. This score is solely for sorting and filtering on timestamp values, so I wouldn't anticipate the kind of problematic floating point imprecision you see with hand-entered values (like money or hours worked or something)
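
As a concrete (hedged) example of point 4, the score only ever feeds ordering and range filtering; an expiration sweep needs little more than something along these lines (key name from the diff above, actual implementation in the PR may differ):

```ruby
# Workflow ids whose expires_at score is already in the past
expired_ids = redis.zrangebyscore(
  "gush.idx.workflows.expires_at", "-inf", Time.now.to_f
)

expired_ids.each do |id|
  # delete the workflow's hash/job keys and other index entries here (details omitted)
  redis.zrem("gush.idx.workflows.expires_at", id)
end
```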

@natemontgomery commented Aug 12, 2024

True, the Redis side is a float regardless. The imprecision of the estimate for the Time#to_f value should be at the nanosecond (or at worst maybe 10s of nanoseconds) level so it's unlikely that we would get any ordering wonkiness unless our throughput here was extremely high (or extremely distributed etc). Even then the ordering being off may or may not be significant. Can definitely see this being a tradeoff that works well for this code.

The other concern would really be on the implementation side, for anyone doing calculations based on the Gush expires_at values in their own code, but they should take care before doing math with Time floats anyway :)

Thanks for replying!

@natemontgomery

Had a couple minor thoughts but I agree with @pokonski this is awesome! Big ups!

@noahfpf (Author) commented Aug 11, 2024

@natemontgomery, thanks again for the careful review. I appreciate your thoughts on these changes.

@pokonski thanks for your review as well and your openness to these changes. I've resolved the conflicts and pushed the new code. Let me know if you think there should be any other changes. If so, I'll be back from vacation in a week and can look at them after that. If not, I'm happy for the code to be merged as-is.

@pokonski (Contributor):
Awesome! I merged the Rubocop PR first so this one should be easier to resolve before merging (just run the rubocop on this branch 🤞 )

@natemontgomery

So, I couldn't stop myself from noodling on the time stuff some more. I think it may be worth considering a use of Process.clock_gettime(<clock constant>) at least as a config option.

For example, I think I might like to use the Linux CLOCK_TAI as this provides guarantees that the clock will not encounter leap seconds and the like that are triggered by clocks that are NTP adjusted.

I could also see myself sometimes needing the absolute fastest option which seems to be the CLOCK_REALTIME_COARSE for my Linux.

Some of these clocks are slower than the #to_f call, and they are system-dependent, so if anything a config option providing a mechanism to choose your preferred clock would probably be best.

Here is a simple benchmark for example:

```
Rehearsal ----------------------------------------------------------
#to_r                    0.019512   0.000000   0.019512 (  0.019510)
#to_f                    0.004803   0.000000   0.004803 (  0.004803)
#to_i                    0.003160   0.000000   0.003160 (  0.003161)
CLOCK_REALTIME           0.006418   0.000000   0.006418 (  0.006419)
CLOCK_REALTIME_COARSE    0.004633   0.000000   0.004633 (  0.004634)
CLOCK_TAI                0.006341   0.000000   0.006341 (  0.006341)
------------------------------------------------- total: 0.044867sec

                            user     system      total        real
#to_r                    0.020774   0.000000   0.020774 (  0.020782)
#to_f                    0.005062   0.000000   0.005062 (  0.005061)
#to_i                    0.003170   0.000000   0.003170 (  0.003168)
CLOCK_REALTIME           0.006273   0.000000   0.006273 (  0.006271)
CLOCK_REALTIME_COARSE    0.004497   0.000000   0.004497 (  0.004495)
CLOCK_TAI                0.006351   0.000000   0.006351 (  0.006349)
```
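
The script itself isn't included in the comment, but a Benchmark.bmbm sketch along these lines would produce output in this shape (CLOCK_TAI and CLOCK_REALTIME_COARSE assumed available, i.e. Linux; iteration count is arbitrary):

```ruby
require 'benchmark'

N = 100_000

Benchmark.bmbm do |x|
  x.report('#to_r')                 { N.times { Time.now.to_r } }
  x.report('#to_f')                 { N.times { Time.now.to_f } }
  x.report('#to_i')                 { N.times { Time.now.to_i } }
  x.report('CLOCK_REALTIME')        { N.times { Process.clock_gettime(Process::CLOCK_REALTIME) } }
  x.report('CLOCK_REALTIME_COARSE') { N.times { Process.clock_gettime(Process::CLOCK_REALTIME_COARSE) } }
  x.report('CLOCK_TAI')             { N.times { Process.clock_gettime(Process::CLOCK_TAI) } }
end
```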

I think this is probably something for follow-up work, really, as your PR is ready to go and I don't think it's worth waiting on this.

If anything, I will probably just open a PR when things are settled to suggest some options.

TL;DR: I think we might want some config for clock source but we can do it later.

@noahfpf (Author) commented Aug 19, 2024

@pokonski, I've resolved conflicts and I think this is ready to merge in.

ETA: I just added a Client#workflows_count method which supports client pagination.
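
A small pagination sketch using it (the argument-less signature of workflows_count and the page size are assumptions):

```ruby
client    = Gush::Client.new
page_size = 100

total = client.workflows_count
pages = (total.to_f / page_size).ceil

pages.times do |page|
  start = page * page_size
  stop  = start + page_size - 1
  client.workflows(start, stop).each { |flow| puts flow.id }
end
```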

Commit message: Improved redis performance in large keyspace, added pagination, and auto-expiration

This commit makes several interrelated changes:

1. Replaced the redis key scan to find job keys in `Client#find_workflow` with
   job class name persistence in `Workflow#to_hash`. This significantly improves
   performance when loading many workflows because it avoids n key scans.

2. Added `Client#workflow_ids` with sorting by creation timestamp and pagination
   as an alternative to `Client#all_workflows`, which has performance issues in
   large keyspaces and returns unwieldy amounts of data given a large number of
   workflows.

3. Added workflow and job indexes by `created_at` and `expires_at` timestamps.
   The former is necessary for paging through sorted workflow ids, and the latter
   is necessary to remove data on expiration.

4. Replace use of redis key TTL with explicit expiration via `Client#expire_workflows`,
   since there's no other way to remove data from the indexes.

5. Added a migration file (and infrastructure) to migrate to the new indexes
   and expiration format.

6. Added `Client#workflows_count` to get a count of all workflows, which is
   helpful for pagination.

Given a redis instance with 10,000 workflows, this set of changes allows a page
of the most recent 100 workflows to be loaded in 0.1 seconds, whereas previously
`all_workflows` would take hours to return data.

(Or, for a less extreme example of 1000 workflows, we can load 100 workflows in
0.1 seconds compared to `all_workflows` taking 42 seconds).

@pokonski (Contributor):
Thank you, merging <3

@pokonski merged commit bcaae81 into chaps-io:master on Aug 19, 2024
12 checks passed
@noahfpf deleted the redis-indexes branch on August 19, 2024 at 21:05