Arrangement batch formation costs in proportion to outstanding updates #460
I implemented this idea as part of MaterializeInc/materialize#25720 and also studied the linked paper out of interest. I noticed an interesting property that I thought I'd share. In the paper the authors talk about whether a chain decomposition of a poset is optimal, with the lower bound being the width of the poset, i.e. the size of its largest antichain. As an example, if we take […]
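As a stand-in illustration of that width lower bound (my own toy example, not necessarily the one from the comment): take the four product-order timestamps (0,0), (0,1), (1,0), (1,1). The antichain {(0,1), (1,0)} shows the width is 2, so no chain decomposition can use fewer than two chains, and

$$\{(0,0) \le (0,1) \le (1,1)\}, \qquad \{(1,0)\}$$

achieves exactly two, so it is optimal.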
The challenge of using this generally is that you don't end up with optimal-width chains using this technique for common patterns with nested scopes (vs. the Kafka partition timestamps I think you are looking at). Specifically, if timestamps have the nested-scope product form (outer, inner), they are already sorted lexicographically, but the runs are very short and very numerous, as the sketch below illustrates.
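A toy demonstration of that effect, using timely's `Product` timestamps (the particular values here are illustrative, not from the comment): under the product partial order, a run breaks every time the inner counter resets.

```rust
use timely::order::Product;
use timely::PartialOrder;

fn main() {
    // Nested-scope timestamps, already in lexicographic order.
    let times = vec![
        Product::new(0, 0), Product::new(0, 1), Product::new(0, 2),
        Product::new(1, 0), Product::new(1, 1),
        Product::new(2, 0),
    ];
    // A chain breaks wherever consecutive elements are incomparable:
    // (0, 2) is not less_equal (1, 0), for example.
    for pair in times.windows(2) {
        println!("{:?} <= {:?} : {}", pair[0], pair[1], pair[0].less_equal(&pair[1]));
    }
}
```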
But you are totally right: if you have a reason to think one of the domains is small, that's amazing, and you should order by that first. For example, if we had a cunning way to know that we were looking at […]
When an arrangement builder is presented with (far) future updates, they linger in the holding pen and are reconsidered for each batch that is formed. This increases the cost of "extracting" a batch from "proportional to the batch size" to "proportional to all outstanding updates". This is especially noticeable in Materialize with its `mz_now()` temporal filters, which introduce (far) future retractions.

To see this in the existing codebase, consider a modification to `examples/hello.rs` along the following lines.
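(The sketch below is a reconstruction of the kind of modification described, not the exact diff from the issue; the collection contents, sizes, reporting, and the `arrange_by_self` choice are illustrative.)

```rust
use timely::dataflow::operators::Probe;
use differential_dataflow::input::Input;
use differential_dataflow::operators::arrange::ArrangeBySelf;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let timer = std::time::Instant::now();

        // An arrangement whose batch formation we want to stress.
        let (mut input, probe) = worker.dataflow::<u64, _, _>(|scope| {
            let (input, data) = scope.new_collection::<u64, isize>();
            let probe = data.arrange_by_self().stream.probe();
            (input, probe)
        });

        // Bulk load at time 0, pairing each insertion with a retraction at
        // the far-future time 1_000_000, as a temporal filter would.
        for i in 0 .. 1_000_000u64 {
            input.update_at(i, 0, 1);
            input.update_at(i, 1_000_000, -1);
        }

        // Feed single updates; each batch extraction now re-examines the
        // million still-outstanding future retractions.
        for round in 1 .. 10u64 {
            input.advance_to(round);
            input.update(round, 1);
            input.flush();
            while probe.less_than(input.time()) { worker.step(); }
            println!("{:?}\tround {} complete", timer.elapsed(), round);
        }
    }).unwrap();
}
```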
The modification does the bulk loading of data at the start using a future time (`1_000_000`) which, for our purposes, lingers indefinitely. Unmodified, the example reports small, steady per-round times; modified, the per-round times balloon, because every batch formation re-examines the outstanding future retractions. So, clearly, there is an orders-of-magnitude difference between the two.
The best remedy at the moment seems to be to do the batch organization "by time" rather than "by data". This is too bad because batches will want the updates organized by data, but the frontier-based extraction would prefer to have them organized by time. However, due to the partial order on times, there's no guarantee that we can do anything quickly; binary search does not apply in the same way it does for totally ordered times.
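For totally ordered times the "by time" organization is easy to exploit; a hypothetical sketch (this helper is illustrative, not existing API): keep the pen sorted by time, and each batch boundary is a single binary search.

```rust
/// Hypothetical sketch, not existing API: with totally ordered times, a
/// holding pen kept sorted by time yields each batch by binary search, in
/// time proportional to the extracted batch plus O(log n).
fn extract_ready<D: Ord, T: Ord>(
    pen: &mut Vec<(D, T, isize)>,   // maintained sorted by time
    frontier: &T,                   // extract updates with time < frontier
) -> Vec<(D, T, isize)> {
    let split = pen.partition_point(|(_, time, _)| time < frontier);
    let mut batch: Vec<_> = pen.drain(..split).collect();
    // Batches want their contents ordered by data, so sort only what we took.
    batch.sort_by(|x, y| (&x.0, &x.1).cmp(&(&y.0, &y.1)));
    batch
}
```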
Nonetheless, it seems like a good idea to build the partially robust operator: one that orders by time and, to support partially ordered times, probably maintains runs of `less_equal`-ordered times, which do admit binary search (though our updates may form an arbitrarily large number of runs).

Future work might look into Sorting and Selection in Posets, which aims for bounds related to the "width" of the poset: the largest antichain. There will be no such bound with the above technique.
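A sketch of that run-based pen (hypothetical, assuming timely's `PartialOrder` and `Antichain`): each run is a chain, along which `frontier.less_equal(time)` is monotone, so binary search applies run by run.

```rust
use timely::progress::Antichain;
use timely::PartialOrder;

/// Hypothetical sketch: extract updates strictly before `frontier` from a
/// pen maintained as runs of `less_equal`-ordered times (each run a chain).
/// Cost is proportional to the extracted updates plus O(log) per run, but
/// nothing bounds how many runs the pen may hold.
fn extract_ready<D, T: PartialOrder>(
    runs: &mut Vec<Vec<(D, T, isize)>>,
    frontier: &Antichain<T>,
) -> Vec<(D, T, isize)> {
    let mut ready = Vec::new();
    for run in runs.iter_mut() {
        // Along a chain, once the frontier is less_equal some time it is
        // less_equal all later times, so the predicate below is monotone
        // and partition_point's binary search is valid.
        let split = run.partition_point(|(_, time, _)| !frontier.less_equal(time));
        ready.extend(run.drain(..split));
    }
    runs.retain(|run| !run.is_empty());
    ready
}
```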
cc: @antiguru, @petrosagg