Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass data from batcher to builder by chunk #491

Merged
merged 2 commits into from
May 23, 2024

Conversation

antiguru
Copy link
Member

Currently, the data shared between the batcher and the builder are individual tuples, either moved or by reference. This limits flexibility around what kind of data can be provided to a builder, i.e., it has to be in the form of tuples, either owned or a reference to a fully-formed one. This works fine for vector-like structures, but will not work for containers that like to arrange their data differently.

This change alters the contract between the batcher and the builder to provide chunks instead of individual items (it does not require chains.) The data in the chunks must be sorted, and subsequent calls must maintain order, too. The input containers need to implement BuilderInput, a type that describes how a container's items can be broken into key, value, time, and diff, where key and value can be references or owned data, as long as they can be pushed into the underlying key and value containers.

The change has some quirks around comparing keys to keys already in the builder. The types can differ, and the best solution I could come up with was to add two explicit comparison functions to BuilderInput to compare keys and values. While it is not elegant, it allows us to move forward with this change, without adding nightmare-inducing trait bounds all-over.

Copy link
Member

@frankmcsherry frankmcsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good. We discussed a bit about reduce.rs and how to avoid potentially exotic regressions there with large keys and multiple updates. That feels like something we want to think carefully about. Other discussion point was around ergonomics for the BuilderInput trait, and we guessed a bit there and maybe there is a better thing to do and maybe not.

@@ -448,10 +450,10 @@ where
// TODO: It would be better if all updates went into one batch, but timely dataflow prevents
// this as long as it requires that there is only one capability for each message.
let mut buffers = Vec::<(G::Timestamp, Vec<(V, G::Timestamp, T2::Diff)>)>::new();
let mut builders = Vec::new();
let mut chains = Vec::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that this is a potential regression, on account of staging all the data. You mentioned that another pattern is just to flush the buffers .. whenever one likes. I figure we should either jot that down in a comment, or make it be the case if we are worried about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into whether it's possible to populate a builder from ((&K, V), T, R) instead of cloning K. The short answer is no, the long answer is maybe. At the moment, we have two blockers:

  1. Reductions write their output into an arrangement, but bypass the merge batcher. We still need to provide a merge batcher type, even tho it'll never be used. We can test removing it by providing a dummy Batcher trait implementation that is mostly unimplemented!().
  2. The input chunks to the builder need to implement Container, which implies 'static. No way to store references! Remove Container: Clone + 'static timely-dataflow#540 might help here.

I think this is something we should keep as a to-do, but not prioritize until we have evidence it causes performance issues.

@@ -320,6 +363,109 @@ impl BatchContainer for OffsetList {
}
}

/// Behavior to split an update into principal components.
pub trait BuilderInput<L: Layout> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed whether this might work with a BuilderInput<'a, L>. We didn't conclude that it could or couldn't, but if it ended up being C::Item<'a>: BuilderInput<'a, L> that might be clearer than the alternative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought we had: Perhaps BuilderInput<L>: Container, and rather than introduce an Item<'a> here and equate it to C::Item<'a>, the signatures could just use C::Item<'a>.

@antiguru antiguru changed the title Pass data from batcher to builder by chain Pass data from batcher to builder by chunk May 21, 2024
@frankmcsherry
Copy link
Member

This is looking good, modulo the conflicts to resolve!

Currently, the data shared between the batcher and the builder are
individual tuples, either moved or by reference. This limits flexibility
around what kind of data can be provided to a builder, i.e., it has to
be in the form of tuples, either owned or a reference to a fully-formed
one. This works fine for vector-like structures, but will not work for
containers that like to arrange their data differently.

This change alters the contract between the batcher and the builder to
provide chunks instead of individual items (it does not require
_chains_.) The data in the chunks must be sorted, and subsequent calls
must maintain order, too. The input containers need to implement
`BuilderInput`, a type that describes how a container's items can be
broken into key, value, time, and diff, where key and value can be
references or owned data, as long as they can be pushed into the
underlying key and value containers.

The change has some quirks around comparing keys to keys already in the
builder. The types can differ, and the best solution I could come up
with was to add two explicit comparison functions to `BuilderInput` to
compare keys and values. While it is not elegant, it allows us to move
forward with this change, without adding nightmare-inducing trait bounds
all-over.

Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
@antiguru antiguru merged commit 85b126c into TimelyDataflow:master May 23, 2024
7 checks passed
@antiguru antiguru deleted the batcher_input_layout branch May 23, 2024 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants