Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log frontiers in progress tracking #539

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

teskje
Copy link
Contributor

@teskje teskje commented Nov 27, 2023

This PR adds logging of frontiers to the reachability tracker. This is in support of hydration progress tracking in Materialize, which would like to know the per-operator frontiers to know which ones have finished processing their snapshot inputs.

Look at the commit messages for more details. The actual logging change is in the second commit, commit 1 and 2 are supporting changes that try to limit the performance impact of the additional logging.

The performance impact of this change should be negligible (or even negative!) when reachability logging is disabled and small when reachability logging is enabled. I'm including microbenchmark results in a comment below.

The reason that commit 3 makes the code more performant is, I think, that is makes it possible to allocate the vec that's passed to log_frontier_updates with the correct size upfront, rather than having to geometrically grow two vecs of unknown final size. We can probably get a similar effect without changing how TrackerEvent looks, by counting the number of target and source changes while pushing to pushed_changes. I like the TrackerEvent change because it removes some duplication, but it's not a required one.

This commit makes the reachability tracker directly return changes to
implications from `propagate_all`, rather than stashing them and
expecting callers to access them through a separate `pushed` method.
This API changes allows `propagate_all` to assume that the caller will
consume the changes before the next call to `propagate_all`, which in
turn guarantees that `pushed_changes` only contains changes from the
last `propagate_all` call, rather than possibly changes from previous
calls as well.
@teskje teskje force-pushed the reachability-frontier-logging branch from 69285e3 to a92df78 Compare November 28, 2023 18:09
This commit changes the shape of `TrackerEvent` to include the
source/target distinction in a `Location` rather than separate enum
variants. Apart from less boilerplate, this speeds up the frontier
logging in `propagate_all` as we can directly use the contents of
`pushed_changes`, instead of having to split them into target and source
changes.

A side effect of this is that the definition of
`ProgressEventTimestampVec` changes, which also affects the
`TimelyProgressEvent` type.
@teskje
Copy link
Contributor Author

teskje commented Nov 29, 2023

Benchmark Results

To test the performance impact of this change, I ran event_driven.rs with 1000 dataflows of a depth of 1000 each for 100k rounds, and measured the time to completion.

The test was performed in two configurations:

  • with reachability logging disabled
  • with reachability logging enabled and a logging closure that simply increments a counter on each event

Below are the results of 10 runs each, for both of the configurations against the master branch and each of the three commits in this PR:

Screenshot 2023-11-29 at 11 47 56

Screenshot 2023-11-29 at 11 48 25

From the results we can see that in this benchmark the additional logging has no negative effect on runtime if logging is disabled, and has a small negative effect when logging is enabled. The latter is expected.

The former is maybe expected, though there is some weirdness here in that the results show that on master disabling logging actually slows Timely down! I'm not sure why this is. This result occurs consistently when the benchmark is repeated at different times, so it doesn't seem to be caused by noise in the system. Maybe some edge case related to CPU caching?

@teskje
Copy link
Contributor Author

teskje commented Nov 29, 2023

I wired operator frontier logging up in Mz to verify that this is useful for tracking hydration progress, and it seems to be!

With a simple cross join dataflow:

CREATE TABLE t (a int);
INSERT INTO t SELECT generate_series(1, 10000);

CREATE VIEW v AS SELECT t1.a + t2.a FROM t t1, t t2 WHERE t1.a < 1000000 AND t2.a > 0;
CREATE DEFAULT INDEX ON v;

We get this frontier output:

materialize=> SELECT id, min(time) as time
FROM mz_internal.mz_dataflow_operator_frontiers
JOIN mz_internal.mz_dataflow_addresses USING (address)
JOIN mz_internal.mz_dataflow_operators USING (id)
WHERE
  time IS NOT NULL AND
  list_length(address) = 3
GROUP BY id
ORDER BY time DESC, id ASC;
  id  |     time
------+---------------
 8587 | 1701260559660
 8592 | 1701260558660
 8594 | 1701260558660
 -- ...
 8636 | 1701260558660
 8658 | 1701260383161
 8659 | 1701260383161
 8661 | 1701260383161
 -- ...
 8674 | 1701260383161
(31 rows)

... which implicates operator 8658 as the one that's stuck. And sure enough, looking at the dataflow graph, this operator is the cross join.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant