storage: infrastructure for independent source output streams #30858

Open
wants to merge 3 commits into base: main
Conversation

@petrosagg (Contributor) commented Dec 17, 2024

Motivation

This PR lays the groundwork for allowing a source implementation to produce an independent DD collection per output.

Before this PR, the SourceRender trait required a single, multiplexed DD collection to be produced, of type (usize, D), where the usize designated the output. Since all outputs were multiplexed, a single frontier had to describe their overall progress, and that frontier tracked the "slowest" output. This is generally fine when all outputs march forward more or less together, but it breaks down when a new subsource is added to a source that has otherwise been running for a while. In that situation the upper frontier of the multiplexed collection would necessarily stay stuck until the new subsource finished its snapshot and caught up with the other ones, making the previously healthy subsources unavailable for all that time.

This PR fixes this by requiring a BTreeMap<GlobalId, Collection> output type from source implementations. This way each subsource can be driven independently, and a new subsource can be added without imposing a frontier stall on the previously ingested subsources.
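Roughly, the shape of the interface change, sketched with placeholder types (Collection, GlobalId, and SourceMessage below are simplified stand-ins, not the real definitions in the storage and repr crates):

use std::collections::BTreeMap;

// Simplified stand-ins for illustration only.
struct Collection<D>(Vec<D>);
type GlobalId = u64;
struct SourceMessage;

// Before: a single multiplexed collection, with the usize picking the output.
trait SourceRenderOld {
    fn render(&self) -> Collection<(usize, SourceMessage)>;
}

// After: one independently progressing collection per export, keyed by its id.
trait SourceRenderNew {
    fn render(&self) -> BTreeMap<GlobalId, Collection<SourceMessage>>;
}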

This PR does not change any of the source implementations to take advantage of the new interface, since that would create a giant PR. Instead, it only changes the interface and handles the fallout in the various generic parts of the pipeline. Follow-up PRs will target individual source implementations and change them to directly produce multiple collections.

Tips for reviewer

I went through the diff and left comments explaining parts of the changes.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

pub index: usize,
/// The object that this status message is about. When None, it refers to the entire ingestion
/// as a whole. When Some, it refers to a specific subsource.
pub id: Option<GlobalId>,
petrosagg (Contributor, Author) commented:

The first set of changes relates to making the healthcheck operator identify statuses by the GlobalId of the subsource instead of by an index. The healthcheck operator has a special GlobalId, called the halting_id, which is the global id whose statuses are allowed to halt the source. Instead of having a magic id, I chose to represent those special statuses as the ones having id: None, which can be thought of as status updates for the entire ingestion.
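A minimal sketch of the None-vs-Some semantics described here, with placeholder types standing in for the real status struct and GlobalId:

// Placeholder stand-ins for illustration; the real status struct and
// GlobalId type live in the storage and repr crates.
type GlobalId = u64;

struct HealthStatusMessage {
    /// None: the status applies to the ingestion as a whole.
    /// Some(id): the status applies to one specific subsource.
    id: Option<GlobalId>,
}

/// Only whole-ingestion statuses may halt the source, replacing the previous
/// comparison against a magic halting_id.
fn may_halt(status: &HealthStatusMessage) -> bool {
    status.id.is_none()
}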

@@ -57,11 +56,6 @@ impl GeneralSourceMetricDefs {
Self {
// TODO(guswynn): some of these metrics are not clear when subsources are involved, and
// should be fixed
capability: registry.register(metric!(
petrosagg (Contributor, Author) commented:

I removed this metric as it doesn't seem to be all that useful. We don't even have a panel for it in Grafana, and we have a separate metric that tracks the frontier of each subsource. Since this metric was populated from an operator I deleted, I chose to remove it as well instead of finding a new home for it.

@@ -86,42 +80,40 @@ impl GeneralSourceMetricDefs {
progress: registry.register(metric!(
name: "mz_source_progress",
help: "A timestamp gauge representing forward progess in the data shard",
var_labels: ["source_id", "output", "shard", "worker_id"],
var_labels: ["source_id", "shard", "worker_id"],
petrosagg (Contributor, Author) commented:

I removed all notions of "output index" from metrics. The index is not exposed anywhere externally, and even in our logs we identify everything by its id.

@@ -272,7 +272,7 @@ pub fn build_ingestion_dataflow<A: Allocate>(
let base_source_config = RawSourceCreationConfig {
name: format!("{}-{}", connection.name(), primary_source_id),
id: primary_source_id,
source_exports: description.indexed_source_exports(&primary_source_id),
source_exports: description.source_exports.clone(),
petrosagg (Contributor, Author) commented:

The entire notion of, and struct for, "indexed source exports" is gone. Same theme here: we can use a GlobalId anywhere we need to identify a particular output instead of a usize.

use timely::Container;

/// Partition a stream of records into multiple streams.
pub trait PartitionCore<G: Scope, C: Container> {
petrosagg (Contributor, Author) commented:

This is a helper trait until TimelyDataflow/timely-dataflow#610 is merged.

teskje (Contributor) replied:

Mind adding that as a comment too?
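One possible shape for such a comment (the wording is only a suggestion and the trait body is elided in this sketch):

use timely::dataflow::Scope;
use timely::Container;

/// Partition a stream of records into multiple streams.
///
/// NOTE: temporary helper until TimelyDataflow/timely-dataflow#610 is merged
/// upstream, at which point this trait can be replaced by the upstream one.
pub trait PartitionCore<G: Scope, C: Container> {
    // Trait methods elided in this sketch.
}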

};
output_map
.entry(output_type)
.or_insert_with(Vec::new)
.push(export.ingestion_output);
.push(idx);
petrosagg (Contributor, Author) commented:

The idea in all of these is that indexing things by a usize is an implementation detail that sources can choose to use. So anything that was previously using ingestion_output has been switched to use the index of the export as found in the source_exports map.
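A simplified sketch of the resulting id-to-index mapping, with a plain BTreeMap standing in for the real source_exports map (names are illustrative):

use std::collections::BTreeMap;

type GlobalId = u64; // placeholder for mz_repr::GlobalId

/// The usize is now purely an implementation detail: the position of an
/// export's id within the (ordered) source_exports map.
fn output_index_for<V>(source_exports: &BTreeMap<GlobalId, V>, id: GlobalId) -> Option<usize> {
    source_exports.keys().position(|k| *k == id)
}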

let mut data_collections = BTreeMap::new();
for (id, data_stream) in config.source_exports.keys().zip_eq(data_streams) {
    data_collections.insert(*id, data_stream.as_collection());
}
petrosagg (Contributor, Author) commented:

This is another common piece of code that this PR adds to all implementations. It converts the legacy multiplexed output of a source into the non-multiplexed form that the new interface requires. These conversions will be removed as we start moving individual sources over to the new interface.
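A non-dataflow sketch of what that compatibility shim does, with Vecs standing in for Timely streams and differential collections (all names are placeholders):

use std::collections::BTreeMap;

type GlobalId = u64; // placeholder for mz_repr::GlobalId

/// Split a legacy multiplexed output of (output_index, datum) pairs into one
/// bucket per export id, pairing the i-th partition with the i-th key of the
/// ordered export map, much like the zip_eq in the snippet above.
fn demultiplex<D>(
    export_ids: &[GlobalId],
    multiplexed: Vec<(usize, D)>,
) -> BTreeMap<GlobalId, Vec<D>> {
    let mut partitions: Vec<Vec<D>> = (0..export_ids.len()).map(|_| Vec::new()).collect();
    for (index, datum) in multiplexed {
        // The usize index picks the partition; out-of-range indices would
        // indicate a bug in the source implementation.
        partitions[index].push(datum);
    }
    export_ids.iter().copied().zip(partitions).collect()
}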

})
.capture_into(PusherCapture(reclock_pusher));
for (id, export) in exports {
let (reclock_pusher, reclocked) = reclock(&remap_collection, config.as_of.clone());
petrosagg (Contributor, Author) commented:

Here is the first substantial change of this PR. Instead of rendering a single reclock operator, we iterate over all the exports and instantiate one reclock operator per output.

export_handles.push((id, export_input, export_output));
let new_export: StackedCollection<G, Result<SourceMessage, DataflowError>> =
new_export.as_collection();
export_collections.insert(id, new_export);
petrosagg (Contributor, Author) commented:

This loop is one that needs quite a bit of attention from the reviewer. This is an operator that simply passes through the data produced by the source and records some statistics and health-related information as it sees the data passing by. Now that each output of a source is a separate stream, this is an operator with N inputs and N outputs. The operator connections must be such that the i-th input is only connected to the i-th output, plus the progress collection.

It is important that we get this right, since the failure mode will be hard to spot.

teskje (Contributor) replied:

For my own understanding: The connection vec contains one entry for each output, specifying how the input is connected to that output. Here we construct connection for input i to be:

[ [0],      [],     [],       ...  [0] ]
  progress  health  output 1  ...  output i ... output n

I would have expected connection to also have empty frontier entries for the outputs between i and n. I assume we can skip those because missing entries default to the empty frontier?
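For concreteness, here is a sketch of how such a connection vector could be built fully explicitly, following the layout in the diagram above (progress output first, then health, then one entry per data output). The helper name and the u64 summary type are illustrative only, not the actual operator code:

use timely::progress::Antichain;

/// Build the connection vector for input i of an operator with num_outputs
/// data outputs. Input i is connected (with a zero path summary) to the
/// progress output and to its own data output, and disconnected (empty
/// antichain) from the health output and every other data output.
fn connection_for_input(i: usize, num_outputs: usize) -> Vec<Antichain<u64>> {
    let mut connection = Vec::with_capacity(2 + num_outputs);
    connection.push(Antichain::from_elem(0)); // progress: connected
    connection.push(Antichain::new()); // health: disconnected
    for output in 0..num_outputs {
        if output == i {
            connection.push(Antichain::from_elem(0)); // this input's own output
        } else {
            connection.push(Antichain::new()); // all other outputs disconnected
        }
    }
    connection
}

Whether the trailing empty entries can be omitted, as in the shorter vector shown in the diagram, comes down to the Timely default for missing entries that the question above asks about.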

petrosagg marked this pull request as ready for review December 18, 2024 09:12
petrosagg requested a review from a team as a code owner December 18, 2024 09:12
This test provides questionable value and requires modifying code
behavior deep in the dataflow to trigger the condition it wants to test.
Hence it is removed. I put this as a separate commit in case we change
our minds during review and want to keep it.

Signed-off-by: Petros Angelatos <[email protected]>
Signed-off-by: Petros Angelatos <[email protected]>
teskje (Contributor) left a comment:

I think this is a great simplification of how source exports are managed. Thanks for the helpful comments!

Comment on lines -1719 to -1749
def workflow_pg_snapshot_partial_failure(c: Composition) -> None:
    """Test PostgreSQL snapshot partial failure"""

    c.down(destroy_volumes=True)

    with c.override(
        # Start postgres for the pg source
        Testdrive(no_reset=True),
        Clusterd(
            name="clusterd1",
            environment_extra=["FAILPOINTS=pg_snapshot_pause=return(2)"],
        ),
    ):
        c.up("materialized", "postgres", "clusterd1")

        c.run_testdrive_files("pg-snapshot-partial-failure/01-configure-postgres.td")
        c.run_testdrive_files("pg-snapshot-partial-failure/02-create-sources.td")

        c.run_testdrive_files(
            "pg-snapshot-partial-failure/03-verify-good-sub-source.td"
        )

        c.kill("clusterd1")
        # Restart the storage instance with the failpoint off...
        with c.override(
            # turn off the failpoint
            Clusterd(name="clusterd1")
        ):
            c.run_testdrive_files("pg-snapshot-partial-failure/04-add-more-data.td")
            c.up("clusterd1")
            c.run_testdrive_files("pg-snapshot-partial-failure/05-verify-data.td")
teskje (Contributor) commented:

I couldn't infer what this test is about, but is it not relevant anymore?

Edit: Ah, saw the commit message. I don't have an opinion on the value of this test, but it's true that the failpoint in persist_sink is quite randomizing and the code is better without it.

)>,
Collection<
Child<'g, G, mz_repr::Timestamp>,
Result<SourceOutput<C::Time>, DataflowError>,
teskje (Contributor) commented:

Do we need to be worried about increasing the size of each item going through this collection? We often demultiplex oks/errs into separate collections because of that, but I'm not sure if a Result<SourceOutput, DataflowError> is actually larger than a SourceOutput.
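One quick way to sanity-check the sizes, using simplified stand-in types (the real SourceOutput and DataflowError are defined in the storage crates and will measure differently):

use std::mem::size_of;

// Simplified stand-ins for illustration only; not the real Materialize types.
#[allow(dead_code)]
struct SourceOutput {
    key: Vec<u8>,
    value: Vec<u8>,
    position: u64,
}

#[allow(dead_code)]
enum DataflowError {
    Decode(String),
    EnvelopeError(String),
}

fn main() {
    // Result<T, E> is at least as large as the larger of T and E; whether an
    // extra discriminant word is needed depends on niche optimization, so
    // printing the sizes for the real types is the easiest way to check.
    println!("SourceOutput:                        {}", size_of::<SourceOutput>());
    println!("DataflowError:                       {}", size_of::<DataflowError>());
    println!("Result<SourceOutput, DataflowError>: {}", size_of::<Result<SourceOutput, DataflowError>>());
}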
