
🌊 Refactor API control flow for stream management#211696

Merged
miltonhultgren merged 88 commits into elastic:main from
miltonhultgren:streams-control-flow-poc
Apr 8, 2025

Conversation


@miltonhultgren miltonhultgren commented Feb 19, 2025

Background

This PR proposes a different way to structure the Streams code flow, based on challenges faced while working on https://github.com/elastic/streams-program/issues/26 and discussed here and here, mainly around the difficulty of deciding where to place certain validations that need access to the state as a whole.
It also responds to reported difficulty in adding new stream types to the code base.

It aims to achieve 3 goals:

  1. It is easy to add new stream types, and there is a clear place where changes (new validation, new logic) to existing stream types happen, making the code easier to evolve over time
  2. It is easier to improve the robustness of the system because there are clear phases where problems can be caught, fixed, and rolled back
  3. It lays some groundwork for features such as bulk changes, dry runs, and a health endpoint

In the future, this will most likely be handled by Elasticsearch to a large degree, as imagined in https://github.com/elastic/streams-program/discussions/30

The solution takes inspiration from the reconciliation / controller pattern that Kubernetes uses, where users specify a desired state and the system takes action towards reaching that state. It is also somewhat similar to how React's Virtual DOM works, in that reconciliation happens in a single pass.

Another key pattern is the Active Record pattern: each stream class contains all the logic for how to validate and modify that stream in Elasticsearch. The client and State class simply orchestrate the flow but defer all actual work and decision making to the stream classes.

Note: This PoC ignores the management of assets

Summary

The process takes the following steps:

  1. A route accepts a request (upsert / delete) and translates it into one or more (for bulk requests) StreamChange objects before passing these to the State.applyChanges method (which also takes a toggle for dry runs)
  2. The current state of Streams is loaded via the State class
  3. The changes are applied to the current state to derive the desired state [1]
  4. The desired state is validated by asking each individual stream whether, given the desired state and the starting state, it is in a valid state from its own perspective (upserted or deleted correctly)
  5. If the state is invalid, we return those errors and stop
  6. Otherwise we continue; if it's a dry run, we ask the desired state object what has changed and report that in the shape of the Elasticsearch actions that would be attempted
  7. Otherwise we proceed to commit the changes to Elasticsearch by asking each changed stream to determine which Elasticsearch actions need to be performed to reach the desired state
  8. These actions are combined and sent to the ExecutionPlan class, which plans them (mainly around actions for Unwired streams) and then executes them with as much parallelism as is safe, in a safe order
  9. If any error happens, we attempt to revert to the starting state by taking the changed streams, marking each as created based on the starting state, deriving the Elasticsearch actions for that, and applying those
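The steps above can be sketched roughly as follows. This is an illustrative assumption, not the PR's actual code: the names StreamChange, State, and applyChanges come from the description, but the shapes, the placeholder validation, and the string-based "actions" are invented for the sketch.

```typescript
// Minimal sketch of the applyChanges control flow (steps 2-7 above).
type StreamChange =
  | { kind: 'upsert'; name: string; definition: Record<string, unknown> }
  | { kind: 'delete'; name: string };

interface StreamRecord {
  name: string;
  definition: Record<string, unknown>;
  changeStatus: 'unchanged' | 'upserted' | 'deleted';
}

type State = Map<string, StreamRecord>;

interface ApplyResult {
  status: 'applied' | 'invalid' | 'dry-run';
  errors: string[];
  plannedActions: string[];
}

function applyChanges(current: State, changes: StreamChange[], dryRun: boolean): ApplyResult {
  // Steps 2-3: clone the current state, then apply the changes to derive
  // the desired state, marking every stream that was touched.
  const desired: State = new Map(
    [...current].map(([name, record]) => [name, { ...record, changeStatus: 'unchanged' as const }])
  );
  for (const change of changes) {
    if (change.kind === 'delete') {
      const existing = desired.get(change.name);
      if (existing) existing.changeStatus = 'deleted';
    } else {
      desired.set(change.name, {
        name: change.name,
        definition: change.definition,
        changeStatus: 'upserted',
      });
    }
  }

  // Steps 4-5: each stream validates itself against both states; a trivial
  // invariant stands in for the real per-stream validation here.
  const errors = [...desired.values()]
    .filter((s) => s.changeStatus === 'upserted' && s.name.length === 0)
    .map(() => 'invalid stream name');
  if (errors.length > 0) return { status: 'invalid', errors, plannedActions: [] };

  // Steps 6-7: derive the Elasticsearch actions each changed stream needs;
  // a dry run only reports them, a real run would hand them to the planner.
  const plannedActions = [...desired.values()]
    .filter((s) => s.changeStatus !== 'unchanged')
    .map((s) => `${s.changeStatus === 'deleted' ? 'delete' : 'upsert'} ${s.name}`);
  return { status: dryRun ? 'dry-run' : 'applied', errors: [], plannedActions };
}
```

Note how the starting state is never mutated, which is what makes the rollback in step 9 possible: the original records are still available to derive "recreate" actions from.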

This PR also changes our resync endpoint to make use of the same rough strategy (load current state, mark all as created, get Elasticsearch actions and apply).

[1] Applying changes:

  1. The current state is first cloned
  2. For each change we determine whether it is a deletion or an upsert
  3. Based on this we either mark existing streams for deletion, or create new streams / update existing ones
  4. When creating a new stream instance we use the helper streamFromDefinition, which is the only mapping between the definition documents and the Active Record-style stream classes
  5. As part of this, each stream that changes is marked in the desired state
  6. The stream is passed the desired and current state and should update itself based on the change
  7. The stream can return a set of cascading changes (taking the same format as the requested changes), which are executed directly afterwards, but there is a limit on how many rounds of cascading changes can happen, to avoid infinite loops
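The cascading-changes loop with a round limit can be sketched like this. The queue shape, the MAX_CASCADE_ROUNDS value, and the function names are assumptions for illustration; only the general mechanism (apply a round, collect cascades, bound the number of rounds) comes from the description above.

```typescript
type Change = { name: string };
// Applying one change may return follow-up (cascading) changes.
type ApplyFn = (change: Change) => Change[];

// Assumed limit; the real limit in the PR may differ.
const MAX_CASCADE_ROUNDS = 10;

function applyWithCascades(initial: Change[], apply: ApplyFn): Change[] {
  const applied: Change[] = [];
  let queue = initial;
  for (let round = 0; round < MAX_CASCADE_ROUNDS && queue.length > 0; round++) {
    const cascades: Change[] = [];
    for (const change of queue) {
      applied.push(change);
      // Cascading changes are collected and run in the next round.
      cascades.push(...change && apply(change));
    }
    queue = cascades;
  }
  if (queue.length > 0) {
    throw new Error('Exceeded maximum rounds of cascading changes');
  }
  return applied;
}
```

Because cascades are queued per round rather than applied recursively, a cycle of streams triggering each other exhausts the round budget and fails loudly instead of looping forever.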

Adding new stream types

Key to all of this is that the client and State classes don't know anything about the specific stream types; they know only the StreamActiveRecord interface.
To add a new stream type, you implement this interface and update streamFromDefinition to create the right class for your new definition. Streams of different types should only interact with each other by creating cascading changes.
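A rough sketch of what this contract could look like, based on the description above. All method names, signatures, and the Definition shape are assumptions for illustration, not the PR's actual code.

```typescript
interface State {
  has(name: string): boolean;
}
interface ValidationResult {
  isValid: boolean;
  errors: Error[];
}
interface ElasticsearchAction {
  type: string;
  request: unknown;
}
type StreamChange = { kind: 'upsert' | 'delete'; name: string };

interface StreamActiveRecord {
  readonly name: string;
  // Update internal state from a change; may return cascading changes.
  applyChange(change: StreamChange, desired: State, starting: State): StreamChange[];
  // Is this stream valid from its own perspective, given both states?
  validate(desired: State, starting: State): Promise<ValidationResult>;
  // Which Elasticsearch actions move this stream to its desired state?
  determineElasticsearchActions(desired: State, starting: State): ElasticsearchAction[];
}

type Definition = { name: string; type: 'wired' | 'unwired' | 'group' };

// Shared no-op base class so the sketch stays short; each real stream type
// would implement its own validation and action logic.
class BaseStream implements StreamActiveRecord {
  constructor(protected definition: Definition) {}
  get name(): string {
    return this.definition.name;
  }
  applyChange(): StreamChange[] {
    return [];
  }
  async validate(): Promise<ValidationResult> {
    return { isValid: true, errors: [] };
  }
  determineElasticsearchActions(): ElasticsearchAction[] {
    return [];
  }
}
class WiredStream extends BaseStream {}
class UnwiredStream extends BaseStream {}
class GroupStream extends BaseStream {}

// The single mapping from definition documents to stream classes: adding a
// new stream type means implementing the interface and adding a branch here.
function streamFromDefinition(definition: Definition): StreamActiveRecord {
  switch (definition.type) {
    case 'wired':
      return new WiredStream(definition);
    case 'unwired':
      return new UnwiredStream(definition);
    case 'group':
      return new GroupStream(definition);
  }
}
```

The point of the factory is that it is the only type-aware switch; everything upstream works against StreamActiveRecord alone.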

Possible follow up tasks

  • Introduce a lazy Elasticsearch cluster state cache, because multiple places in the code access the same information over and over again
  • Make API endpoints that consume attemptChanges pass back the DesiredState and planned ElasticsearchActions as debug information based on a flag (maybe also all cascading changes)
  • Don't run cascading changes by default, but run them if some flag is submitted, based on https://github.com/elastic/streams-program/discussions/230
  • Wrap attemptChanges and resync with the new LockManager ([Obs AI Assistant] Distributed lock manager #216397)
  • Unit test WiredStream, UnwiredStream and GroupStream
  • Clean up old sync helpers
  • Wrap ES calls to get better stack traces for errors

Out of scope

  • Asset linking and content pack installation (it's probably okay for these to continue to use the asset client directly since there is less domain logic and no cascading changes involved)

@miltonhultgren miltonhultgren changed the title from "🌊 State management and control flow PoC" to "[PoC] 🌊 State management and control flow" on Feb 24, 2025

@gsoldevila gsoldevila left a comment


Core changes LGTM


flash1293 commented Apr 4, 2025

This looks mostly good to me, one thing I found is the following:

When trying to create a classic stream that doesn't exist as data stream, it fails super late:

PUT kbn:/api/streams/logs-nonexisting-xxx
{
  "stream": {
    "ingest": {
      "lifecycle": {
        "inherit": {}
      },
      "processing": [
        {
          "grok": {
            "if": {
              "always": {}
            },
            "ignore_failure": true,
            "field": "message",
            "patterns": [
              "%{WORD:xxx}"
            ],
            "pattern_definitions": {},
            "ignore_missing": true
          }
        }
      ],
      "unwired": {}
    }
  },
  "dashboards": []
}

Returns

{
  "statusCode": 500,
  "error": "Internal Server Error",
  "message": """Failed to rollback attempted changes: Failed to determine Elasticsearch actions: index_not_found_exception
	Root causes:
		index_not_found_exception: no such index [logs-nonexisting-xxx]. Original error: FailedToDetermineElasticsearchActionsError: Failed to determine Elasticsearch actions: index_not_found_exception
	Root causes:
		index_not_found_exception: no such index [logs-nonexisting-xxx]""",
  "attributes": {
    "data": null
  }
}

What about adding something to the upsert validation that checks whether the underlying data stream exists if this._processingChanged || this._lifeCycleChanged? Attempting this is not supported functionality and shouldn't fail with a 500, but with a 400.

Love the separate validation hooks btw, much easier to follow.
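A sketch of the early validation suggested here: reject upserts that change processing or lifecycle on a classic (Unwired) stream whose backing data stream doesn't exist, so the route can answer 400 instead of a 500 surfacing from the Elasticsearch action phase. The function name, parameter names, and the injected dataStreamExists lookup are assumptions, not the PR's actual code.

```typescript
interface ValidationResult {
  isValid: boolean;
  errors: Error[];
}

async function validateUnwiredUpsert(
  name: string,
  processingChanged: boolean,
  lifecycleChanged: boolean,
  // Injected lookup; in Kibana this would query Elasticsearch.
  dataStreamExists: (name: string) => Promise<boolean>
): Promise<ValidationResult> {
  // Only changes that require a backing data stream trigger the check.
  if ((processingChanged || lifecycleChanged) && !(await dataStreamExists(name))) {
    return {
      isValid: false,
      errors: [new Error(`Data stream ${name} does not exist`)],
    };
  }
  return { isValid: true, errors: [] };
}
```

Returning a ValidationResult from the validation phase lets the route map it to a 400, rather than letting the missing index surface as an index_not_found_exception during action execution.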

const existsInStartingState = startingState.has(this._definition.name);

if (!existsInStartingState) {
// TODO in this check, make sure the existing data stream is not a stream-created one (if it is, state might be out of sync, but we can fix it)

should we still do this? I think I put this here. It's about the case where the data stream exists but we created it: then it's probably fine to "grandfather" it in. But we can also leave this out for now

Contributor Author


I'll skip it for now

desiredState: State,
startingState: State
): Promise<ValidationResult> {
return { isValid: true, errors: [] };

Just a note: This makes me think adding the "force=false" query param is not the only change we need to do - here the validation hook relies on the cascading changes doing its thing, which is probably OK for now, but we will need to check through all the validation logic whether it relies on it this way.

@miltonhultgren
Copy link
Contributor Author

@flash1293 Added the validation to the upsert of UnwiredStream

@miltonhultgren miltonhultgren requested a review from flash1293 April 5, 2025 11:28

@flash1293 flash1293 left a comment


LGTM after the last round of testing.

Only thing I'm not sure about is the block list in updateOrRolloverDataStream - would be great to get feedback from the ES team on that.

@elasticmachine

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id                   before  after  diff
@kbn/es-errors       6       8      +2
@kbn/streams-schema  374     377    +3
total                               +5

Unknown metric groups

API count

id                   before  after  diff
@kbn/es-errors       11      13     +2
@kbn/streams-schema  388     392    +4
total                               +6

ESLint disabled in files

id       before  after  diff
streams  1       3      +2

Total ESLint disabled count

id       before  after  diff
streams  3       5      +2

History

@miltonhultgren miltonhultgren merged commit fa23a90 into elastic:main Apr 8, 2025
9 checks passed
@kibanamachine

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/14331874800

@kibanamachine

💔 All backports failed

Branch  Result
8.x     Backport failed because of merge conflicts

You might need to backport the following PRs to 8.x:
- feat(streams): add significant events and queries API (#216221)

Manual backport

To create the backport manually run:

node scripts/backport --pr 211696

Questions?

Please refer to the Backport tool documentation

miltonhultgren added a commit that referenced this pull request Apr 9, 2025
Manual backport of #211696

---------

Co-authored-by: Joe Reuter <johannes.reuter@elastic.co>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>

Labels

  • backport:version (Backport to applied version labels)
  • Feature:Streams (This is the label for the Streams Project)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • v8.19.0
  • v9.1.0


8 participants