
984: Refactoring: separate planning and execution phases in key dataset manipulation services (#994)

Conversation

zaychenko-sergei (Contributor) commented Dec 17, 2024

Description

Closes: #984

Extracted a "planner" and an "executor" for compaction, reset, set watermark, and push ingest, and partially for polling ingest.
Renamed long-running "executors" to "agents".
Introduced MetadataQueryService to absorb simple queries that, from the interface point of view, do not have to be defined at the level of the metadata chain.
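The planner/executor split described above can be sketched roughly as follows. All names and signatures here are illustrative, not the actual kamu APIs (the real services are async and operate on dataset handles); the point is that the planner reads state and produces an immutable plan, while the executor only applies it:

```rust
// Hypothetical sketch of the planning/execution split. The plan is a pure
// value: it can be inspected, logged, or unit-tested without side effects.

/// Output of the planning phase; purely descriptive, no side effects.
#[derive(Debug, PartialEq)]
struct CompactionPlan {
    old_head: String,       // HEAD read during planning
    slices_to_merge: usize, // how much work the executor will do
}

/// Planning phase: reads current state and decides what to do.
struct CompactionPlanner;

impl CompactionPlanner {
    fn plan(&self, current_head: &str, data_slices: usize) -> CompactionPlan {
        CompactionPlan {
            old_head: current_head.to_string(),
            slices_to_merge: data_slices,
        }
    }
}

/// Execution phase: applies a previously computed plan.
struct CompactionExecutor;

impl CompactionExecutor {
    fn execute(&self, plan: &CompactionPlan) -> String {
        // In a real service this would rewrite data files and commit a new
        // block; here we just derive a new "head" to show the phase boundary.
        format!("{}-compacted-{}", plan.old_head, plan.slices_to_merge)
    }
}
```

One benefit of this shape is testability: the planner can be exercised without any storage, and the executor receives everything it needs inside the plan.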

Checklist before requesting a review

  • Unit and integration tests added
  • Compatibility:
    • Network APIs: ✅
    • Workspace layout and metadata: ✅
    • Configuration: ✅
    • Container images: ✅
  • Observability:
    • Tracing / metrics: ✅
    • Alerts: ✅
  • Documentation:
  • Downstream effects:

src/infra/ingest-datafusion/src/writer.rs (review thread; outdated, resolved)
The review thread below refers to this excerpt from the diff:

```rust
#[async_trait::async_trait]
impl MetadataQueryService for MetadataQueryServiceImpl {
```
Comment (Member):
Would these methods be better suited in MetadataChainExt extension trait?

To me a service encapsulates complex interactions between several components, but here we have nothing but a stateless wrapper on top of MetadataChain's visitor interface.

MetadataChain is also the place where we validate complex domain rules about which sequences of events are valid and which aren't (e.g. two active push sources with the same name), so I don't think the argument that this is some "higher-level logic" works in this case either.
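The reviewer's alternative, putting such queries on an extension trait rather than a separate service, could look roughly like this. All trait and method names are illustrative placeholders, not the real kamu traits:

```rust
// Hypothetical sketch of the extension-trait alternative: derived queries
// expressed directly over any metadata chain via a blanket impl.

trait MetadataChain {
    /// Event kinds in the chain, oldest first (simplified to strings here).
    fn blocks(&self) -> &[&'static str];
}

/// Extension trait: default methods answer derived questions about the chain.
trait MetadataChainExtSketch: MetadataChain {
    fn has_push_source(&self) -> bool {
        self.blocks().iter().any(|b| *b == "AddPushSource")
    }
}

// Blanket impl: every MetadataChain gets the helpers for free.
impl<T: MetadataChain + ?Sized> MetadataChainExtSketch for T {}

/// Toy chain implementation for demonstration.
struct InMemoryChain(Vec<&'static str>);

impl MetadataChain for InMemoryChain {
    fn blocks(&self) -> &[&'static str] {
        &self.0
    }
}
```

The trade-off is that every caller of the helper sees `MetadataChain` in scope, which is exactly what the author objects to below.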

Reply (Contributor, Author):

I'm not sure. I have a feeling that for high-level queries about a dataset's current state, we shouldn't be considering that there is some metadata chain behind it. We just want to know key characteristics of the dataset's current state. The fact that this maps to a MetadataChain algorithm is an implementation detail. Besides, these questions could be answered by a database in the future.
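The author's point, that the chain walk is an implementation detail behind a stable query interface, can be sketched as a trait with interchangeable backends. Again, all names here are hypothetical, not the actual `MetadataQueryService` API:

```rust
// Hypothetical sketch: callers ask about the dataset's current state and
// need not know whether the answer comes from walking a metadata chain or
// from an indexed database.

trait MetadataQuery {
    fn current_watermark(&self) -> Option<u64>;
}

/// Today: the answer is derived by scanning chain blocks, newest first.
struct ChainBackedQuery {
    block_watermarks: Vec<Option<u64>>, // per-block watermark, oldest first
}

impl MetadataQuery for ChainBackedQuery {
    fn current_watermark(&self) -> Option<u64> {
        // Walk from the newest block back until a watermark is found.
        self.block_watermarks.iter().rev().find_map(|w| *w)
    }
}

/// In the future: the same question answered from an indexed store.
struct DbBackedQuery {
    indexed_watermark: Option<u64>,
}

impl MetadataQuery for DbBackedQuery {
    fn current_watermark(&self) -> Option<u64> {
        self.indexed_watermark
    }
}
```

Callers depend only on `MetadataQuery`, so swapping the chain scan for a database lookup requires no changes on their side.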

Reply (Contributor, Author):

MetadataChainExt deals with a much lower level of abstraction. It is mostly about visiting the blocks one way or another, emphasizing explicit knowledge of the chain structure.

zaychenko-sergei force-pushed the 984-refactoring-separate-planning-and-execution-phases-in-key-dataset-manipulation-services branch from 00506be to 6abd459 on December 18, 2024 10:07
zaychenko-sergei merged commit 447c30c into master on Dec 18, 2024
0 of 6 checks passed
zaychenko-sergei deleted the 984-refactoring-separate-planning-and-execution-phases-in-key-dataset-manipulation-services branch on December 18, 2024 10:07
s373r added a commit that referenced this pull request Dec 19, 2024
* Do not show usage error for --all flag (#960)

* Do not show usage error for --all flag

When the --all flag is set for the `repo delete` command
and there are no repositories to delete, do not show a usage error.

* Improve args validation

* Improve args validation, e2e tests

* Typo corrected in feature flags (#974)

* Images, kamu-base-git: fix collision of executable files (#975)

* 868 api server provide feature flags for UI (#976)

Separated runtime and UI configuration flags. UI config is provided by API server too.

* Release v0.210.0 + minor deps

* 854 persistent storage of dataset dependencies graph (#973)

Dependency graph service moved to 'datasets' domain.
Defined dataset dependency repository interface and created 3 implementations.
No more postponed initialization, organized initial setup in the form of an indexer.
Added telemetry extensions on the way.
Tests for repositories, stabilized other tests.
Cascading effect on delete within the dataset entry domain.

* v0.211.0 + minor deps

* Fixed image building (#977)

Replaced cascade delete of dataset entries in graph with more explicit events to allow orphan upstream dependencies where only ID is given

* Upgrade to datafusion 43

* Use thiserror v2 throughout

* trust-dns-resolver => hickory-resolver + minor deps

* Fix non-sequential offsets on ingest

* 0.212.0

* Use KAMU_CONTAINER_RUNTIME_TYPE env var in Makefile (#991)

* Use KAMU_CONTAINER_RUNTIME_TYPE env var in Makefile
* Make podman default engine for e2e tests

* Backporting changes from Private Datasets feature branch (#992)

* Backport tweaks

* Add doc strings

* Remove unused deps

* Remove outdated test

* CHANGELOG.md: update

* Tips after self-review

* Delete env var on dataset delete (#993)

* Delete env var on dataset delete

* 984 refactoring separate planning and execution phases in key dataset manipulation services (#994)

* Draft split of `CompactionService` into planner and execution parts

* Compaction cleanups

* Compacting more cleanups

* Compacting: read old HEAD on planning phase

* Reset service split on planner and execution

* Extracted `MetadataQueryService` to query polling sources, push sources, and set transform, instead of doing so in the ingest/transform planners

* DataWriterMetadataState became part of polling ingest item at the planning phase

* Setting watermark: separate planner and execution service

* Push ingest service prepared for split

* Push ingest split into planning and executing

* Made some order in infra/core services

* {Flow,Task,Outbox}Executor=>Agent

* Unified naming of planners and executors

* Revised telemetry in refactored components

* Review: DataWriterDataFusionBuilder flattened

* changelog

* v0.123.0 + minor deps

* kamu-dev-base: include short commit hash as well (#995)

* v0.213.1: less aggressive telemetry with `DataWriterMetadataState`

---------

Co-authored-by: Andrii Demus <[email protected]>
Co-authored-by: Sergei Zaychenko <[email protected]>
Co-authored-by: Sergii Mikhtoniuk <[email protected]>
Co-authored-by: Roman Boiko <[email protected]>
Successfully merging this pull request may close these issues.

Refactoring: Separate planning and execution phases in key dataset manipulation services