[SPARK-52810][SDP][SQL] Spark Pipelines CLI Selection Options #51507
Closed: JiaqiWang18 wants to merge 17 commits into apache:master from JiaqiWang18:SPARK-52810-pipelines-cli-refresh-options
Commits (17, all by jackywang-db):

- b24698c cli args
- 1d6ec2c working 1 test
- b8ef4c3 2 test pass
- 7fbe8e7 more tests
- 109de10 add server side validation
- 7e990c5 add validation tests
- e494a57 python cli tests
- 4105c97 modify backend tests
- e580bb4 test overhaul
- 4d26d77 fmt
- e94c85f fmt
- 695054f fmt
- 1abe5c3 fmt
- 1693ac5 address feedback
- e56d39b rename proto
- b04ac24 fmt
- f21d79f nit
```diff
@@ -28,7 +28,7 @@
 import yaml
 from dataclasses import dataclass
 from pathlib import Path
-from typing import Any, Generator, Mapping, Optional, Sequence
+from typing import Any, Generator, List, Mapping, Optional, Sequence

 from pyspark.errors import PySparkException, PySparkTypeError
 from pyspark.sql import SparkSession
```

```diff
@@ -217,8 +217,36 @@ def change_dir(path: Path) -> Generator[None, None, None]:
     os.chdir(prev)


-def run(spec_path: Path) -> None:
-    """Run the pipeline defined with the given spec."""
+def run(
+    spec_path: Path,
+    full_refresh: Sequence[str],
+    full_refresh_all: bool,
+    refresh: Sequence[str],
+) -> None:
+    """Run the pipeline defined with the given spec.
+
+    :param spec_path: Path to the pipeline specification file.
+    :param full_refresh: List of datasets to reset and recompute.
+    :param full_refresh_all: Perform a full graph reset and recompute.
+    :param refresh: List of datasets to update.
+    """
+    # Validate conflicting arguments
+    if full_refresh_all:
+        if full_refresh:
+            raise PySparkException(
+                errorClass="CONFLICTING_PIPELINE_REFRESH_OPTIONS",
+                messageParameters={
+                    "conflicting_option": "--full_refresh",
+                },
+            )
+        if refresh:
+            raise PySparkException(
+                errorClass="CONFLICTING_PIPELINE_REFRESH_OPTIONS",
+                messageParameters={
+                    "conflicting_option": "--refresh",
+                },
+            )
+
     log_with_curr_timestamp(f"Loading pipeline spec from {spec_path}...")
     spec = load_pipeline_spec(spec_path)
```

```diff
@@ -242,20 +270,52 @@ def run(spec_path: Path) -> None:
     register_definitions(spec_path, registry, spec)

     log_with_curr_timestamp("Starting run...")
-    result_iter = start_run(spark, dataflow_graph_id)
+    result_iter = start_run(
+        spark,
+        dataflow_graph_id,
+        full_refresh=full_refresh,
+        full_refresh_all=full_refresh_all,
+        refresh=refresh,
+    )
     try:
         handle_pipeline_events(result_iter)
     finally:
         spark.stop()


+def parse_table_list(value: str) -> List[str]:
+    """Parse a comma-separated list of table names, handling whitespace."""
+    return [table.strip() for table in value.split(",") if table.strip()]
+
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Pipeline CLI")
     subparsers = parser.add_subparsers(dest="command", required=True)

     # "run" subcommand
-    run_parser = subparsers.add_parser("run", help="Run a pipeline.")
+    run_parser = subparsers.add_parser(
+        "run",
+        help="Run a pipeline. If no refresh options specified, "
+        "a default incremental update is performed.",
+    )
     run_parser.add_argument("--spec", help="Path to the pipeline spec.")
+    run_parser.add_argument(
+        "--full-refresh",
+        type=parse_table_list,
+        action="extend",
+        help="List of datasets to reset and recompute (comma-separated).",
+        default=[],
+    )
+    run_parser.add_argument(
+        "--full-refresh-all", action="store_true", help="Perform a full graph reset and recompute."
+    )
+    run_parser.add_argument(
+        "--refresh",
+        type=parse_table_list,
+        action="extend",
+        help="List of datasets to update (comma-separated).",
+        default=[],
+    )
```

Reviewer comment on the `--full-refresh` argument (marked resolved): "Here and below, should we document default behavior if this arg is not specified at all?"

```diff
@@ -283,6 +343,11 @@ def run(spec_path: Path) -> None:
     else:
         spec_path = find_pipeline_spec(Path.cwd())

-    run(spec_path=spec_path)
+    run(
+        spec_path=spec_path,
+        full_refresh=args.full_refresh,
+        full_refresh_all=args.full_refresh_all,
+        refresh=args.refresh,
+    )
     elif args.command == "init":
         init(args.name)
```
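The `--full-refresh` and `--refresh` flags combine `type=parse_table_list` with `action="extend"`, so a flag may be repeated and each occurrence may carry a comma-separated list, all accumulating into one flat list. A minimal standalone sketch of that wiring (the parser and table names here are hypothetical, not the actual Spark CLI):

```python
import argparse
from typing import List


def parse_table_list(value: str) -> List[str]:
    """Parse a comma-separated list of table names, handling whitespace."""
    return [table.strip() for table in value.split(",") if table.strip()]


# Hypothetical parser mirroring the diff's argument wiring.
parser = argparse.ArgumentParser(description="Demo of refresh-option parsing")
parser.add_argument("--full-refresh", type=parse_table_list, action="extend", default=[])

# Each occurrence is split on commas, stripped, and extended into one list.
args = parser.parse_args(["--full-refresh", "sales, users", "--full-refresh", "events"])
print(args.full_refresh)  # ['sales', 'users', 'events']
```

One caveat with this pattern: `action="extend"` mutates the mutable `default=[]` in place, so reusing the same parser object for a second `parse_args` call would accumulate values across calls; that is harmless for a one-shot CLI entry point like this one.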
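The conflict check added to `run()` can be exercised in isolation; a minimal sketch, substituting a plain `ValueError` for `PySparkException` so it runs without PySpark installed:

```python
from typing import Sequence


def validate_refresh_args(
    full_refresh: Sequence[str], full_refresh_all: bool, refresh: Sequence[str]
) -> None:
    """Reject dataset selections that conflict with a full graph reset."""
    if full_refresh_all:
        if full_refresh:
            raise ValueError(
                "CONFLICTING_PIPELINE_REFRESH_OPTIONS: --full_refresh "
                "cannot be combined with --full-refresh-all"
            )
        if refresh:
            raise ValueError(
                "CONFLICTING_PIPELINE_REFRESH_OPTIONS: --refresh "
                "cannot be combined with --full-refresh-all"
            )


# Selections without --full-refresh-all are fine; with it, they conflict.
validate_refresh_args(["sales"], False, ["users"])
try:
    validate_refresh_args(["sales"], True, [])
except ValueError as e:
    print(e)
```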
Reviewer: High-level question: did we consider putting refresh selection options in the pipeline spec, rather than as a CLI arg? More generally, what's the philosophy for whether a configuration should be accepted as a CLI arg vs. a pipeline spec field?

Reply: If we expect it to vary across runs for the same pipeline, it should be a CLI arg. If we expect it to be static for a pipeline, it should live in the spec. I would expect selections to vary across runs.