Conversation

Contributor

@JiaqiWang18 JiaqiWang18 commented Jul 16, 2025

What changes were proposed in this pull request?

We want to give users the ability to choose a subset of datasets (e.g. tables, materialized views) to include in a run, and the ability to specify whether they should run as a regular refresh or a full refresh.
The following arguments are added to the spark-pipelines CLI to achieve this:

--full-refresh: List of datasets to reset and recompute.

--full-refresh-all: Boolean, whether to perform a full graph reset and recompute.

--refresh: List of datasets to update.

If no options are specified, the default is to perform a refresh for all datasets in the pipeline.
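
Roughly, the intended combination semantics can be summarized as follows. This is an illustrative Python-style sketch only, inferred from the TableFilter logic further down in this PR, not the actual implementation:

```python
# Illustrative sketch (not the merged code) of how the three options combine.
def resolve_selection(full_refresh_all, full_refresh, refresh):
    if full_refresh_all:
        # Full graph reset: every dataset is fully refreshed.
        return {"full_refresh": "ALL", "refresh": "NONE"}
    if full_refresh or refresh:
        # Run only the listed datasets, each in its requested mode.
        return {"full_refresh": full_refresh or [], "refresh": refresh or []}
    # Default: regular refresh of every dataset in the pipeline.
    return {"full_refresh": [], "refresh": "ALL"}
```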

To enable the above:

  • new CLI options are added to the Python CLI
  • proto changes are made to allow passing them to Spark
  • changes in the Spark Pipelines codebase to use TableFilter to control graph refresh

Why are the changes needed?

These changes are needed because we want to give users the option to control what to run and how to run it for their pipelines.

Does this PR introduce any user-facing change?

Yes, new CLI options are being added. However, SDP hasn't been released yet, so no users should be impacted.

How was this patch tested?

Added a new test suite in the Python CLI to verify argument parsing.
Added a new test suite in the Scala codebase that uses the newly added CLI options to run a full pipeline and verify behavior.

Was this patch authored or co-authored using generative AI tooling?

No

@JiaqiWang18 JiaqiWang18 changed the title [WIP][SPARK-52810][SDP][SQL] Spark Pipelines CLI Selection Options [SPARK-52810][SDP][SQL] Spark Pipelines CLI Selection Options Jul 16, 2025
@JiaqiWang18
Contributor Author

@AnishMahto

@JiaqiWang18
Contributor Author

@sryza

Contributor

@AnishMahto AnishMahto left a comment

Flushing out some thoughts! Haven't looked at tests yet.

if full_refresh_all:
    if full_refresh:
        raise PySparkException(
            errorClass="CONFLICTING_PIPELINE_REFRESH_OPTIONS", messageParameters={}
Contributor

Thoughts on having sub error classes for mismatched combinations? Or maybe just pass along which two configs are conflicting as a message parameter?

Contributor Author

Added logic to pass along the conflicting option.
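
A rough sketch of what passing along the conflicting option could look like (the helper name and the "conflicting_option" parameter key are assumptions for illustration, not necessarily the merged code):

```python
# Hypothetical sketch: report which flag conflicts with --full-refresh-all.
from pyspark.errors import PySparkException

def validate_refresh_options(full_refresh_all, full_refresh, refresh):
    if full_refresh_all:
        for value, flag in ((full_refresh, "--full-refresh"), (refresh, "--refresh")):
            if value:
                raise PySparkException(
                    errorClass="CONFLICTING_PIPELINE_REFRESH_OPTIONS",
                    # assumed message-parameter key, for illustration only
                    messageParameters={"conflicting_option": flag},
                )
```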

result = []
for table_list in table_lists:
    result.extend(table_list)
return result if result else None
Contributor

If result is an empty list, do we still want to return None? Or should we just return the empty list? What is the implication of either here?

Contributor Author

Removed this by using the extend action in the arg parser to avoid creating a nested list.

"--full-refresh",
type=parse_table_list,
action="append",
help="List of datasets to reset and recompute (comma-separated).",
Contributor

Here and below, should we document default behavior if this arg is not specified at all?

Contributor

Will extend split using commas?

run(spec_path=spec_path)
run(
    spec_path=spec_path,
    full_refresh=flatten_table_lists(args.full_refresh),
Contributor

Why do we need to flatten args.full_refresh and args.refresh? I thought we defined their types with the parse_table_list function, which returns List[str]

Contributor Author

This is for the case where the user provides the same arg multiple times.
Ex: (--full-refresh "a,b" --full-refresh "c,d"). Then we will receive a nested list [["a","b"],["c","d"]] and need to perform a flattening to transform it into a 1D list.

Contributor

Ah got it, makes sense

Contributor

If we were to mark this argument field as extend rather than append, would we still need to do any manual flattening?

Contributor Author

Very good point, extend creates a 1D list directly.
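
For readers unfamiliar with this argparse detail, a minimal standalone sketch (not the PR's actual parser) of why action="extend" removes the manual flattening: when the type callable already returns a list, extend adds its elements directly, whereas append nests them.

```python
# Minimal demo of argparse "append" vs "extend" with a list-returning type callable.
import argparse
from typing import List

def parse_table_list(value: str) -> List[str]:
    # Hypothetical comma-splitting helper mirroring the one discussed above.
    return [name.strip() for name in value.split(",") if name.strip()]

parser = argparse.ArgumentParser()
parser.add_argument(
    "--full-refresh",
    type=parse_table_list,
    action="extend",  # requires Python 3.8+
    help="List of datasets to reset and recompute (comma-separated).",
)

args = parser.parse_args(["--full-refresh", "a,b", "--full-refresh", "c,d"])
print(args.full_refresh)  # ['a', 'b', 'c', 'd'] -- already flat, no manual flattening

# With action="append" the same input would yield [['a', 'b'], ['c', 'd']],
# which is why the earlier revision needed flatten_table_lists.
```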

sessionHolder: SessionHolder): Unit = {
val dataflowGraphId = cmd.getDataflowGraphId
val graphElementRegistry = DataflowGraphRegistry.getDataflowGraphOrThrow(dataflowGraphId)

Contributor

Can we extract all this added logic to deduce the full refresh and regular refresh table filters into its own function? And then, as part of the Scala docs, map the expected filter results depending on what combination of full refresh and partial refresh is selected?

Contributor Author

@JiaqiWang18 JiaqiWang18 Jul 17, 2025

extracted a createTableFilters function

Comment on lines 259 to 284
if (refreshTables.nonEmpty && fullRefreshTables.nonEmpty) {
  // check if there is an intersection between the subset
  val intersection = refreshTableNames.intersect(fullRefreshTableNames)
  if (intersection.nonEmpty) {
    throw new IllegalArgumentException(
      "Datasets specified for refresh and full refresh cannot overlap: " +
        s"${intersection.mkString(", ")}")
  }
}

val fullRefreshTablesFilter: TableFilter = if (fullRefreshAll) {
  AllTables
} else if (fullRefreshTables.nonEmpty) {
  SomeTables(fullRefreshTableNames)
} else {
  NoTables
}

val refreshTablesFilter: TableFilter =
  if (refreshTables.nonEmpty) {
    SomeTables(refreshTableNames)
  } else if (fullRefreshTablesFilter != NoTables) {
    NoTables
  } else {
    AllTables
  }
Contributor

just an optional nit, but as a code reader it's difficult for me to reason about the combinations of fullRefreshTables and refreshTables when reading them as sequential but related validation here.

My suggestion would be to restructure this as a match statement that explicitly handles each combination. Ex:

(fullRefreshTables, refreshTableNames) match {
  case (Nil, Nil) => ...
  case (fullRefreshTables, Nil) => ...
  case ...
}

Contributor Author

@JiaqiWang18 JiaqiWang18 Jul 17, 2025

extracted a createTableFilters function

from dataclasses import dataclass
from pathlib import Path
from typing import Any, Generator, Mapping, Optional, Sequence
from typing import Any, Generator, Mapping, Optional, Sequence, List
Contributor

Out of alphabetical order: you may need to run dev/reformat-python to format this.

Contributor Author

Actually, it didn't reformat this, but I manually reordered it.


def run(spec_path: Path) -> None:
    """Run the pipeline defined with the given spec."""
def run(
Contributor

If we expect it to vary across runs for the same pipeline, it should be a CLI arg. If we expect it to be static for a pipeline, it should live in the spec. I would expect selections to vary across runs.
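
For context, a hedged sketch of what the extended run() signature might look like, with parameter names inferred from the CLI flags and the call site shown earlier (not necessarily the exact merged signature):

```python
from pathlib import Path
from typing import List, Optional

def run(
    spec_path: Path,
    full_refresh: Optional[List[str]] = None,  # datasets to reset and recompute
    full_refresh_all: bool = False,             # reset and recompute the whole graph
    refresh: Optional[List[str]] = None,        # datasets to update
) -> None:
    """Run the pipeline defined by the given spec, optionally restricting the run
    to the selected datasets."""
    ...
```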

not should_test_connect or not have_yaml,
connect_requirement_message or yaml_requirement_message,
)
class CLIValidationTests(unittest.TestCase):
Contributor

Is there a meaningful difference between the kinds of tests that are included in this class and the kinds of tests that are included in the other class in this file?

Contributor Author

yeah I think they can be combined into one.

@JiaqiWang18 JiaqiWang18 requested review from AnishMahto and sryza July 17, 2025 18:26
Contributor

@sryza sryza left a comment

LGTM!

@sryza sryza closed this in 9204b05 Jul 17, 2025
Contributor

sryza commented Jul 17, 2025

Merged to master

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

To use the latest bug fixes and new messages when developing new features for `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>