
Refactor Spark Python Executor #1231

Merged: 10 commits into main from eng-2388-reduce-maintenance-overhead-of-spark on Apr 24, 2023

Conversation

@hsubbaraj-spiral (Contributor) commented Apr 20, 2023:

Describe your changes and why you are making these changes

This PR refactors the Spark Python executors and reduces their maintenance overhead.

The main technical challenge was sharing code between functions with nearly identical definitions but one additional parameter, a SparkSession. We also want to avoid importing pyspark in the normal codepaths that don't execute in Spark environments. To do this, we pass the differing functions in as parameters (read_artifacts, write_artifact, infer_type_artifact, and setup_connector) and send the SparkSession object as a kwarg. This way the generic function passed in can be called in the same manner by both the Spark and non-Spark codepaths.
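
As a rough sketch of this pattern (run_helper, the *_func parameters, and the spark_session_obj kwarg match the snippets reviewed below; the function body and the utils/spark_utils call sites are illustrative assumptions, not the actual implementation):

    from typing import Any, Callable

    def run_helper(
        spec: Any,
        read_artifacts_func: Callable[..., Any],
        write_artifact_func: Callable[..., Any],
        **kwargs: Any,  # the only expected key is spark_session_obj, set by the Spark entrypoint
    ) -> None:
        # The injected helpers hide whether we are running under Spark, so this
        # shared code never imports pyspark: Spark-aware helpers read the session
        # out of kwargs, while non-Spark helpers are called with kwargs empty.
        inputs = read_artifacts_func(spec, **kwargs)
        write_artifact_func(spec, inputs, **kwargs)

    # Hypothetical call sites:
    #   non-Spark: run_helper(spec, utils.read_artifacts, utils.write_artifact)
    #   Spark:     run_helper(spec, spark_utils.read_artifacts, spark_utils.write_artifact,
    #                         spark_session_obj=session)

Because the shared module only ever forwards **kwargs, it never has to name, or import, any pyspark type.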

Related issue number (if any)

Loom demo (if any)

Checklist before requesting a review

  • I have created a descriptive PR title. The PR title should complete the sentence "This PR...".
  • I have performed a self-review of my code.
  • I have included a small demo of the changes. For the UI, this would be a screenshot or a Loom video.
  • If this is a new feature, I have added unit tests and integration tests.
  • I have run the integration tests locally and they are passing.
  • I have run the linter script locally (See python3 scripts/run_linters.py -h for usage).
  • All features on the UI continue to work correctly.
  • Added one of the following CI labels:
    • run_integration_test: Runs integration tests
    • skip_integration_test: Skips integration tests (Should be used when changes are ONLY documentation/UI)

@hsubbaraj-spiral added the run_integration_test (Triggers integration tests) label on Apr 20, 2023
def run_helper(
spec: Spec,
read_artifacts_func: Any,
write_artifact_func: Any,
Contributor commented:

If we are already passing function objects in, does it make sense to pass different function objects based on whether it's Spark or not? That way we wouldn't even need to pass is_spark and the other flags as arguments.

For example, could we do something like

if is_spark:
    run_helper(
        spec,
        read_artifacts_func=utils.read_spark_artifacts,
        write_artifact_func=utils.write_spark_artifacts,
        ...
    )

@likawind (Contributor) left a comment:

Looks great, this is a huge improvement! I have some minor comments, but I don't think we have to spend too much time iterating on this.

read_artifacts_func: Any,
write_artifact_func: Any,
setup_connector_func: Any,
is_spark: bool,
Contributor commented:

If we have to pass this arg around, does it make sense to explicitly pass a single Optional[SparkSession] object to decide whether Spark is enabled? I'd also like to remove **kwargs, since it's not clear what to expect or how it's used. That pattern is more useful in cases like decorators, where we support arbitrary inputs, which is not the case here.

Contributor (PR author) replied:

This is the correct pattern; however, it requires that we import pyspark.sql in the regular code path, which we want to avoid. The kwargs pattern was chosen precisely to avoid that import in non-Spark environments.
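
To make the trade-off concrete, a minimal sketch (the file names and the get_spark_session helper are hypothetical, not part of this PR):

    # executor/shared.py (hypothetical): with an explicit Optional[SparkSession]
    # parameter, this module would need
    #     from pyspark.sql import SparkSession
    # at the top, which executes on import even in non-Spark environments. With
    # the **kwargs pattern, no pyspark import appears here at all; the session,
    # if present, arrives as kwargs["spark_session_obj"] and stays opaque.

    # executor/spark_entry.py (hypothetical): the only place pyspark is imported.
    from pyspark.sql import SparkSession

    def get_spark_session() -> SparkSession:
        # Builds the session that the Spark executor forwards to run_helper
        # via the spark_session_obj kwarg.
        return SparkSession.builder.getOrCreate()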

@kenxu95 (Contributor) left a comment:

Just some stylistic/readability concerns from me!

- write_artifact_func: function used to write artifacts to storage layer
- setup_connector_func: function to use to setup the connectors
- is_spark Whether or not we are running in a Spark env.
The only kwarg we expect is spark_session_obj
Contributor commented:

nit: Can we use the same docstring format as we do in the SDK (see the ones in client.py)?
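
For reference, one way the Args section above could be laid out; the exact convention in client.py isn't shown in this diff, so the Google-style block below is only an assumption:

    from typing import Any, Callable

    def run_helper(
        spec: Any,
        read_artifacts_func: Callable[..., Any],
        write_artifact_func: Callable[..., Any],
        setup_connector_func: Callable[..., Any],
        is_spark: bool,
        **kwargs: Any,
    ) -> None:
        """Executes the operator described by `spec` using the injected helpers.

        Args:
            read_artifacts_func: Function used to read input artifacts from the storage layer.
            write_artifact_func: Function used to write artifacts to the storage layer.
            setup_connector_func: Function used to set up the connectors.
            is_spark: Whether or not we are running in a Spark env.
            **kwargs: The only expected kwarg is spark_session_obj.
        """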

)


def run_helper(
Contributor commented:

Can we call this execute_function_spec() or something? run_helper() doesn't tell me very much.

Contributor commented:

Same thing for the other run_helper() -> execute_data_spec() or something

@hsubbaraj-spiral merged commit e918800 into main on Apr 24, 2023
@hsubbaraj-spiral deleted the eng-2388-reduce-maintenance-overhead-of-spark branch on April 24, 2023 at 17:05
Labels: run_integration_test (Triggers integration tests)
3 participants