Refactor Spark Python Executor #1231
Conversation
def run_helper(
    spec: Spec,
    read_artifacts_func: Any,
    write_artifact_func: Any,
If we are already passing function objects, does it make sense to pass different function objects based on whether it's Spark or not? That way we don't even need to pass is_spark and other stuff as arguments.
For example, could we do something like

if is_spark:
    run_helper(spec, read_artifact_func=utils.read_spark_artifacts, write_artifact_func=utils.write_spark_artifacts, ...)
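A minimal sketch of that dispatch idea with both branches spelled out (the module and helper names here are illustrative, not the actual ones in the repo):

# Choose the artifact I/O functions once at the entry point, so the shared
# helper never needs an is_spark flag.
if is_spark:
    run_helper(
        spec,
        read_artifacts_func=spark_utils.read_artifacts,
        write_artifact_func=spark_utils.write_artifact,
    )
else:
    run_helper(
        spec,
        read_artifacts_func=utils.read_artifacts,
        write_artifact_func=utils.write_artifact,
    )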
Looks great, this is a huge improvement! I have some minor comments, but I don't think we have to spend too much time iterating on this.
    read_artifacts_func: Any,
    write_artifact_func: Any,
    setup_connector_func: Any,
    is_spark: bool,
If we have to pass this arg around, does it make sense to explicitly pass a single Optional[spark.Session] object to decide whether Spark is enabled? I'd also like to remove **kwargs, since it's not clear what to expect or how it's used. This pattern is more useful in cases like decorators where we support arbitrary inputs, which is not the case here.
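A rough sketch of what that suggestion could look like (parameter names are illustrative; the session type would be pyspark.sql.SparkSession):

from typing import Any, Optional

from pyspark.sql import SparkSession

def run_helper(
    spec: Any,
    read_artifacts_func: Any,
    write_artifact_func: Any,
    setup_connector_func: Any,
    spark_session: Optional[SparkSession] = None,
) -> None:
    # Spark is enabled exactly when a session is provided, so no separate
    # is_spark flag and no **kwargs are needed.
    if spark_session is not None:
        ...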
This is the correct pattern; however, it requires that we import pyspark.sql in the regular code path, which we want to avoid. The kwargs pattern was chosen to avoid that import in non-Spark environments.
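For context, a minimal sketch of the kwargs pattern being described (signature details are illustrative; the only expected kwarg, spark_session_obj, comes from the docstring quoted below):

from typing import Any

# No pyspark import at module level, so non-Spark environments never need
# pyspark installed to load this module.
def run_helper(
    spec: Any,
    read_artifacts_func: Any,
    write_artifact_func: Any,
    setup_connector_func: Any,
    is_spark: bool = False,
    **kwargs: Any,
) -> None:
    # Only the Spark entry point passes spark_session_obj.
    spark_session = kwargs.get("spark_session_obj")
    ...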
Just some stylistic/readability concerns from me!
- write_artifact_func: function used to write artifacts to storage layer
- setup_connector_func: function to use to setup the connectors
- is_spark: Whether or not we are running in a Spark env.
  The only kwarg we expect is spark_session_obj
nit: Can we use the same docstring format as we do in the SDK (see the ones in client.py)?
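For illustration only, assuming the SDK docstrings use a Google-style Args: section (an assumption; the real convention is whatever client.py uses), the lines above might become something like:

def run_helper(spec, read_artifacts_func, write_artifact_func,
               setup_connector_func, is_spark, **kwargs):
    """Runs the given spec.

    Args:
        write_artifact_func:
            Function used to write artifacts to the storage layer.
        setup_connector_func:
            Function used to set up the connectors.
        is_spark:
            Whether or not we are running in a Spark env. The only kwarg
            we expect is spark_session_obj.
    """
    ...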
)

def run_helper(
Can we call this execute_function_spec() or something? run_helper() doesn't tell me very much.
Same thing for the other run_helper() -> execute_data_spec() or something.
Describe your changes and why you are making these changes
This PR refactors the Spark Python executors and reduces the maintenance overhead.
The main technical challenge was how to share code between definitions that are nearly identical except for one additional parameter, a SparkSession. We also want to avoid importing pyspark in our normal codepaths that don't execute in Spark environments. To do this, we pass in the differing functions as parameters (read_artifacts, write_artifact, infer_type_artifact, and setup_connector) while sending the SparkSession object as a kwarg. This way the generic function passed in can be called in the same manner by both Spark and non-Spark codepaths.
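To make the shape of this concrete, here is a minimal, illustrative sketch (module names like utils and spark_utils are hypothetical, not the exact ones in this PR):

# Generic helper: no pyspark import anywhere in this module.
def run_helper(spec, read_artifacts_func, write_artifact_func,
               infer_type_artifact_func, setup_connector_func, **kwargs):
    spark_session = kwargs.get("spark_session_obj")  # set only in Spark envs
    ...

# Regular entry point.
run_helper(spec, utils.read_artifacts, utils.write_artifact,
           utils.infer_type_artifact, utils.setup_connector)

# Spark entry point: the only place pyspark is imported.
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
run_helper(spec, spark_utils.read_artifacts, spark_utils.write_artifact,
           spark_utils.infer_type_artifact, spark_utils.setup_connector,
           spark_session_obj=session)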
Related issue number (if any)
Loom demo (if any)
Checklist before requesting a review
- Run the linters (see python3 scripts/run_linters.py -h for usage).
- run_integration_test: Runs integration tests
- skip_integration_test: Skips integration tests (should be used when changes are ONLY documentation/UI)