@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR addresses the review comment at #46683 (comment) on the Python side, by binding the relation ID to the plan instead of to the DataFrame itself.

Why are the changes needed?

Because the DataFrame holds the relation ID, if DataFrame B is derived from DataFrame A and DataFrame A is garbage-collected, the cached relation might no longer exist when DataFrame B is executed. See the example below:

df = spark.range(1).localCheckpoint()
df2 = df.repartition(10)
del df
df2.collect()
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) No DataFrame with id a4efa660-897c-4500-bd4e-bd57cd0263d2 is found in the session cd4764b4-90a9-4249-9140-12a6e4a98cd3
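
The idea of binding the ID to the plan can be illustrated with a minimal, self-contained sketch. Everything in it (`FakeServer`, `CachedRelation`, `Repartition`, and this toy `DataFrame`) is a hypothetical stand-in, not the actual PySpark Connect implementation; it only assumes that a derived plan keeps a reference to its child plan, and that CPython frees an object as soon as its last reference is gone.

```python
import uuid


class FakeServer:
    """Hypothetical stand-in for the server-side cache of checkpointed relations."""

    def __init__(self) -> None:
        self.cache = {}

    def checkpoint(self) -> str:
        relation_id = str(uuid.uuid4())
        self.cache[relation_id] = "cached rows"
        return relation_id

    def execute(self, relation_id: str) -> str:
        if relation_id not in self.cache:
            raise LookupError(f"No DataFrame with id {relation_id} is found")
        return self.cache[relation_id]


class CachedRelation:
    """Hypothetical plan node that owns the relation ID and releases it when collected."""

    def __init__(self, server: FakeServer, relation_id: str) -> None:
        self.server = server
        self.relation_id = relation_id

    def __del__(self) -> None:
        self.server.cache.pop(self.relation_id, None)


class Repartition:
    """Hypothetical plan node that keeps a reference to its child plan."""

    def __init__(self, child, num_partitions: int) -> None:
        self.child = child
        self.num_partitions = num_partitions


class DataFrame:
    """Toy wrapper: it holds a plan, and the plan (not the wrapper) owns the ID."""

    def __init__(self, plan, server: FakeServer) -> None:
        self.plan = plan
        self.server = server

    def repartition(self, num_partitions: int) -> "DataFrame":
        return DataFrame(Repartition(self.plan, num_partitions), self.server)

    def collect(self) -> str:
        node = self.plan
        while not isinstance(node, CachedRelation):
            node = node.child
        return self.server.execute(node.relation_id)


server = FakeServer()
df = DataFrame(CachedRelation(server, server.checkpoint()), server)
df2 = df.repartition(10)
del df  # the wrapper is gone, but df2's plan still references the CachedRelation
result = df2.collect()  # still succeeds
del df2  # last reference to the plan is dropped, so the cache entry is released
remaining = len(server.cache)
```

In this sketch the cache entry lives exactly as long as some plan still refers to it, which is the property the failing example above lacks when the ID is tied to the DataFrame object.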

Does this PR introduce any user-facing change?

No, the main change has not been released yet.

How was this patch tested?

Manually tested, and added a unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Member Author

cc @zhengruifeng @hvanhovell

@HyukjinKwon force-pushed the SPARK-48258-followup branch from d8d5a47 to 549dae7 on May 22, 2024 02:42
@HyukjinKwon force-pushed the SPARK-48258-followup branch from 549dae7 to 21f2d40 on May 22, 2024 03:01
)

del df
gc.collect()
Member Author

Unlike on the JVM, this does trigger a full GC.
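
As a small illustration of that point, the sketch below shows `gc.collect()` running a full collection that finalizes objects which reference counting alone cannot free. The `Node` class and the `finalized` list are hypothetical names used only for this example; they are not part of the patch or its test.

```python
import gc

finalized = []


class Node:
    """Hypothetical object with a finalizer, used only for illustration."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.other = None

    def __del__(self) -> None:
        finalized.append(self.name)


a = Node("a")
b = Node("b")
a.other, b.other = b, a  # build a reference cycle

del a, b  # reference counts never reach zero because of the cycle
gc.collect()  # a full collection finds the cycle and runs both finalizers
```

After the explicit `gc.collect()` call, both finalizers have run, so a test can rely on cleanup having happened before it makes its assertions.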


def __del__(self) -> None:
    session = self._spark_session
    # If session is already closed, all cached DataFrame should be released.
Contributor

Is it possible to release those cached DataFrames on the server side?

Member Author

I don't think so. We can only tell when to release on the client side.

Member Author

Without the change, we can only know whether the session is disconnected, and we're already releasing everything in that case.
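
The client-side release discussed in this thread can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `FakeSession`, `CachedPlan`, `remove_cached`, and `is_stopped` are hypothetical names, not the real Spark Connect client API. The sketch only assumes that closing the session releases every cached relation on the server, so the finalizer has nothing to do in that case.

```python
class FakeSession:
    """Hypothetical client session; not the real SparkSession API."""

    def __init__(self) -> None:
        self.is_stopped = False
        self.server_cache = set()  # stands in for relations cached on the server
        self.release_requests = []  # records the release calls the client sends

    def remove_cached(self, relation_id: str) -> None:
        self.release_requests.append(relation_id)
        self.server_cache.discard(relation_id)

    def stop(self) -> None:
        # Closing the session releases everything that is still cached.
        self.server_cache.clear()
        self.is_stopped = True


class CachedPlan:
    """Hypothetical plan object that releases its relation when collected."""

    def __init__(self, session: FakeSession, relation_id: str) -> None:
        self._session = session
        self._relation_id = relation_id
        session.server_cache.add(relation_id)

    def __del__(self) -> None:
        session = self._session
        # If the session is already closed, everything has been released already.
        if session.is_stopped:
            return
        session.remove_cached(self._relation_id)


session = FakeSession()
p1 = CachedPlan(session, "id-1")
p2 = CachedPlan(session, "id-2")

del p1  # session is open: the client asks for "id-1" to be released
session.stop()  # closing the session releases "id-2" along with everything else
del p2  # session is closed: the finalizer sends nothing
```

Only the client knows when the last reference to a plan disappears, which is why the release request originates there; the server can only react to the session as a whole going away.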

@HyukjinKwon
Member Author

Merged to master.
