[SPARK-40839][CONNECT][PYTHON] Implement DataFrame.sample
#38310
Conversation
python/pyspark/sql/dataframe.py (outdated)
The pre-processing of sample arguments is pretty complex, so make it a static method and reuse it in connect.
If we do need to share code between pyspark and spark connect python client, we should probably add a new module like pyspark-common
Force-pushed 8953e7f to c114ba4.
Oh, does the spark connect python client depend on pyspark? Then it's not a thin client any more...
Yes, this now depends on pyspark. In fact it has depended on pyspark since the first PR. For the short term it is OK cc @HyukjinKwon
I guess we will need to make a final decision on whether it should depend on pyspark before making the Python packaging and release.
@amaliujia We should really consider this. The principle is to move code implementation to the server side as much as possible. We just moved the identifier parsing logic to the server side, and we should probably do the same for parameter default values.
This makes sense.
@zhengruifeng I am thinking you can wrap this seed into a proto message so that the server side can know whether it is set or not. In that case, the server side can do the random generation rather than using the value from the proto.
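For illustration, a minimal sketch of the server-side handling this would enable, assuming the seed is wrapped in a nested message (all names here are hypothetical, not the actual Spark Connect proto):

```python
import random

def resolve_seed(sample) -> int:
    # `sample` is a hypothetical generated Sample proto. Presence is
    # tracked for message-typed fields, so the server can distinguish
    # "seed not set" from "seed == 0".
    if sample.HasField("seed"):
        return sample.seed.seed
    # The client never set a seed: generate one on the server side.
    return random.randint(0, 2**63 - 1)
```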
This is an example: #38275
yeah, let me make this change
The default bool value for proto is False so this is probably not needed.
Oh, the Plan definition is not Optional for withReplacement. In that case, setting it to False probably makes sense.
```python
class Sample(LogicalPlan):
    def __init__(
        self,
        child: Optional["LogicalPlan"],
        lower_bound: float,
        upper_bound: float,
        with_replacement: bool,
        seed: int,
    ) -> None:
```
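On the default-value point above, a quick runnable illustration of proto3 scalar defaults, using the standard BoolValue wrapper message just to have a compiled type at hand (not the Spark Connect proto):

```python
from google.protobuf import wrappers_pb2

flag = wrappers_pb2.BoolValue()
print(flag.value)   # False: an unset proto3 bool reads as its default
flag.value = True
print(flag.value)   # True
```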
The pyspark dataframe API has
```python
@overload
def sample(self, fraction: float, seed: Optional[int] = ...) -> "DataFrame":
    ...

@overload
def sample(
    self,
    withReplacement: Optional[bool],
    fraction: float,
    seed: Optional[int] = ...,
) -> "DataFrame":
    ...
```
Can we match (as easily as copying the API into connect dataframe.py)?
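For context, a rough sketch (not the actual pyspark implementation) of the argument shuffling such overloads force on the runtime body:

```python
def sample(self, withReplacement=None, fraction=None, seed=None):
    # The first positional argument may be either a bool
    # (withReplacement) or a float (fraction), so the body must
    # inspect types to tell which overload the caller used.
    if isinstance(withReplacement, float):
        # Called as sample(fraction) or sample(fraction, seed):
        # shift the arguments one slot to the right.
        withReplacement, fraction, seed = None, withReplacement, fraction
    ...
```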
I guess we can discard those? @HyukjinKwon
Maybe my real question was: if we discard such APIs, will we have compatibility issues with existing pyspark dataframe code (which needs different imports, of course)? I see many other similar APIs in the pyspark dataframe.
Users may have to change their code for this migration, but I think this is also a chance to make some changes.
Sure. We can also go in that direction.
Maybe we should just leverage keyword-only arguments, which will make the logic much simpler. Actually we wanted to do this in the PySpark API layer in the past. Since this is a new API layer, I think it's a good chance to replace them. cc @ueshin
yes, that's a bit confusing at first glance.
Yes, if we can break the signature, it would be:
```python
def sample(
    self,
    fraction: float,
    *,
    withReplacement: Optional[bool] = None,
    seed: Optional[int] = None,
) -> "DataFrame":
    ...
```
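Illustrative call sites under that keyword-only signature (`df` is a hypothetical DataFrame):

```python
df.sample(0.1)                                 # fraction only
df.sample(0.1, withReplacement=True, seed=42)  # extras must be named
df.sample(0.1, True, 42)                       # TypeError: withReplacement
                                               # and seed are keyword-only
```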
`withReplacement` can be `: bool = False` if the default is False.
I like this idea
Force-pushed c114ba4 to 374bffc.
I need to define Seed outside of Sample, otherwise there is no HasSeed method in the generated files.
This is not true. The has* methods are generated for non-simple types.
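A short illustration of that rule, assuming a generated Sample message with a nested Seed field and a scalar with_replacement field (names hypothetical):

```python
s = Sample()                    # hypothetical generated message class
s.HasField("seed")              # OK: seed is a message-typed field
s.HasField("with_replacement")  # ValueError: plain proto3 scalars
                                # have no presence tracking
```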
You are right; maybe the jars were out of sync at that time. Let me move Seed into Sample.
Yeah, I always do a clean and then build.
Force-pushed 906f792 to 7506c56.
LGTM
Force-pushed 4c137ac to 1bd8f65.
Force-pushed 1bd8f65 to 25f4f75.
Merged to master.
Thank you guys!
### What changes were proposed in this pull request?
Implement `DataFrame.sample` in Connect
### Why are the changes needed?
for DataFrame API coverage
### Does this PR introduce _any_ user-facing change?
Yes, new API
```
def sample(
self,
fraction: float,
*,
withReplacement: bool = False,
seed: Optional[int] = None,
) -> "DataFrame":
```
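For illustration, calls under the new signature look like this (`df` is a hypothetical DataFrame):

```python
df.sample(0.5)                        # 50% sample, random seed
df.sample(0.5, seed=42)               # reproducible sample
df.sample(0.5, withReplacement=True)  # sampling with replacement
```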
### How was this patch tested?
added UT
Closes apache#38310 from zhengruifeng/connect_df_sample.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>