[SPARK-40839][CONNECT][PYTHON] Implement DataFrame.sample
#38310
Conversation
python/pyspark/sql/dataframe.py (outdated)
The pre-processing of sample arguments is pretty complex, so make it a static method and reuse it in connect.
If we do need to share code between pyspark and spark connect python client, we should probably add a new module like pyspark-common
Force-pushed 8953e7f to c114ba4.
Oh, does the spark connect python client depend on pyspark? Then it's not a thin client any more...
Yes, this now depends on pyspark. In fact it has depended on pyspark since the first PR. For the short term it is OK cc @HyukjinKwon
I guess we will need to make a final decision on whether it should depend on pyspark before making the Python packaging and release.
@amaliujia We should really consider this. The principle is to move code implementation to the server side as much as possible. We just moved the identifier parsing logic to the server side, and we should probably do the same for parameter default values.
This makes sense.
@zhengruifeng I am thinking you can wrap this seed into a proto message so that the server side can know whether it is set or not. In that case, the server side can do the random generation rather than using the value from the proto.
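For illustration, a minimal sketch of the server-side handling this would enable, assuming the seed is wrapped in a nested message (all names here are hypothetical, not the actual Spark Connect proto):

```python
import random

def resolve_seed(sample) -> int:
    # `sample` is a hypothetical generated Sample proto. Presence is
    # tracked for message-typed fields, so the server can distinguish
    # "seed not set" from "seed == 0".
    if sample.HasField("seed"):
        return sample.seed.seed
    # The client never set a seed: generate one on the server side.
    return random.randint(0, 2**63 - 1)
```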
This is an example: #38275
yeah, let me make this change
The default bool value for proto is False so this is probably not needed.
Oh, the Plan definition is not Optional for withReplacement. In that case, setting it to False probably makes sense.
```python
class Sample(LogicalPlan):
    def __init__(
        self,
        child: Optional["LogicalPlan"],
        lower_bound: float,
        upper_bound: float,
        with_replacement: bool,
        seed: int,
    ) -> None:
```
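On the default-value point above, a quick runnable illustration of proto3 scalar defaults, using the standard BoolValue wrapper message just to have a compiled type at hand (not the Spark Connect proto):

```python
from google.protobuf import wrappers_pb2

flag = wrappers_pb2.BoolValue()
print(flag.value)   # False: an unset proto3 bool reads as its default
flag.value = True
print(flag.value)   # True
```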
The pyspark dataframe API has
```python
@overload
def sample(self, fraction: float, seed: Optional[int] = ...) -> "DataFrame":
    ...

@overload
def sample(
    self,
    withReplacement: Optional[bool],
    fraction: float,
    seed: Optional[int] = ...,
) -> "DataFrame":
    ...
```
Can we match (as easily as copying the API into connect dataframe.py)?
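For context, a rough sketch (not the actual pyspark implementation) of the argument shuffling such overloads force on the runtime body:

```python
def sample(self, withReplacement=None, fraction=None, seed=None):
    # The first positional argument may be either a bool
    # (withReplacement) or a float (fraction), so the body must
    # inspect types to tell which overload the caller used.
    if isinstance(withReplacement, float):
        # Called as sample(fraction) or sample(fraction, seed):
        # shift the arguments one slot to the right.
        withReplacement, fraction, seed = None, withReplacement, fraction
    ...
```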
I guess we can discard those? @HyukjinKwon
Maybe my real question was: if we discard such APIs, will we have compatibility issues with existing pyspark dataframe code (which needs different imports, of course)? I see many other similar APIs in the pyspark dataframe.
Users may have to change their code for this migration, but I think this is also a chance to make some changes.
Sure. We can also go in that direction.
Maybe we should just leverage keyword-only arguments, which will make the logic much simpler. Actually we wanted to do this in the PySpark API layer in the past. Since this is a new API layer, I think it's a good chance to replace them. cc @ueshin
yes, that's a bit confusing at first glance.
Yes, if we can break the signature, it would be:
```python
def sample(
    self,
    fraction: float,
    *,
    withReplacement: Optional[bool] = None,
    seed: Optional[int] = None,
) -> "DataFrame":
    ...
```
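Illustrative call sites under that keyword-only signature (`df` is a hypothetical DataFrame):

```python
df.sample(0.1)                                 # fraction only
df.sample(0.1, withReplacement=True, seed=42)  # extras must be named
df.sample(0.1, True, 42)                       # TypeError: withReplacement
                                               # and seed are keyword-only
```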
`withReplacement` can be `: bool = False` if the default is False.
I like this idea
Force-pushed c114ba4 to 374bffc.
I need to define Seed outside of Sample, otherwise there is no HasSeed method in the generated files.
This is not true. The has* methods are generated for non-simple types.
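A short illustration of that rule, assuming a generated Sample message with a nested Seed field and a scalar with_replacement field (names hypothetical):

```python
s = Sample()                    # hypothetical generated message class
s.HasField("seed")              # OK: seed is a message-typed field
s.HasField("with_replacement")  # ValueError: plain proto3 scalars
                                # have no presence tracking
```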
You are right; maybe the jars were out of sync at that time. Let me move Seed into Sample.
Yeah, I always do a clean and then build.
Force-pushed 906f792 to 7506c56.
LGTM
Force-pushed 4c137ac to 1bd8f65.
Force-pushed 1bd8f65 to 25f4f75.
Merged to master.
Thank you guys!
### What changes were proposed in this pull request?
Implement `DataFrame.sample` in Connect
### Why are the changes needed?
for DataFrame API coverage
### Does this PR introduce _any_ user-facing change?
Yes, new API
```
def sample(
self,
fraction: float,
*,
withReplacement: bool = False,
seed: Optional[int] = None,
) -> "DataFrame":
```
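For illustration, calls under the new signature look like this (`df` is a hypothetical DataFrame):

```python
df.sample(0.5)                        # 50% sample, random seed
df.sample(0.5, seed=42)               # reproducible sample
df.sample(0.5, withReplacement=True)  # sampling with replacement
```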
### How was this patch tested?
added UT
Closes apache#38310 from zhengruifeng/connect_df_sample.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>