[SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan #25989

HyukjinKwon · 2019-10-01T13:23:57Z

What changes were proposed in this pull request?

This PR proposes to avoid the abstract classes introduced in #24965 and to use a trait and an object instead.

abstract class BaseArrowPythonRunner -> trait PythonArrowOutput to allow mix-in

Before:

BasePythonRunner
├── BaseArrowPythonRunner
│   ├── ArrowPythonRunner
│   └── CoGroupedArrowPythonRunner
├── PythonRunner
└── PythonUDFRunner

After:

└── BasePythonRunner
    ├── ArrowPythonRunner
    ├── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner
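
As a rough illustration of the mix-in approach, here is a minimal sketch. It is not the Spark code: `BaseRunner`, `ArrowOutput`, `ArrowRunner`, and `CoGroupedArrowRunner` below are hypothetical, heavily simplified stand-ins for `BasePythonRunner`, `PythonArrowOutput`, and the two Arrow runners, and the string-based "protocol" is invented purely for the example. The point is only that shared output handling can live in a trait, so both runners extend the base class directly.

```scala
// Hypothetical, simplified stand-in for BasePythonRunner (the real API is much larger).
abstract class BaseRunner[IN, OUT] {
  // How an input is serialized for the worker (invented for this sketch).
  protected def writeInput(input: IN): String
  // How the worker's reply is decoded (invented for this sketch).
  protected def readOutput(raw: String): OUT
  def run(input: IN): OUT = readOutput(writeInput(input))
}

// Shared output-side logic lives in a trait rather than an intermediate abstract class.
trait ArrowOutput {
  protected def readOutput(raw: String): String = s"arrow-batch($raw)"
}

// Both runners sit directly under the base class and mix in the shared behavior.
class ArrowRunner extends BaseRunner[Int, String] with ArrowOutput {
  protected def writeInput(input: Int): String = s"row:$input"
}

class CoGroupedArrowRunner extends BaseRunner[(Int, Int), String] with ArrowOutput {
  protected def writeInput(input: (Int, Int)): String =
    s"left:${input._1},right:${input._2}"
}
```

With this shape, each runner only defines how its own input is written, and the hierarchy stays one level deep.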

abstract class BasePandasGroupExec -> object PandasGroupUtils to decouple

Before:

└── BasePandasGroupExec
    ├── FlatMapGroupsInPandasExec
    └── FlatMapCoGroupsInPandasExec

After:

├── FlatMapGroupsInPandasExec
└── FlatMapCoGroupsInPandasExec
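
The same idea for the plan nodes can be sketched as follows. Again, this is an assumption-laden illustration rather than the actual change: `GroupUtils`, `FlatMapGroups`, and `FlatMapCoGroups` are hypothetical stand-ins for `PandasGroupUtils` and the two exec nodes, and the grouping logic is a toy. It shows helpers moved from a shared base class into a standalone object that both classes call, leaving the classes unrelated by inheritance.

```scala
// Hypothetical stand-in for PandasGroupUtils: shared helpers live in an object.
object GroupUtils {
  // Toy grouping helper; the real utilities operate on Spark rows and attributes.
  def groupByKey[K, V](rows: Seq[(K, V)]): Map[K, Seq[V]] =
    rows.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
}

// No common base class: each node calls the helper object directly.
class FlatMapGroups {
  def execute(rows: Seq[(String, Int)]): Map[String, Int] =
    GroupUtils.groupByKey(rows).map { case (k, vs) => (k, vs.sum) }
}

class FlatMapCoGroups {
  def execute(
      left: Seq[(String, Int)],
      right: Seq[(String, Int)]): Map[String, (Int, Int)] = {
    val l = GroupUtils.groupByKey(left)
    val r = GroupUtils.groupByKey(right)
    // Combine both sides per key; a missing side contributes 0.
    (l.keySet ++ r.keySet).map { k =>
      (k, (l.getOrElse(k, Seq.empty[Int]).sum, r.getOrElse(k, Seq.empty[Int]).sum))
    }.toMap
  }
}
```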

Why are the changes needed?

The problem is that the R code path is being kept matched with the Python side:

Python:

└── BasePythonRunner
    ├── ArrowPythonRunner
    ├── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner

R:

└── BaseRRunner
    ├── ArrowRRunner
    └── RRunner

I would like to keep the hierarchies matched and decouple the rest for now, if possible. Ideally, we should deduplicate both code paths. The internal implementations are also intentionally similar.

The BasePandasGroupExec case is similar. R (with Arrow optimization, in particular) has some code duplicated with Pandas UDFs.

FlatMapGroupsInRWithArrowExec <> FlatMapGroupsInPandasExec
MapPartitionsInRWithArrowExec <> ArrowEvalPythonExec

To prepare for deduplication here as well, it might be better to avoid changing the hierarchy on the Python side alone.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran the existing tests locally. Jenkins tests should verify this too.

HyukjinKwon · 2019-10-01T13:25:45Z

@d80tb7 and @BryanCutler, although I don't think this way is particularly better, I thought it would be better to keep them separate anyway.

The code lengths are virtually the same. Do you guys like this approach?

SparkQA · 2019-10-01T17:01:23Z

Test build #111644 has finished for PR 25989 at commit 868cda2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

d80tb7 · 2019-10-03T06:16:35Z

Hi @HyukjinKwon

It looks good to me. It's certainly no worse than what was there before, and it meets the requirement of keeping the R and Python code paths aligned.

HyukjinKwon · 2019-10-03T07:33:30Z

Thanks, @d80tb7

Merged to master.

BryanCutler · 2019-10-03T18:23:13Z

@HyukjinKwon I think I slightly prefer the way it was before, but I haven't thought too much about aligning with R runner classes. I'm all for refactoring these to deduplicate and make it easier to manage, so if this is a step towards that, then it's fine. There are a couple things I have in mind for a redesign, so it would be good if we could discuss some before jumping in.

HyukjinKwon · 2019-10-03T22:44:26Z

I don't plan to redesign it right now; I just wanted both to be similar, as they were before, for now. Sure, let's discuss when we do this. I think we might have to think about this soon, maybe after the Spark 3.0 release.

Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan

868cda2

HyukjinKwon mentioned this pull request Oct 1, 2019

[WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs #24965

Closed

dongjoon-hyun added PYSPARK SQL labels Oct 1, 2019

HyukjinKwon closed this in 40485f4 Oct 3, 2019

HyukjinKwon deleted the SPARK-29317 branch March 3, 2020 01:17
