Skip to content

Conversation

@HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR proposes to avoid abstract classes introduced at #24965 but instead uses trait and object.

  • abstract class BaseArrowPythonRunner -> trait PythonArrowOutput to allow mix-in

    Before:

    BasePythonRunner
    ├── BaseArrowPythonRunner
    │   ├── ArrowPythonRunner
    │   └── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner
    

    After:

    └── BasePythonRunner
        ├── ArrowPythonRunner
        ├── CoGroupedArrowPythonRunner
        ├── PythonRunner
        └── PythonUDFRunner
    
  • abstract class BasePandasGroupExec -> object PandasGroupUtils to decouple

    Before:

    └── BasePandasGroupExec
        ├── FlatMapGroupsInPandasExec
        └── FlatMapCoGroupsInPandasExec
    

    After:

    ├── FlatMapGroupsInPandasExec
    └── FlatMapCoGroupsInPandasExec
    

Why are the changes needed?

The problem is that R code path is being matched with Python side:

Python:

└── BasePythonRunner
    ├── ArrowPythonRunner
    ├── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner

R:

└── BaseRRunner
    ├── ArrowRRunner
    └── RRunner

I would like to match the hierarchy and decouple other stuff for now if possible. Ideally we should deduplicate both code paths. Internal implementation is also similar intentionally.

BasePandasGroupExec case is similar as well. R (with Arrow optimization, in particular) has some duplicated codes with Pandas UDFs.

FlatMapGroupsInRWithArrowExec <> FlatMapGroupsInPandasExec
MapPartitionsInRWithArrowExec <> ArrowEvalPythonExec

In order to prepare deduplication here as well, it might better avoid changing hierarchy alone in Python side.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Locally tested existing tests. Jenkins tests should verify this too.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Oct 1, 2019

@d80tb7 and @BryanCutler, although I don't think this way is particularly better, I thought it's anyway better to let them separate.

Code lengths are virtually same. Do you guys like this way?

@SparkQA
Copy link

SparkQA commented Oct 1, 2019

Test build #111644 has finished for PR 25989 at commit 868cda2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@d80tb7
Copy link
Contributor

d80tb7 commented Oct 3, 2019

Hi @HyukjinKwon

It looks good to me- it's certainly no worse than what was there before and it meets the requirement of keeping the R and Python code paths aligned.

@HyukjinKwon
Copy link
Member Author

Thanks, @d80tb7

Merged to master.

@BryanCutler
Copy link
Member

@HyukjinKwon I think I slightly prefer the way it was before, but I haven't thought too much about aligning with R runner classes. I'm all for refactoring these to deduplicate and make it easier to manage, so if this is a step towards that, then it's fine. There are a couple things I have in mind for a redesign, so it would be good if we could discuss some before jumping in.

@HyukjinKwon
Copy link
Member Author

I dont plan to redesign it right now but wanted both just to be smilar as it was before for now. Sure, let's discuss when we do this.. I think we might have to think about this soon maybe after Spark 3.0 release.

@HyukjinKwon HyukjinKwon deleted the SPARK-29317 branch March 3, 2020 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants