[DataFrame] Refactor GroupBy Methods and Implement Reindex #2101

Merged
devin-petersohn merged 23 commits into ray-project:master from kunalgosar:groupby_methods
May 22, 2018

@kunalgosar (Contributor) commented May 19, 2018

Some of the changes in this PR are:

  • Creates a test suite for groupby methods
  • Fixes many of the groupby methods
  • Fixes the case where there is only one group after groupby
  • Fixes bugs where _block_partitions was 1D
  • Implements df.reindex
  • Fixes df.apply and df.agg
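
For reference, here is a minimal sketch of the pandas `DataFrame.reindex` semantics that a pandas-compatible df.reindex would be expected to mirror. It uses plain pandas rather than the code from this PR, and the frame, labels, and variable names are made up for illustration.

```python
import pandas as pd

# Illustrative frame; the data and labels are placeholders.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# Reorder existing row labels and request a label ("w") that does not exist.
reindexed = df.reindex(index=["z", "x", "w"])

# Rows for labels missing from the original frame are filled with NaN.
missing_row_is_nan = bool(reindexed.loc["w"].isna().all())

# Columns can be reindexed the same way.
cols = df.reindex(columns=["b", "a"])
```

Because requested labels that are absent produce NaN rows, a distributed implementation has to handle both reordering and filling, not just a permutation of existing partitions.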

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5497/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5498/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5517/

Contributor Author

Make _block_partitions a property and move the check to there.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5523/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5525/

Member

Move to just before the DataFrameGroupBy object is used.

Member

remove

Member

remove

Member

prefer axis=self._axis

Member

You can use utils._reindex_helper to reorder the columns/rows more efficiently. Just make sure you reassign new_df.index or new_df.columns afterwards, whichever one applies.

Member

Same here for utils._reindex_helper

Member

Remove sort_index() from the checks in this file.

Member

Move this next part to a utils function and call from within the _block_partitions property.

Contributor Author

I can't move it to a property because it depends on axis, but I have moved it to a utils function.
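
To illustrate why the check needs the axis, here is a rough sketch of an axis-aware helper that promotes a 1D array of blocks to 2D. The name `ensure_2d_blocks` is hypothetical, and the choice of which dimension to add for each axis is an assumption; this is not the function from the PR.

```python
import numpy as np


def ensure_2d_blocks(blocks, axis):
    """Return `blocks` as a 2D object array, adding a dimension if it is 1D.

    Hypothetical helper. Assumption: for axis=0 the 1D blocks become a single
    column of blocks, and for axis=1 they become a single row of blocks.
    """
    blocks = np.asarray(blocks, dtype=object)
    if blocks.ndim < 2:
        # axis ^ 1 flips 0 <-> 1, so the new dimension goes on the other axis.
        return np.expand_dims(blocks, axis=axis ^ 1)
    return blocks


# Placeholder strings stand in for partition object IDs.
parts = ["p0", "p1", "p2"]
print(ensure_2d_blocks(parts, axis=0).shape)  # (3, 1)
print(ensure_2d_blocks(parts, axis=1).shape)  # (1, 3)
```

A plain property has no way to receive `axis`, which is consistent with keeping this logic in a standalone function that callers invoke with the axis they are operating on.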

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5529/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5553/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5558/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5560/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5566/



@ray.remote
def _deploy_generic_func(func, *args):
Member

I don't know that we need this. I see how you're using it, but for now I would prefer to use _deploy_func, like everything else, and pass in a row/column partition.

if index is not None:
    old_index = self.index
    new_blocks = np.array([_deploy_generic_func._submit(
        args=(tuple([reindex_helper, old_index, index, 1,
Member

Instead of tuple([...] + block.tolist()), you can write (...) + tuple(block.tolist()). I think that reads more clearly.
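
A small sketch of the two spellings, to show that they build the same argument tuple. The values below are placeholders: the real leading arguments are cut off in the excerpt above, and `block_list` stands in for `block.tolist()`.

```python
# Placeholder values; the actual arguments in the PR are not fully shown.
helper, old_index, new_index, axis = "reindex_helper", ["a", "b"], ["b", "a"], 1
block_list = ["part0", "part1"]  # stands in for block.tolist()

# Original spelling: build one list, then convert the whole thing to a tuple.
original = tuple([helper, old_index, new_index, axis] + block_list)

# Suggested spelling: start from a tuple literal and append the converted block.
suggested = (helper, old_index, new_index, axis) + tuple(block_list)

print(original == suggested)  # True
```

The suggested form keeps the fixed leading arguments visually separate from the variable-length block entries.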

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5573/

@devin-petersohn merged commit 4584193 into ray-project:master on May 22, 2018

@devin-petersohn (Member)

Passes on private-travis. Thanks @kunalgosar!

@kunalgosar deleted the groupby_methods branch on May 23, 2018 05:24
alok added a commit to alok/ray that referenced this pull request May 24, 2018
* master:
  [DataFrame] Refactor GroupBy Methods and Implement Reindex (ray-project#2101)
  Initial Support for Airspeed Velocity (ray-project#2113)
  Use automatic memory management in Redis modules. (ray-project#1797)
  [DataFrame] Test bugfixes (ray-project#2111)
  [DataFrame] Update initializations of IndexMetadata which use outdated APIs (ray-project#2103)
alok added a commit to alok/ray that referenced this pull request May 25, 2018
* master:
  Prototype named actors. (ray-project#2129)
  Update arrow to latest master (ray-project#2100)
  [DataFrame] Speed up dtypes (ray-project#2118)
  do not fetch from dead Plasma Manager (ray-project#2116)
  [DataFrame] Refactor GroupBy Methods and Implement Reindex (ray-project#2101)
  Initial Support for Airspeed Velocity (ray-project#2113)
  Use automatic memory management in Redis modules. (ray-project#1797)
  [DataFrame] Test bugfixes (ray-project#2111)
  [DataFrame] Update initializations of IndexMetadata which use outdated APIs (ray-project#2103)