[DataFrame] Implement sort_values and sort_index #1977

devin-petersohn · 2018-05-01T22:45:57Z

Implements sort_values and sort_index.

TODO:

Better error checking
Sanity tests

AmplabJenkins · 2018-05-01T23:17:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5135/

AmplabJenkins · 2018-05-02T05:36:52Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5143/

devin-petersohn · 2018-05-03T03:50:37Z

Jenkins, retest this please

AmplabJenkins · 2018-05-03T04:55:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5155/

kunalgosar

This looks really good. For sort_values you might be able to just sort the data on the driver and reindex on the new sorted index.

kunalgosar · 2018-05-04T10:12:33Z

python/ray/dataframe/dataframe.py

+            else:
+                df.columns = index
+
+            return df.sort_index(*args)


The index should be reset to a RangeIndex after this operation
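A minimal sketch of what this comment asks for, written against plain pandas rather than the Ray DataFrame internals. The helper name `sort_partition_by_labels` and its parameters are hypothetical and do not appear in the PR; the sketch assumes each partition is expected to carry a default positional index once the sort is done.

```python
import pandas as pd


def sort_partition_by_labels(partition, labels, axis=0, ascending=True):
    """Hypothetical helper: sort one partition by externally supplied labels,
    then restore a positional RangeIndex on the sorted axis."""
    df = partition.copy()
    # Temporarily attach the real labels so sort_index can order by them.
    if axis == 0:
        df.index = labels
    else:
        df.columns = labels

    df = df.sort_index(axis=axis, ascending=ascending)

    # Drop the temporary labels again so the partition is positional.
    if axis == 0:
        df = df.reset_index(drop=True)
    else:
        df.columns = pd.RangeIndex(len(df.columns))
    return df


part = pd.DataFrame({"x": [10, 20, 30]})
result = sort_partition_by_labels(part, ["c", "a", "b"])
```

Here the rows are reordered as if they were labelled `a`, `b`, `c`, but the returned frame no longer carries those labels.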

kunalgosar · 2018-05-04T10:13:22Z

python/ray/dataframe/dataframe.py

+            return df.sort_index(*args)
+
+        if axis == 0:
+            index = (self.index)


Why is this in parentheses?

kunalgosar · 2018-05-04T10:29:52Z

python/ray/dataframe/dataframe.py

+                broadcast_values.columns = df.columns
+                names = broadcast_values.index
+
+            return pd.concat([df, broadcast_values], axis=axis ^ 1,


Can the broadcast_values be sorted alone (which is done anyway below) and then the new index be used to reindex each of the partitions?

Unfortunately that is not always faster.

When it is slower it can be up to 3x slower, so to avoid that worst case we will leave it like this for now.
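For readers following the thread, the alternative being discussed can be sketched in plain pandas as below. This illustrates the reviewer's suggestion, not what the PR does; as the reply notes, it was not always faster in practice. The frame, the column names, and the split into column partitions are all made up for illustration.

```python
import pandas as pd

# One logical frame, split into two column partitions that share a row index.
full = pd.DataFrame({"key": [3, 1, 2, 6, 5, 4], "val": list("abcdef")})
col_partitions = [full[["key"]], full[["val"]]]

# Step 1: sort only the key column and keep the resulting row order.
new_order = full["key"].sort_values(kind="mergesort").index

# Step 2: apply that order to every partition independently.
sorted_parts = [part.reindex(new_order) for part in col_partitions]
result = pd.concat(sorted_parts, axis=1)

# The reassembled frame matches sorting the whole frame at once.
expected = full.sort_values("key", kind="mergesort")
pd.testing.assert_frame_equal(result, expected)
```

The trade-off is that only the key data is sorted, but every partition then pays for a `reindex`, which may explain why the approach is not uniformly faster.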

kunalgosar

Looks really good. A few more comments.

kunalgosar · 2018-05-06T04:17:45Z

python/ray/dataframe/test/test_dataframe.py

+    ray_df_equals_pandas(ray_result, pandas_result)


 def test_sort_values():


Use a pytest fixture here; other tests can benefit from the large random dataframe being constructed.

At some point I think this should happen, but this probably isn't the PR to go through and make all these changes to the tests.

Okay, this can be done separately in the future.
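A rough sketch of what the deferred fixture could look like, assuming pytest and plain pandas. The names `make_large_random_frame`, `large_random_frame`, and the two `*_sketch` tests are hypothetical and not part of this PR; the real tests would also compare against a Ray DataFrame.

```python
import numpy as np
import pandas as pd
import pytest


def make_large_random_frame(seed=0):
    """Build the same shape of random frame the sort tests construct inline."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame(rng.randint(0, 100, size=(1000, 100)))


@pytest.fixture(scope="module")
def large_random_frame():
    # Built once per test module and shared by every test that requests it.
    return make_large_random_frame()


def test_sort_index_sketch(large_random_frame):
    result = large_random_frame.sort_index(ascending=False)
    assert result.index[0] == 999


def test_sort_values_sketch(large_random_frame):
    result = large_random_frame.sort_values(by=0)
    assert result[0].is_monotonic_increasing
```

Because `sort_index` and `sort_values` return new frames by default, sharing one module-scoped frame across tests is safe as long as no test sorts in place.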

kunalgosar · 2018-05-06T04:18:16Z

python/ray/dataframe/test/test_dataframe.py


 def test_sort_index():
-    ray_df = create_test_dataframe()
+    pandas_df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 100)))


NIT: Share data between the two tests.

kunalgosar · 2018-05-06T04:19:56Z

python/ray/dataframe/dataframe.py

+                row.index = [str(idx)]
+
+            # Put this here to match the by below.
+            by = [str(col) for col in by]


This code is duplicated below.

kunalgosar · 2018-05-06T04:29:12Z

python/ray/dataframe/dataframe.py

+            by = [by]
+
+        if axis == 0:
+            broadcast_value_dict = {str(col): self[col] for col in by}


Can this be done as a single getitem call? You can pass in a list of column names.

It returns a ray DataFrame, so we'd have to to_pandas it. It's slower overall to build that DataFrame than getitem multiple times.
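To make the two options concrete, here is a small plain-pandas comparison. It only shows that both forms select the same data; it says nothing about the relative cost inside Ray, where the reply above explains that the list form returns a Ray DataFrame that would still need converting. The frame and the `by` list are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({0: [3, 1, 2], 1: [9, 8, 7], 2: [4, 5, 6]})
by = [0, 2]

# Option used in the diff: one getitem per column, keyed by the string name.
per_column = {str(col): df[col] for col in by}

# Reviewer's suggestion: one getitem with a list of column names.
single = df[by].copy()
single.columns = [str(col) for col in by]

# Both approaches hold the same values under the same string keys.
same = all(
    per_column[name].tolist() == single[name].tolist() for name in per_column
)
```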

AmplabJenkins · 2018-05-06T04:45:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5233/

AmplabJenkins · 2018-05-06T06:30:57Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5235/


devin-petersohn added 3 commits May 1, 2018 15:19

Start sort implementation

e5289c5

Working on axis=1 for sort

bece269

Fixing sort implementation

02e6edf

Add tests and fix bug

7e8a9e3

kunalgosar suggested changes May 4, 2018


Addressing comments

a26b010

kunalgosar approved these changes May 6, 2018


Removing duplicate code

3bcad98

robertnishihara approved these changes May 6, 2018


robertnishihara merged commit ad1afeb into ray-project:master May 6, 2018

4 participants
