Skip to content

[DataFrame] Speed up dtypes#2118

Merged
devin-petersohn merged 2 commits intoray-project:masterfrom
pschafhalter:df-speed-up-dtypes
May 23, 2018
Merged

[DataFrame] Speed up dtypes#2118
devin-petersohn merged 2 commits intoray-project:masterfrom
pschafhalter:df-speed-up-dtypes

Conversation

@pschafhalter
Copy link
Contributor

Summary

Speeds up dtypes. In order to save memory, this PR gets smaller dataframes from the partitions which are merged together to find dtypes. This saves memory and offers performance benefits because it does not create a new version of _block_partitions with similar data.

This PR deprecates #2088.

Performance tests on 762 MB of data

Code (copied from iPython)

frame_data = np.random.randint(0,100,size=(10**6, 100))

%%time
df = pd.DataFrame(frame_data)
df.dropna(inplace=True)
repr(df)

On Master

CPU times: user 850 ms, sys: 1.1 s, total: 1.95 s
Wall time: 3.35 s

Number of objects in object table = 336
Total size of objects in object table in MB: 5456.237922668457

Master modified so dtypes doesn't access Ray/the object store

CPU times: user 851 ms, sys: 1.08 s, total: 1.93 s
Wall time: 3.01 s

Number of objects in object table = 256
Total size of objects in object table in MB: 4631.916297912598

Current PR

CPU times: user 910 ms, sys: 1.08 s, total: 1.99 s
Wall time: 3.13 s

Number of objects in object table = 393
Total size of objects in object table in MB: 4632.369149208069

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5565/
Test FAILed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If series are concatenated on axis=1, it returns a dataframe, which here would return a pandas DataFrame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this function since it's not needed for the new design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas has some useful library functions: pandas/core/dtypes/cast.py, which can coalesce and upcast dypes as needed. It might be useful to investigate these to see if we can use their rules directly to get dtypes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further investigation, the overhead of df.loc[0:0] and concatenating these along columns is not much, and so might just be better to combine the remote tasks to work on column aligned blocks. And then do the ray.get and concat on the driver in the getter method, as discussed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will implement the second solution, as we discussed offline.

@pschafhalter
Copy link
Contributor Author

Redesigned speedup according to discussion with @kunalgosar. Also re-ran performance tests on a wider dataframe:

Code

frame_data = np.random.randint(0,100,size=(10000, 10000))

%%time
df = pd.DataFrame(frame_data)
df.dropna(inplace=True)
repr(df)

Master

CPU times: user 500 ms, sys: 826 ms, total: 1.33 s
Wall time: 2.14 s

Number of objects in object table = 336
Total size of objects in object table in MB: 5346.965839385986

Master modified so dtypes doesn't access Ray/the object store

Note: this returns an incorrect result for dtypes and should be the lower bound for this PR.

CPU times: user 433 ms, sys: 890 ms, total: 1.32 s
Wall time: 1.81 s

Number of objects in object table = 256
Total size of objects in object table in MB: 4582.445110321045

Previous commit

CPU times: user 562 ms, sys: 804 ms, total: 1.37 s
Wall time: 1.92 s

Number of objects in object table = 393
Total size of objects in object table in MB: 4584.221190452576

Current commit

CPU times: user 464 ms, sys: 846 ms, total: 1.31 s
Wall time: 1.82 s

Number of objects in object table = 264
Total size of objects in object table in MB: 4582.500843048096

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5569/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5571/
Test FAILed.

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor nits. This looks great!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To stay consistent with the Ray formatting, can we have this be a single line description. Further description can be added two lines below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is _correct_dtypes the right name to use now? Would _get_remote_dtypes be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to use _find_dtypes since get implies that it performs a blocking ray.get

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: column to column_of_blocks. For readability

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: column to column_of_blocks for clarity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another naming nit: _get_column_dtypes to _compile_remote_dtypes.

@pschafhalter pschafhalter self-assigned this May 22, 2018
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5579/
Test PASSed.

Further dtypes performance optimizations

Fix bugs

Redesign speedup

Address feedback
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5583/
Test FAILed.

@devin-petersohn
Copy link
Member

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5591/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5599/
Test PASSed.

@pschafhalter
Copy link
Contributor Author

All tests passed on private travis.

@devin-petersohn devin-petersohn merged commit 68b11c8 into ray-project:master May 23, 2018
@devin-petersohn
Copy link
Member

Merged, thanks @pschafhalter!

@pschafhalter pschafhalter deleted the df-speed-up-dtypes branch May 24, 2018 00:03
alok added a commit to alok/ray that referenced this pull request May 25, 2018
* master:
  Prototype named actors. (ray-project#2129)
  Update arrow to latest master (ray-project#2100)
  [DataFrame] Speed up dtypes (ray-project#2118)
  do not fetch from dead Plasma Manager (ray-project#2116)
  [DataFrame] Refactor GroupBy Methods and Implement Reindex (ray-project#2101)
  Initial Support for Airspeed Velocity (ray-project#2113)
  Use automatic memory management in Redis modules. (ray-project#1797)
  [DataFrame] Test bugfixes (ray-project#2111)
  [DataFrame] Update initializations of IndexMetadata which use outdated APIs (ray-project#2103)
alok added a commit to alok/ray that referenced this pull request May 30, 2018
* fix-a3c-torch:
  Prototype named actors. (ray-project#2129)
  Update arrow to latest master (ray-project#2100)
  [DataFrame] Speed up dtypes (ray-project#2118)
  do not fetch from dead Plasma Manager (ray-project#2116)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants