[DataFrame] Speed up dtypes by pschafhalter · Pull Request #2118 · ray-project/ray

pschafhalter · 2018-05-22T07:33:47Z

Summary

Speeds up dtypes. Instead of creating a new version of _block_partitions that holds similar data, this PR fetches small dataframes from the partitions and merges them to find the dtypes. This saves memory and improves performance.

This PR supersedes #2088.

Performance tests on 762 MB of data

Code (copied from IPython)

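import numpy as np
import ray.dataframe as pd  # assumed: pd is the Ray DataFrame module under test

frame_data = np.random.randint(0, 100, size=(10**6, 100))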

%%time
df = pd.DataFrame(frame_data)
df.dropna(inplace=True)
repr(df)
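As a quick sanity check on the 762 MB figure in the heading, the array size can be worked out by hand. This is a sketch that assumes each element is stored as an 8-byte integer (int64), which the PR does not state explicitly.

```python
# Back-of-the-envelope size of the benchmark array.
# Assumption: each element is an 8-byte integer (int64).
rows, cols = 10**6, 100
bytes_per_element = 8

size_mib = rows * cols * bytes_per_element / 2**20
print(f"{size_mib:.1f} MiB")  # prints 762.9 MiB
```

The wider 10000 x 10000 frame used later in the thread has the same number of elements, so it is the same size under the same assumption.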

On Master

CPU times: user 850 ms, sys: 1.1 s, total: 1.95 s
Wall time: 3.35 s

Number of objects in object table = 336
Total size of objects in object table in MB: 5456.237922668457

Master modified so dtypes doesn't access Ray/the object store

CPU times: user 851 ms, sys: 1.08 s, total: 1.93 s
Wall time: 3.01 s

Number of objects in object table = 256
Total size of objects in object table in MB: 4631.916297912598

Current PR

CPU times: user 910 ms, sys: 1.08 s, total: 1.99 s
Wall time: 3.13 s

Number of objects in object table = 393
Total size of objects in object table in MB: 4632.369149208069

AmplabJenkins · 2018-05-22T08:05:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5565/

kunalgosar · 2018-05-22T08:30:23Z

python/ray/dataframe/utils.py

Concatenating Series along axis=1 returns a DataFrame, so here this function would return a pandas DataFrame.

I'll remove this function since it's not needed for the new design.
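The concatenation behaviour raised in this thread can be checked with plain pandas. This is a minimal illustration with made-up Series; it is not the function from utils.py that was under review.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], name="a")
s2 = pd.Series([0.5, 1.5, 2.5], name="b")

# Along axis=0 the result is still a Series...
stacked = pd.concat([s1, s2], axis=0)
# ...but along axis=1 each Series becomes a column of a DataFrame.
side_by_side = pd.concat([s1, s2], axis=1)

print(type(stacked).__name__)
print(type(side_by_side).__name__)
```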

kunalgosar · 2018-05-22T09:19:47Z

python/ray/dataframe/dataframe.py

Pandas has some useful library functions in pandas/core/dtypes/cast.py that can coalesce and upcast dtypes as needed. It might be worth investigating whether we can use their rules directly to get dtypes.
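For a sense of what those rules look like, here is a small sketch using find_common_type from pandas/core/dtypes/cast.py. This is an internal pandas module rather than public API, so its location and behaviour may differ between pandas versions; the comment above does not say which helper was meant, so picking this one is an assumption.

```python
import numpy as np
# Internal pandas helper, not part of the public API (assumption: this is
# the kind of function the comment refers to).
from pandas.core.dtypes.cast import find_common_type

# Integers mixed with floats are upcast to a float dtype.
int_and_float = find_common_type([np.dtype("int64"), np.dtype("float64")])
# Anything mixed with object falls back to object.
int_and_object = find_common_type([np.dtype("int64"), np.dtype("object")])

print(int_and_float, int_and_object)
```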

On further investigation, the overhead of taking df.loc[0:0] from each block and concatenating the results along columns is small, so it might be better to combine the remote tasks so that each one works on a column-aligned set of blocks, and then do the ray.get and concat on the driver in the getter method, as discussed.

Will implement the second solution, as we discussed offline.
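The second approach can be sketched locally with plain pandas. This is an illustration under assumptions, not the code merged in this PR: the helper names column_dtypes and find_dtypes are hypothetical, the per-column step would be a Ray remote task in the real design, and the final concat would follow a ray.get on the driver.

```python
import pandas as pd


def column_dtypes(column_of_blocks):
    """Find the dtypes for one column-aligned stack of blocks."""
    # One row per block is enough for pandas to coalesce the dtypes.
    # iloc[:1] is used here instead of loc[0:0] so the sketch does not
    # depend on the blocks' index labels.
    small_dfs = [block.iloc[:1] for block in column_of_blocks]
    return pd.concat(small_dfs).dtypes


def find_dtypes(columns_of_blocks):
    """Combine the per-column results into one Series of dtypes."""
    per_column = [column_dtypes(col) for col in columns_of_blocks]
    return pd.concat(per_column, ignore_index=True)


# Two column-aligned stacks of blocks; the left stack mixes ints and floats.
left = [pd.DataFrame({0: [1, 2], 1: [3, 4]}),
        pd.DataFrame({0: [1.5, 2.5], 1: [5, 6]})]
right = [pd.DataFrame({0: [True, False]}),
         pd.DataFrame({0: [False, True]})]

dtypes = find_dtypes([left, right])
print(dtypes)
```

Only one row per block is moved and concatenated, which is why this costs far less than rebuilding the full partitions.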

pschafhalter · 2018-05-22T09:49:32Z

Redesigned the speedup following the discussion with @kunalgosar. Also re-ran the performance tests on a wider dataframe:

Code

frame_data = np.random.randint(0, 100, size=(10000, 10000))

%%time
df = pd.DataFrame(frame_data)
df.dropna(inplace=True)
repr(df)

Master

CPU times: user 500 ms, sys: 826 ms, total: 1.33 s
Wall time: 2.14 s

Number of objects in object table = 336
Total size of objects in object table in MB: 5346.965839385986

Master modified so dtypes doesn't access Ray/the object store

Note: this variant returns an incorrect result for dtypes, so its numbers should be read as a lower bound on what this PR can achieve.

CPU times: user 433 ms, sys: 890 ms, total: 1.32 s
Wall time: 1.81 s

Number of objects in object table = 256
Total size of objects in object table in MB: 4582.445110321045

Previous commit

CPU times: user 562 ms, sys: 804 ms, total: 1.37 s
Wall time: 1.92 s

Number of objects in object table = 393
Total size of objects in object table in MB: 4584.221190452576

Current commit

CPU times: user 464 ms, sys: 846 ms, total: 1.31 s
Wall time: 1.82 s

Number of objects in object table = 264
Total size of objects in object table in MB: 4582.500843048096
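To put the numbers above side by side, the relative change from master to the current commit can be computed directly. The figures are copied from this comment; each is a single run, so small differences in wall time may be noise.

```python
# Figures copied from the wide-dataframe benchmark in this comment.
results = {
    "master": {"wall_s": 2.14, "objects": 336, "size_mb": 5346.965839385986},
    "current commit": {"wall_s": 1.82, "objects": 264, "size_mb": 4582.500843048096},
}

base, new = results["master"], results["current commit"]
changes = {}
for key in ("wall_s", "objects", "size_mb"):
    # Relative change in percent; negative means the current commit is lower.
    changes[key] = (new[key] - base[key]) / base[key] * 100
    print(f"{key}: {base[key]} -> {new[key]} ({changes[key]:+.1f}%)")
```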

AmplabJenkins · 2018-05-22T09:53:58Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5569/

AmplabJenkins · 2018-05-22T10:07:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5571/

devin-petersohn

Left a few minor nits. This looks great!

devin-petersohn · 2018-05-22T19:04:33Z

python/ray/dataframe/dataframe.py

To stay consistent with Ray's formatting, can we make this a single-line description? Further description can be added two lines below.

devin-petersohn · 2018-05-22T19:05:27Z

python/ray/dataframe/dataframe.py

Is _correct_dtypes the right name to use now? Would _get_remote_dtypes be better?

It might be better to use _find_dtypes, since get implies that it performs a blocking ray.get.
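The distinction behind this naming nit can be illustrated with the standard library's concurrent.futures as a stand-in for Ray. This is only an analogy and not Ray's API: the find-style function returns a handle immediately, while the get-style function blocks until the value is ready. All names here are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)


def slow_dtype_lookup():
    time.sleep(0.2)  # stand-in for remote work
    return "int64"


def find_dtypes():
    """Start the work and return a handle immediately (non-blocking)."""
    return executor.submit(slow_dtype_lookup)


def get_dtypes():
    """Wait for the work to finish and return the value (blocking)."""
    return find_dtypes().result()


handle = find_dtypes()  # returns right away with a Future
value = get_dtypes()    # blocks until the result is ready
print(type(handle).__name__, value)
executor.shutdown()
```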

devin-petersohn · 2018-05-22T19:08:14Z

python/ray/dataframe/dataframe.py

Nit: rename column to column_of_blocks for readability.

devin-petersohn · 2018-05-22T19:08:31Z

python/ray/dataframe/utils.py

Nit: rename column to column_of_blocks for clarity.

Another naming nit: rename _get_column_dtypes to _compile_remote_dtypes.

AmplabJenkins · 2018-05-22T23:57:59Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5579/

AmplabJenkins · 2018-05-23T00:36:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5583/

devin-petersohn · 2018-05-23T05:27:47Z

Jenkins, retest this please.

AmplabJenkins · 2018-05-23T06:35:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5591/

AmplabJenkins · 2018-05-23T23:10:47Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5599/

pschafhalter · 2018-05-23T23:34:02Z

All tests passed on a private Travis build.

devin-petersohn · 2018-05-23T23:35:26Z

Merged, thanks @pschafhalter!

* master: Prototype named actors. (ray-project#2129) Update arrow to latest master (ray-project#2100) [DataFrame] Speed up dtypes (ray-project#2118) do not fetch from dead Plasma Manager (ray-project#2116) [DataFrame] Refactor GroupBy Methods and Implement Reindex (ray-project#2101) Initial Support for Airspeed Velocity (ray-project#2113) Use automatic memory management in Redis modules. (ray-project#1797) [DataFrame] Test bugfixes (ray-project#2111) [DataFrame] Update initializations of IndexMetadata which use outdated APIs (ray-project#2103)

* fix-a3c-torch: Prototype named actors. (ray-project#2129) Update arrow to latest master (ray-project#2100) [DataFrame] Speed up dtypes (ray-project#2118) do not fetch from dead Plasma Manager (ray-project#2116)

pschafhalter force-pushed the df-speed-up-dtypes branch from 631d5b8 to 7b1bd74 Compare May 22, 2018 08:37

kunalgosar suggested changes May 22, 2018

devin-petersohn reviewed May 22, 2018

pschafhalter self-assigned this May 22, 2018

Don't recreate _block_partitions in _correct_dtypes

22d1472

Further dtypes performance optimizations
Fix bugs
Redesign speedup
Address feedback

pschafhalter force-pushed the df-speed-up-dtypes branch from a568ed8 to 22d1472 Compare May 23, 2018 00:09

Remove _correct_column_dtypes

3b45148

devin-petersohn approved these changes May 23, 2018

devin-petersohn merged commit 68b11c8 into ray-project:master May 23, 2018

pschafhalter mentioned this pull request May 23, 2018

[DataFrame] Fix bug with consistency between IndexMetadata and partitions #2088

Closed

pschafhalter deleted the df-speed-up-dtypes branch May 24, 2018 00:03
