[DataFrame] Speed up dtypes#2118
Test FAILed.
631d5b8 to 7b1bd74
python/ray/dataframe/utils.py
Outdated
If series are concatenated on axis=1, it returns a dataframe, which here would return a pandas DataFrame.
I'll remove this function since it's not needed for the new design.
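The behaviour described in this thread can be checked in plain pandas. The snippet below is only an illustration with made-up Series names (`ints`, `floats`); it is not code from this PR.

```python
import pandas as pd

# Two small Series with illustrative names (not from the PR).
ints = pd.Series([1, 2, 3], name="ints")
floats = pd.Series([0.1, 0.2, 0.3], name="floats")

# Concatenating along axis=0 stacks the values and stays a Series.
stacked = pd.concat([ints, floats], axis=0)

# Concatenating along axis=1 places the Series side by side,
# so the result is a pandas DataFrame with one column per Series.
side_by_side = pd.concat([ints, floats], axis=1)

print(type(stacked).__name__)
print(type(side_by_side).__name__)
```

Any helper that concatenates Series on axis=1 therefore has to expect a pandas DataFrame back, which is the concern raised above.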
python/ray/dataframe/dataframe.py
Outdated
Pandas has some useful library functions in pandas/core/dtypes/cast.py, which can coalesce and upcast dtypes as needed. It might be useful to investigate these to see if we can use their rules directly to get dtypes.
On further investigation, the overhead of df.loc[0:0] and concatenating the results along columns is small, so it might be better to combine the remote tasks to work on column-aligned blocks, and then do the ray.get and concat on the driver in the getter method, as discussed.
Will implement the second solution, as we discussed offline.
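For readers following the thread, here is a rough sketch of the idea discussed above, in plain pandas with no Ray involved. It is not the implementation in this PR: the names `find_column_dtypes`, `compile_dtypes`, and `block_grid` are hypothetical, and it assumes the blocks are ordinary pandas DataFrames laid out as a grid of row partitions by column partitions. It uses a one-row `iloc[:1]` slice of each block as the small slice.

```python
import numpy as np
import pandas as pd


def find_column_dtypes(column_of_blocks):
    """Return dtypes for one column of row-partitioned blocks.

    Hypothetical helper: only the first row of each block is kept,
    so the concat is cheap, and pandas applies its own upcasting
    rules when blocks disagree (for example int64 + float64).
    """
    heads = [block.iloc[:1] for block in column_of_blocks]
    return pd.concat(heads, axis=0, ignore_index=True).dtypes


def compile_dtypes(block_grid):
    """Combine per-column dtypes into a single Series.

    Assumes block_grid[i][j] is the block for row partition i and
    column partition j (an assumption made for this sketch).
    """
    n_col_partitions = len(block_grid[0])
    per_column = [
        find_column_dtypes([row[j] for row in block_grid])
        for j in range(n_col_partitions)
    ]
    # Each entry is a Series indexed by column name; stack them.
    return pd.concat(per_column)


# Column "a" is int64 in the top block and float64 in the bottom one.
top_left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom_left = pd.DataFrame({"a": [0.5, 1.5], "b": [5, 6]})
top_right = pd.DataFrame({"c": [True, False]})
bottom_right = pd.DataFrame({"c": [False, True]})

dtypes = compile_dtypes([[top_left, top_right],
                         [bottom_left, bottom_right]])
print(dtypes)
```

In a distributed setting, the per-column step would presumably run as a remote task, with only the final concat happening on the driver; that part is left out here so the sketch stays runnable without a cluster.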
Redesigned speedup according to discussion with @kunalgosar. Also re-ran performance tests on a wider dataframe:

Code

    frame_data = np.random.randint(0,100,size=(10000, 10000))
    %%time
    df = pd.DataFrame(frame_data)
    df.dropna(inplace=True)
    repr(df)

Master
CPU times: user 500 ms, sys: 826 ms, total: 1.33 s
Number of objects in object table = 336

Master modified so dtypes doesn't access Ray/the object store
Note: this returns an incorrect result for dtypes and should be the lower bound for this PR.
CPU times: user 433 ms, sys: 890 ms, total: 1.32 s
Number of objects in object table = 256

Previous commit
CPU times: user 562 ms, sys: 804 ms, total: 1.37 s
Number of objects in object table = 393

Current commit
CPU times: user 464 ms, sys: 846 ms, total: 1.31 s
Number of objects in object table = 264
Test PASSed.
Test FAILed.
devin-petersohn left a comment
Left a few minor nits. This looks great!
python/ray/dataframe/dataframe.py
Outdated
To stay consistent with the Ray formatting, can we make this a single-line description? Further description can be added two lines below.
python/ray/dataframe/dataframe.py
Outdated
Is _correct_dtypes the right name to use now? Would _get_remote_dtypes be better?
Might be better to use _find_dtypes, since get implies that it performs a blocking ray.get.
python/ray/dataframe/dataframe.py
Outdated
Nit: rename column to column_of_blocks for readability.
python/ray/dataframe/utils.py
Outdated
Nit: rename column to column_of_blocks for clarity.
Another naming nit: _get_column_dtypes to _compile_remote_dtypes.
Test PASSed.
Further dtypes performance optimizations
Fix bugs
Redesign speedup
Address feedback
a568ed8 to 22d1472
Test FAILed.
Jenkins, retest this please.
Test PASSed.
Test PASSed.
All tests passed on private travis.
Merged, thanks @pschafhalter!
* master:
  Prototype named actors. (ray-project#2129)
  Update arrow to latest master (ray-project#2100)
  [DataFrame] Speed up dtypes (ray-project#2118)
  do not fetch from dead Plasma Manager (ray-project#2116)
  [DataFrame] Refactor GroupBy Methods and Implement Reindex (ray-project#2101)
  Initial Support for Airspeed Velocity (ray-project#2113)
  Use automatic memory management in Redis modules. (ray-project#1797)
  [DataFrame] Test bugfixes (ray-project#2111)
  [DataFrame] Update initializations of IndexMetadata which use outdated APIs (ray-project#2103)
* fix-a3c-torch:
  Prototype named actors. (ray-project#2129)
  Update arrow to latest master (ray-project#2100)
  [DataFrame] Speed up dtypes (ray-project#2118)
  do not fetch from dead Plasma Manager (ray-project#2116)
Summary
Speeds up dtypes. This PR gets smaller dataframes from the partitions and merges them together to find dtypes. This saves memory and offers performance benefits because it does not create a new version of _block_partitions with similar data.
This PR deprecates #2088.
Performance tests on 762 MB of data
Code (copied from IPython)
On Master
CPU times: user 850 ms, sys: 1.1 s, total: 1.95 s
Wall time: 3.35 s
Number of objects in object table = 336
Total size of objects in object table in MB: 5456.237922668457
Master modified so dtypes doesn't access Ray/the object store
CPU times: user 851 ms, sys: 1.08 s, total: 1.93 s
Wall time: 3.01 s
Number of objects in object table = 256
Total size of objects in object table in MB: 4631.916297912598
Current PR
CPU times: user 910 ms, sys: 1.08 s, total: 1.99 s
Wall time: 3.13 s
Number of objects in object table = 393
Total size of objects in object table in MB: 4632.369149208069