Poor performance #1631

Closed
Tracked by #721
qiranq99 opened this issue Dec 5, 2023 · 8 comments · Fixed by #1646
Labels: component:python, performance, priority:high

Comments

@qiranq99

qiranq99 commented Dec 5, 2023

Hi,

In several benchmarks that put data objects into the shared store (from 1KB to several GBs), we observed that vineyard underperforms ray (plasma), taking 2x-5x more time.

The gap grows as the data objects get larger. Are there any specific reasons or sources of overhead?

@sighingnow
Member

Which kinds of objects are you working with? Could you share the dtypes/schema if you are working with pandas dataframes/tables?

There is a known performance degradation if your pandas dataframe has string columns.

@sighingnow sighingnow added performance Issues that related to the performance of vineyardd and vineyard SDKs. component:python Issues about vineyard's python SDK priority:high labels Dec 5, 2023
@qiranq99
Author

qiranq99 commented Dec 5, 2023

Hi @sighingnow,

We mainly tested numpy arrays and torch.tensor. The performance is unsatisfactory in both cases, for small and large data objects alike.

You could try putting a np.random.rand(512,1024,1) array (4MB) and compare vineyard.put() with ray.put(). On our machine, vineyard reaches barely 1/4 of ray's performance.
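A comparison like this can be timed with a small harness. The sketch below is only an assumption about how such a benchmark might be structured, not the script used in this thread: `time_put` and `fake_put` are hypothetical names, and the harness accepts any callable so that the `put` function of either library could be passed in.

```python
import statistics
import time


def time_put(put_fn, payload, rounds=20, warmup=3):
    """Return the median wall-clock time (seconds) of put_fn(payload)."""
    # Warm-up calls are not timed, so one-off setup cost is excluded.
    for _ in range(warmup):
        put_fn(payload)
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        put_fn(payload)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_put(payload):
    """Hypothetical stand-in for an object store: copies the payload once."""
    return bytes(payload)


if __name__ == "__main__":
    # 512 * 1024 * 1 float64 values occupy 4 MiB, the size quoted in the report.
    payload = bytearray(512 * 1024 * 1 * 8)
    print(f"fake_put: {time_put(fake_put, payload) * 1e6:.1f} us")
```

With both libraries installed, one would presumably replace `fake_put` with `ray.put` and with the `put` method of a connected vineyard client, and use the `np.random.rand(512, 1024, 1)` array as the payload. The median is used here because single slow rounds can otherwise dominate the result.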

@sighingnow
Member

Thanks for the information. We'll try to reproduce the result.

@qiranq99
Author

qiranq99 commented Dec 5, 2023

To clarify, the measurements mentioned above are not valid in most cases. The amended benchmarks show a 1.2x to 1.5x performance degradation compared with ray, especially for large data objects (several hundred MBs or several GBs).

@qiranq99
Author

qiranq99 commented Dec 7, 2023

The underlying reasons for the observed performance gap are:

  1. plasma and ray use a thread pool within a single process, while vineyard seems to use a single thread;
  2. if we scale num_workers, i.e., the number of processes, the performance gap is eliminated.

@sighingnow Please verify the first statement. If it is correct, this issue can be closed.

@dashanji
Member

dashanji commented Dec 8, 2023

> You could try putting a np.random.rand(512,1024,1) array (4MB) and compare vineyard.put() with ray.put(). On our machine, vineyard reaches barely 1/4 of ray's performance.

I have tested the code and done some profiling: the time in vineyard.put is mainly spent on memcpy (copying to the shared memory). My guess is that the gap comes from ray using Cython while vineyard uses pybind11; see pybind/pybind11#1227.

@sighingnow
Member

sighingnow commented Dec 13, 2023

Hi @qiranq99,

Actually, the performance gap has nothing to do with the Cython/pybind11 calls. The gap exists because plasma internally uses multiple threads for concurrent memcpy (6 by default, see also: https://github.com/apache/arrow/blob/apache-arrow-11.0.0/python/pyarrow/_plasma.pyx#L532), while vineyard uses a single thread for memcpy.

After enabling concurrent memcpy, vineyard achieves even higher throughput than plasma at the same level of parallelism when putting numpy ndarrays:

---------------------------------------------------------------------------------------------------------------- benchmark: 10 tests ----------------------------------------------------------------------------------------------------------------
Name (time in us)                                            Min                     Max                    Mean                 StdDev                  Median                    IQR            Outliers          OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_bench_numpy_ndarray[plasma_client-256]              77.8432 (1.0)          141.5438 (1.0)           87.8491 (1.0)           7.3290 (1.0)           85.9904 (1.0)           5.7258 (1.0)        165;99  11,383.1537 (1.0)        1256           1
test_bench_numpy_ndarray[plasma_client-256KB]           110.6099 (1.42)       1,562.4296 (11.04)        132.2550 (1.51)         36.4446 (4.97)         124.1728 (1.44)         17.9743 (3.14)      159;228   7,561.1512 (0.66)       2503           1
test_bench_numpy_ndarray[vineyard_client-256]           256.9887 (3.30)       1,944.8148 (13.74)        381.2453 (4.34)         82.1342 (11.21)        386.8388 (4.50)         93.9578 (16.41)      194;11   2,622.9829 (0.23)       1115           1
test_bench_numpy_ndarray[vineyard_client-256KB]         280.0189 (3.60)       3,346.4008 (23.64)        394.1826 (4.49)        102.5968 (14.00)        383.9559 (4.47)         93.7111 (16.37)       109;8   2,536.8954 (0.22)       1365           1
test_bench_numpy_ndarray[vineyard_client-256MB]      10,421.3501 (133.88)    35,830.6342 (253.14)    12,618.7254 (143.64)    4,998.5722 (682.03)    10,794.7108 (125.53)    2,562.6968 (447.57)        1;1      79.2473 (0.01)         25           1
test_bench_numpy_ndarray[plasma_client-256MB]        11,778.7081 (151.31)    14,720.6802 (104.00)    12,230.5463 (139.22)      744.5806 (101.59)    11,988.2962 (139.41)      318.0359 (55.54)         2;2      81.7625 (0.01)         23           1
test_bench_numpy_ndarray[vineyard_client-1GB]        39,629.0324 (509.09)    51,090.4198 (360.95)    46,283.8365 (526.86)    4,987.9626 (680.58)    49,023.0583 (570.10)    9,908.4431 (>1000.0)       3;0      21.6058 (0.00)          9           1
test_bench_numpy_ndarray[plasma_client-1GB]          46,443.7758 (596.63)    58,577.3201 (413.85)    48,716.5607 (554.55)    4,372.6252 (596.62)    47,076.5112 (547.46)      954.7628 (166.75)        1;1      20.5269 (0.00)          7           1
test_bench_numpy_ndarray[vineyard_client-4GB]       149,866.7547 (>1000.0)  153,821.7869 (>1000.0)  152,392.1663 (>1000.0)   1,538.5827 (209.93)   152,710.1910 (>1000.0)   1,839.8198 (321.32)        1;0       6.5620 (0.00)          5           1
test_bench_numpy_ndarray[plasma_client-4GB]         163,773.4137 (>1000.0)  206,139.4518 (>1000.0)  182,380.0197 (>1000.0)  15,748.3002 (>1000.0)  182,238.1048 (>1000.0)  19,385.6989 (>1000.0)       2;0       5.4831 (0.00)          5           1

The benchmark cases and the newly added concurrency control in the Python APIs can be found in #1646.

From the results, you can see that there are indeed improvements over plasma when putting large tensors. For small tensors, the remaining gap exists because there are still opportunities to further improve the dispatch logic of builders and resolvers. Compared with plasma, vineyard unlocks opportunities for more complex objects as well as object impossibilities.
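To put a number on the large-tensor comparison, the mean times from the table can be divided directly. This is a minimal sketch: the figures are copied from the Mean column of the benchmark output above, while the names `MEAN_US` and `plasma_over_vineyard` are hypothetical.

```python
# Mean put times in microseconds, copied from the benchmark table above.
MEAN_US = {
    "256MB": {"vineyard": 12_618.7254, "plasma": 12_230.5463},
    "1GB": {"vineyard": 46_283.8365, "plasma": 48_716.5607},
    "4GB": {"vineyard": 152_392.1663, "plasma": 182_380.0197},
}


def plasma_over_vineyard(means=MEAN_US):
    """Return plasma_mean / vineyard_mean per size; above 1 means vineyard is faster."""
    return {size: row["plasma"] / row["vineyard"] for size, row in means.items()}


if __name__ == "__main__":
    for size, ratio in plasma_over_vineyard().items():
        print(f"{size}: {ratio:.3f}")
```

On the mean column the ratio is slightly below 1 at 256MB and above 1 at 1GB and 4GB, so the clearest gain shows up at the largest size. The Min and Median columns may rank the 256MB case differently, since the vineyard run there has a much larger standard deviation.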

The optimization of builders and resolvers is already on our roadmap (issue #727).

@sighingnow
Member

sighingnow commented Dec 13, 2023

The concurrent memcpy is only enabled for copies >= 4MB, to avoid paying the thread-creation overhead on small copies.
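The chunking logic behind a thresholded concurrent copy can be sketched as follows. This is an illustration only, not the code merged in #1646: `chunked_copy`, `CONCURRENCY` and `THRESHOLD` are hypothetical names, with the values 6 and 4MB taken from the comments in this thread. In CPython the GIL likely prevents these pure-Python threads from copying in parallel, so any real speedup would come from doing the same split in native code.

```python
from concurrent.futures import ThreadPoolExecutor

# Both values come from the thread; the names are hypothetical.
CONCURRENCY = 6              # plasma's default number of memcpy threads
THRESHOLD = 4 * 1024 * 1024  # smaller copies stay single-threaded


def chunked_copy(dst, src, concurrency=CONCURRENCY, threshold=THRESHOLD):
    """Copy src into dst and return the number of chunks that were copied."""
    src_view = memoryview(src).cast("B")
    dst_view = memoryview(dst).cast("B")
    size = len(src_view)
    if len(dst_view) != size:
        raise ValueError("source and destination sizes differ")
    if size < threshold or concurrency <= 1:
        dst_view[:] = src_view  # small copy: not worth starting threads
        return 1

    chunk = -(-size // concurrency)  # ceiling division
    ranges = [(lo, min(lo + chunk, size)) for lo in range(0, size, chunk)]

    def copy_range(bounds):
        lo, hi = bounds
        dst_view[lo:hi] = src_view[lo:hi]  # chunks never overlap

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(copy_range, ranges))  # list() re-raises worker errors
    return len(ranges)
```

For example, `chunked_copy(bytearray(len(data)), data)` falls back to a single copy for any `data` under 4MB and otherwise splits the work into at most six byte ranges.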

sighingnow added a commit that referenced this issue Dec 13, 2023
…#1646)

Remove the problematic `.buffer` property (as it cannot bind the
lifetime of the underlying blob to the memoryview object) and add
concurrent support for memcpy for faster object building.

Fixes #1631

Signed-off-by: Tao He <[email protected]>
@sighingnow sighingnow self-assigned this Dec 19, 2023