Optimize aggregation #464

Merged: 10 commits merged on Dec 4, 2018

Conversation

@paisible-wanderer (Contributor) commented on Jul 18, 2018

Hello,

As a continuation of #454, I have made some improvements that likely impact all the aggregation-related code (by making things "lazy").
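
To illustrate what "lazy" means here, a minimal pure-Ruby sketch (illustrative only, not the actual code of this PR; all names below are made up): grouping first records only which row positions belong to each group, and per-group values are only touched when an aggregation is actually computed.

# Illustrative sketch of the "lazy" grouping idea, not Daru's implementation.
rows = [
  { group_nb: 1, value: 0.3 },
  { group_nb: 2, value: 0.7 },
  { group_nb: 1, value: 0.5 },
]

# Cheap step: remember row indices per group; no per-group structures are built yet.
index_by_group = Hash.new { |h, k| h[k] = [] }
rows.each_with_index { |row, i| index_by_group[row[:group_nb]] << i }

# Per-group work happens only here, when the aggregation actually runs.
sums = index_by_group.transform_values do |positions|
  positions.sum { |i| rows[i][:value] }
end

p sums # => {1=>0.8, 2=>0.7}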

Below are some aggregation benchmarks for different ratios of rows to number of groups. (Summary: some operations are ~10 times faster.)

require 'daru'

def gen_df(n_grs, nrows=3000)
  Daru::DataFrame.new({
    group_nb: (1..nrows).map { (1..n_grs).to_a.sample },
    value:    (1..nrows).map { rand },
  })
end

df_s = gen_df   20 # dataframe with a few groups (each having about 150 elements)
df_m = gen_df 2000
df_l = gen_df 4000 # almost every group has one row: lot of sub-vectors creation

# l for lambda
lsize = ->(sub_df) { sub_df[:value].size }
lsum  = ->(sub_df) { sub_df[:value].sum }

require 'benchmark'


# 1/ Benchmark comparing operation over vector and operation with lambda (over sub-dataframes)
# r is for "reference"
Benchmark.bm(8) do |b|
  b.report("small r")  { 100.times { df_s.group_by(:group_nb).aggregate({group_nb: :size, value: :sum}) } }
  b.report("medium r") { 100.times { df_m.group_by(:group_nb).aggregate({group_nb: :size, value: :sum}) } }
  b.report("large r")  { 100.times { df_l.group_by(:group_nb).aggregate({group_nb: :size, value: :sum}) } }

  b.report("small")    { 100.times { df_s.group_by(:group_nb).aggregate({a: lsize, b: lsum}) } }
  b.report("medium")   { 100.times { df_m.group_by(:group_nb).aggregate({a: lsize, b: lsum}) } }
  b.report("large")    { 100.times { df_l.group_by(:group_nb).aggregate({a: lsize, b: lsum}) } }
end

# 2/ Benchmark to see if we could use aggregation instead of the functions in group_by
# r is for "reference"
Benchmark.bm(8) do |b|
  b.report("small r")  { 300.times { df_s.group_by(:group_nb).sum } }
  b.report("small")    { 300.times { df_s.group_by(:group_nb).aggregate({value: :sum}) } }

  b.report("medium r") { 300.times { df_m.group_by(:group_nb).sum } }
  b.report("medium")   { 300.times { df_m.group_by(:group_nb).aggregate({value: :sum}) } }

  b.report("large r")  { 300.times { df_l.group_by(:group_nb).sum } }
  b.report("large")    { 300.times { df_l.group_by(:group_nb).aggregate({value: :sum}) } }
end


## before
# 1/
#                user     system      total        real
# small r   34.150000   0.040000  34.190000 ( 34.201153)
# medium r  67.450000   0.010000  67.460000 ( 67.408840)
# large r   64.300000   0.000000  64.300000 ( 64.294158)
# small     36.600000   0.090000  36.690000 ( 36.677260)
# medium    94.140000   0.000000  94.140000 ( 94.129418)
# large     95.100000   0.000000  95.100000 ( 95.096091)

# 2/
#                user     system      total        real
# small r   19.460000   0.000000  19.460000 ( 19.464180)
# small     38.450000   0.150000  38.600000 ( 38.599517)
# medium r  30.440000   0.000000  30.440000 ( 30.432033)
# medium    84.610000   0.000000  84.610000 ( 84.609521)
# large r   33.190000   0.000000  33.190000 ( 33.188716)
# large     96.660000   0.000000  96.660000 ( 96.658534)


## intermediate stats (at commit "Avoids intermediary dataframe creation in Daru::Core::GroupBy#aggregate")
# 1/
#                user     system      total        real
# small r    3.710000   0.000000   3.710000 (  3.710371)
# medium r  12.250000   0.000000  12.250000 ( 12.256285)
# large r   16.440000   0.000000  16.440000 ( 16.435032)
# small      5.280000   0.000000   5.280000 (  5.279285)
# medium    37.520000   0.010000  37.530000 ( 37.522483)
# large     48.070000   0.000000  48.070000 ( 48.067751)

# 2/
#                user     system      total        real
# small r   10.630000   0.000000  10.630000 ( 10.625822)
# small      9.980000   0.000000   9.980000 (  9.984697)
# medium r  19.790000   0.000000  19.790000 ( 19.789653)
# medium    31.110000   0.000000  31.110000 ( 31.102751)
# large r   20.050000   0.000000  20.050000 ( 20.049377)
# large     41.930000   0.000000  41.930000 ( 41.924821)


## stats at final commit
# 1/
#                user     system      total        real
# small r    2.200000   0.000000   2.200000 (  2.198592)
# medium r   6.540000   0.000000   6.540000 (  6.549165)
# large r    8.020000   0.000000   8.020000 (  8.017494)
# small      2.720000   0.000000   2.720000 (  2.717410)
# medium    13.550000   0.000000  13.550000 ( 13.548042)
# large     14.470000   0.000000  14.470000 ( 14.472992)

# 2/
#                user     system      total        real
# small r    6.150000   0.000000   6.150000 (  6.148050)
# small      5.440000   0.000000   5.440000 (  5.433489)
# medium r  12.580000   0.000000  12.580000 ( 12.578441)
# medium    11.770000   0.000000  11.770000 ( 11.766527)
# large r   13.430000   0.000000  13.430000 ( 13.435111)
# large     13.970000   0.010000  13.980000 ( 13.973275)

=========================

Please let me know if there is anything I can improve.

@paisible-wanderer (Contributor, Author) commented:

@zverok & @v0dro: ping!

v0dro (Member) commented on Dec 4, 2018

@paisible-wanderer I'm extremely sorry for this huge delay. I'll look into this PR ASAP. Just give me a day more since the changes are many.

v0dro (Member) commented on Dec 4, 2018

OK, I just went through your changes and they're pretty amazing. I'm merging.
Thank you very much for your changes, and I hope to see more of these again soon :)

And sorry again for the delay. The PR just went past my checks for some reason.

v0dro merged commit 984ff72 into SciRuby:master on Dec 4, 2018
@paisible-wanderer (Contributor, Author) commented:

@v0dro thanks!
