
[FEAT] Add left, right, and outer joins #2166

Merged
merged 14 commits into from
May 7, 2024

Conversation

kevinzwang
Member

Sort-merge joins are not supported yet for these join types, but broadcast and hash joins work fine.
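
For readers unfamiliar with the three join types, the sketch below shows their row-matching semantics with a plain-Python hash join. It is illustrative only and is not Daft's implementation: `join_indices` and `take` are hypothetical helpers, and treating null keys as never matching is an assumption borrowed from SQL.

```python
from collections import defaultdict


def join_indices(left_keys, right_keys, how="inner"):
    """Return parallel lists (lidx, ridx) of matching row indices.

    A None entry marks a row that has no partner on that side.
    """
    if how not in ("inner", "left", "right", "outer"):
        raise ValueError(f"unsupported join type: {how}")

    # Build phase: map each right-side key to the row indices holding it.
    table = defaultdict(list)
    for j, key in enumerate(right_keys):
        if key is not None:  # assumption: null keys never match
            table[key].append(j)

    lidx, ridx = [], []
    matched_right = set()

    # Probe phase: look up every left-side key in the hash table.
    for i, key in enumerate(left_keys):
        matches = table.get(key, []) if key is not None else []
        for j in matches:
            lidx.append(i)
            ridx.append(j)
            matched_right.add(j)
        # Left and outer joins keep left rows that found no partner.
        if not matches and how in ("left", "outer"):
            lidx.append(i)
            ridx.append(None)

    # Right and outer joins keep right rows that were never matched.
    if how in ("right", "outer"):
        for j in range(len(right_keys)):
            if j not in matched_right:
                lidx.append(None)
                ridx.append(j)

    return lidx, ridx


def take(column, idx):
    """Gather values from a column by index, emitting None for missing rows."""
    return [None if i is None else column[i] for i in idx]


left = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
right = {"id": [2, 3, 4], "score": [20, 30, 40]}

for how in ("inner", "left", "right", "outer"):
    lidx, ridx = join_indices(left["id"], right["id"], how)
    print(how, take(left["name"], lidx), take(right["score"], ridx))
```

With these inputs, the outer join keeps all four distinct keys: the row for `id` 1 has no score, and the row for `id` 4 has no name.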

@github-actions github-actions bot added the enhancement New feature or request label Apr 21, 2024
@kevinzwang kevinzwang linked an issue Apr 21, 2024 that may be closed by this pull request

codecov bot commented Apr 21, 2024

Codecov Report

Attention: Patch coverage is 85.71429%, with 1 line in your changes missing coverage. Please review.

❗ No coverage uploaded for pull request base (main@e47b48a).

❗ Current head 239743b differs from pull request most recent head 733b9da. Consider uploading reports for the commit 733b9da to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2166   +/-   ##
=======================================
  Coverage        ?   85.71%           
=======================================
  Files           ?       71           
  Lines           ?     7656           
  Branches        ?        0           
=======================================
  Hits            ?     6562           
  Misses          ?     1094           
  Partials        ?        0           
Files                           Coverage Δ
daft/dataframe/dataframe.py     90.38% <100.00%> (ø)
daft/logical/builder.py         92.96% <100.00%> (ø)
daft/table/micropartition.py    91.07% <ø> (ø)
daft/table/table.py             57.23% <0.00%> (ø)

Contributor

@clarkzinzow clarkzinzow left a comment

LGTM overall, but I'd suggest that @samster25 give it a once-over as the original hash join implementer!

daft/dataframe/dataframe.py (outdated)
src/daft-micropartition/src/ops/join.rs (outdated)
src/daft-table/src/ops/joins/hash_join.rs (outdated)
src/daft-table/src/ops/joins/hash_join.rs (outdated)
tests/dataframe/test_joins.py
src/daft-table/src/ops/joins/hash_join.rs
src/daft-table/src/ops/joins/mod.rs
src/daft-micropartition/src/ops/join.rs (outdated)
src/daft-table/src/ops/joins/hash_join.rs (outdated)
src/daft-table/src/ops/joins/hash_join.rs (outdated)
src/daft-table/src/ops/joins/mod.rs
@kevinzwang
Member Author

Benchmarks before:

============================================================================================================= test session starts ==============================================================================================================
platform darwin -- Python 3.9.6, pytest-7.4.3, pluggy-1.4.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/kevin/Desktop/Daft
configfile: pyproject.toml
plugins: cov-4.1.0, benchmark-4.0.0, lazy-fixture-0.6.3, hypothesis-6.79.2
collected 14 items                                                                                                                                                                                                                             

tests/benchmarks/test_join.py ..............                                                                                                                                                                                             [100%]
Saved benchmark data in: /Users/kevin/Desktop/Daft/.benchmarks/Darwin-CPython-3.9-64bit/0005_c4928f83013285afffed670ea5922f5bc6543bc1_20240501_225028.json



--------------------------------------------------------------------------------------------------------- benchmark 'joins': 14 tests ---------------------------------------------------------------------------------------------------------
Name (time in us)                                            Min                    Max                   Mean                StdDev                 Median                   IQR            Outliers         OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_broadcast_join[1part-right_bigger]                 586.7500 (1.0)         898.4170 (1.0)         629.7276 (1.0)         24.6707 (1.0)         623.9160 (1.0)         23.9685 (1.0)        197;50  1,587.9881 (1.0)        1389           1
test_broadcast_join[1part-left_bigger]                  738.2090 (1.26)        984.0420 (1.10)        791.5683 (1.26)        42.6634 (1.73)        777.5415 (1.25)        48.8330 (2.04)       213;47  1,263.3149 (0.80)        998           1
test_join_simple[10_000/1]                              969.8330 (1.65)      1,445.2080 (1.61)      1,023.6868 (1.63)        56.7362 (2.30)      1,006.9160 (1.61)        42.7088 (1.78)         15;9    976.8613 (0.62)        133           1
test_multicolumn_joins[num_columns:1-10_000/1]        1,013.0000 (1.73)      3,787.4590 (4.22)      1,082.8195 (1.72)       141.4162 (5.73)      1,068.5420 (1.71)        31.1250 (1.30)        16;57    923.5150 (0.58)        862           1
test_broadcast_join[10part-right_bigger]              1,122.5000 (1.91)      1,534.1250 (1.71)      1,209.2139 (1.92)        35.5787 (1.44)      1,205.8750 (1.93)        40.7510 (1.70)       192;17    826.9836 (0.52)        814           1
test_broadcast_join[10part-left_bigger]               1,128.2090 (1.92)      1,543.5000 (1.72)      1,209.3842 (1.92)        39.9587 (1.62)      1,206.0000 (1.93)        43.3648 (1.81)       178;26    826.8671 (0.52)        731           1
test_join_largekey[10_000/1]                          1,146.7080 (1.95)      5,135.6670 (5.72)      1,218.6918 (1.94)       199.7638 (8.10)      1,189.4580 (1.91)        45.0110 (1.88)        13;48    820.5520 (0.52)        665           1
test_multicolumn_joins[num_columns:4-10_000/1]        1,302.3330 (2.22)      4,416.2080 (4.92)      1,367.3827 (2.17)       138.2518 (5.60)      1,353.8960 (2.17)        34.8330 (1.45)        14;26    731.3241 (0.46)        590           1
test_join_withdata[10_000/1]                          4,158.2080 (7.09)      6,842.8330 (7.62)      4,587.5659 (7.28)       335.2298 (13.59)     4,496.1250 (7.21)       311.8130 (13.01)       37;13    217.9805 (0.14)        179           1
test_multicolumn_joins[num_columns:4-10_000/100]     33,388.1250 (56.90)    42,726.1250 (47.56)    40,095.6875 (63.67)    1,808.1958 (73.29)    40,259.1045 (64.53)    1,430.3125 (59.67)         5;1     24.9403 (0.02)         24           1
test_join_simple[10_000/100]                         35,137.7080 (59.89)    39,073.6250 (43.49)    36,277.6555 (57.61)      821.3016 (33.29)    36,139.3750 (57.92)      684.7500 (28.57)         5;3     27.5652 (0.02)         26           1
test_multicolumn_joins[num_columns:1-10_000/100]     36,375.6670 (62.00)    39,267.8340 (43.71)    38,187.6001 (60.64)      814.3948 (33.01)    38,205.0420 (61.23)    1,168.6565 (48.76)         9;0     26.1865 (0.02)         25           1
test_join_largekey[10_000/100]                       36,877.6670 (62.85)    58,845.7500 (65.50)    39,968.5866 (63.47)    4,081.1704 (165.43)   39,052.4375 (62.59)    1,618.2090 (67.51)         1;2     25.0196 (0.02)         26           1
test_join_withdata[10_000/100]                       84,291.0420 (143.66)   96,722.1670 (107.66)   88,027.0035 (139.79)   3,734.7850 (151.39)   87,173.0000 (139.72)   5,305.9380 (221.37)        3;0     11.3602 (0.01)         12           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
============================================================================================================= 14 passed in 14.29s ==============================================================================================================

Benchmarks after:

============================================================================================================== test session starts ==============================================================================================================
platform darwin -- Python 3.9.6, pytest-7.4.3, pluggy-1.4.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/kevin/Desktop/Daft
configfile: pyproject.toml
plugins: cov-4.1.0, benchmark-4.0.0, lazy-fixture-0.6.3, hypothesis-6.79.2
collected 14 items                                                                                                                                                                                                                              

tests/benchmarks/test_join.py ..............                                                                                                                                                                                              [100%]
Saved benchmark data in: /Users/kevin/Desktop/Daft/.benchmarks/Darwin-CPython-3.9-64bit/0002_6f64257c9eba6698b021d8931d34d10b565b2f96_20240501_223901.json



--------------------------------------------------------------------------------------------------------- benchmark 'joins': 14 tests ---------------------------------------------------------------------------------------------------------
Name (time in us)                                            Min                    Max                   Mean                StdDev                 Median                   IQR            Outliers         OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_broadcast_join[1part-right_bigger]                 576.6670 (1.0)         896.4590 (1.06)        621.0565 (1.00)        25.4630 (1.26)        614.6250 (1.00)        25.8847 (1.10)       277;55  1,610.1596 (1.00)       1553           1
test_broadcast_join[1part-left_bigger]                  587.1250 (1.02)        845.7090 (1.0)         619.1617 (1.0)         20.1897 (1.0)         614.3750 (1.0)         23.5942 (1.0)        294;30  1,615.0869 (1.0)        1273           1
test_join_simple[10_000/1]                              947.1250 (1.64)      1,482.8750 (1.75)      1,010.5261 (1.63)        70.7117 (3.50)        991.9170 (1.61)        50.7500 (2.15)          6;6    989.5835 (0.61)         86           1
test_multicolumn_joins[num_columns:1-10_000/1]          996.4580 (1.73)      3,720.8340 (4.40)      1,062.2019 (1.72)        99.0307 (4.91)      1,053.7910 (1.72)        37.7185 (1.60)        18;29    941.4406 (0.58)        913           1
test_broadcast_join[10part-left_bigger]               1,136.0410 (1.97)      4,444.3340 (5.26)      1,211.0799 (1.96)       125.2719 (6.20)      1,203.8330 (1.96)        39.9583 (1.69)         5;14    825.7094 (0.51)        719           1
test_broadcast_join[10part-right_bigger]              1,137.4160 (1.97)      5,191.1670 (6.14)      1,214.2293 (1.96)       147.0252 (7.28)      1,201.7500 (1.96)        40.8965 (1.73)        13;35    823.5677 (0.51)        809           1
test_join_largekey[10_000/1]                          1,143.9160 (1.98)      1,879.5000 (2.22)      1,200.7696 (1.94)        59.6644 (2.96)      1,188.5830 (1.93)        28.2910 (1.20)        24;32    832.7993 (0.52)        522           1
test_multicolumn_joins[num_columns:4-10_000/1]        1,296.0000 (2.25)      1,724.4170 (2.04)      1,364.0479 (2.20)        49.2095 (2.44)      1,353.7920 (2.20)        43.9165 (1.86)        87;31    733.1121 (0.45)        635           1
test_join_withdata[10_000/1]                          4,893.0000 (8.48)      6,991.7500 (8.27)      5,186.8806 (8.38)       273.6370 (13.55)     5,135.7080 (8.36)       152.5010 (6.46)          9;8    192.7941 (0.12)        134           1
test_join_simple[10_000/100]                         30,463.1660 (52.83)    34,003.9590 (40.21)    31,064.2670 (50.17)      616.1143 (30.52)    30,918.4790 (50.33)      227.4795 (9.64)          3;5     32.1913 (0.02)         32           1
test_multicolumn_joins[num_columns:1-10_000/100]     32,326.5830 (56.06)    35,734.4170 (42.25)    32,910.8133 (53.15)      603.6944 (29.90)    32,819.9170 (53.42)      371.1557 (15.73)         1;1     30.3852 (0.02)         29           1
test_join_largekey[10_000/100]                       32,777.9580 (56.84)    55,891.0830 (66.09)    33,899.7499 (54.75)    4,161.2077 (206.11)   33,078.2505 (53.84)      359.4170 (15.23)         1;1     29.4987 (0.02)         30           1
test_multicolumn_joins[num_columns:4-10_000/100]     33,097.0420 (57.39)    40,802.7910 (48.25)    34,758.1919 (56.14)    1,694.3839 (83.92)    34,274.4585 (55.79)    1,072.4380 (45.45)         2;2     28.7702 (0.02)         28           1
test_join_withdata[10_000/100]                       71,503.7090 (123.99)   77,277.2500 (91.38)    74,892.5804 (120.96)   1,729.8860 (85.68)    74,773.5210 (121.71)   3,010.9590 (127.61)        5;0     13.3525 (0.01)         14           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
============================================================================================================== 14 passed in 14.11s ==============================================================================================================
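
To compare the two runs, the mean times can be turned into a percent change, and the OPS column can be reproduced from the legend's formula (1 / Mean). The sketch below copies a few means from the two tables above. The helper names are hypothetical, and since these are single local runs, small differences may be noise.

```python
# Mean times in microseconds as (before, after), copied from the
# pytest-benchmark tables above.
MEAN_US = {
    "test_broadcast_join[1part-left_bigger]": (791.5683, 619.1617),
    "test_join_withdata[10_000/1]": (4587.5659, 5186.8806),
    "test_join_simple[10_000/100]": (36277.6555, 31064.2670),
    "test_join_withdata[10_000/100]": (88027.0035, 74892.5804),
}


def ops_per_second(mean_us):
    """OPS as defined in the legend: 1 / Mean, with the mean in microseconds."""
    return 1e6 / mean_us


def percent_change(before, after):
    """Relative change in mean time; negative values mean the run got faster."""
    return (after - before) / before * 100.0


for name, (before, after) in MEAN_US.items():
    change = percent_change(before, after)
    print(f"{name}: {change:+.1f}% mean time, {ops_per_second(after):.1f} ops/s after")
```

For these rows, the mean time drops by roughly 22%, 14%, and 15% for `test_broadcast_join[1part-left_bigger]`, `test_join_simple[10_000/100]`, and `test_join_withdata[10_000/100]`, while `test_join_withdata[10_000/1]` gets about 13% slower.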

@samster25
Member

👀

src/daft-micropartition/src/ops/join.rs
src/daft-table/src/ops/joins/hash_join.rs (outdated)
src/daft-table/src/ops/joins/hash_join.rs (outdated)
src/daft-table/src/ops/joins/hash_join.rs (outdated)
src/daft-table/src/ops/joins/hash_join.rs (outdated)
src/daft-table/src/ops/joins/hash_join.rs (outdated)
fn add_non_join_key_columns(
    left: &Table,
    right: &Table,
    lidx: Series,
Member

Why not take these as &Series?

Member Author

@kevinzwang kevinzwang May 7, 2024

I move lidx so that I can drop it early to save memory. ridx doesn't have to be moved, but since it won't be used afterwards, I figured I would just move it as well.

Member

ah makes sense

src/daft-table/src/ops/joins/mod.rs (outdated)
src/daft-table/src/ops/joins/mod.rs (outdated)
src/daft-table/src/ops/joins/mod.rs (outdated)
@kevinzwang kevinzwang requested a review from samster25 May 7, 2024 21:21
Member

@samster25 samster25 left a comment

Looks great!

@kevinzwang kevinzwang enabled auto-merge (squash) May 7, 2024 22:01
@kevinzwang kevinzwang merged commit 89e3916 into main May 7, 2024
27 checks passed
@kevinzwang kevinzwang deleted the kevin/more-joins branch May 7, 2024 22:10
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Left/Right/Outer joins
3 participants