
Conversation

@Dandandan
Contributor

@Dandandan Dandandan commented Jan 1, 2021

This PR removes the slowdown that grows with the number of left/right batches in the hash join by concatenating all left batches into a single batch, so we can index directly into the array (with take) instead of building an intermediate structure / iterating over all the left arrays. This also simplifies the code, which can enable further speed improvements and reduced memory usage.

The benchmark results look very promising, mostly removing the overhead of small batches and of big datasets with many left-side batches. For a very small batch size of 64, on TPC-H query 12:

This PR:

Query 12 iteration 0 took 1596.1 ms
Query 12 iteration 1 took 1602.6 ms
Query 12 iteration 2 took 1594.3 ms
Query 12 iteration 3 took 1610.6 ms
Query 12 iteration 4 took 1615.9 ms
Query 12 iteration 5 took 1619.8 ms
Query 12 iteration 6 took 1645.5 ms
Query 12 iteration 7 took 1653.7 ms
Query 12 iteration 8 took 1644.0 ms
Query 12 iteration 9 took 1641.9 ms
Query 12 avg time: 1622.43 ms

Master

Query 12 iteration 0 took 10701.6 ms
Query 12 iteration 1 took 10949.0 ms
Query 12 iteration 2 took 11556.2 ms
Query 12 iteration 3 took 11929.1 ms
Query 12 iteration 4 took 11850.4 ms
Query 12 iteration 5 took 11993.6 ms
Query 12 iteration 6 took 12132.9 ms
Query 12 iteration 7 took 12637.8 ms
Query 12 iteration 8 took 12561.6 ms
Query 12 iteration 9 took 12887.9 ms
Query 12 avg time: 11920.01 ms

The PR also includes improvements on queries 1 and 5 (batch size 64k):

This PR:

Query 1 iteration 0 took 571.0 ms
Query 1 iteration 1 took 566.0 ms
Query 1 iteration 2 took 569.0 ms
Query 1 iteration 3 took 565.1 ms
Query 1 iteration 4 took 569.6 ms
Query 1 iteration 5 took 567.7 ms
Query 1 iteration 6 took 567.7 ms
Query 1 iteration 7 took 564.9 ms
Query 1 iteration 8 took 566.2 ms
Query 1 iteration 9 took 567.1 ms
Query 1 avg time: 567.44 ms

Master

Query 1 iteration 0 took 647.8 ms
Query 1 iteration 1 took 632.8 ms
Query 1 iteration 2 took 634.1 ms
Query 1 iteration 3 took 629.0 ms
Query 1 iteration 4 took 631.8 ms
Query 1 iteration 5 took 639.6 ms
Query 1 iteration 6 took 640.1 ms
Query 1 iteration 7 took 628.1 ms
Query 1 iteration 8 took 635.3 ms
Query 1 iteration 9 took 642.9 ms
Query 1 avg time: 636.16 ms

PR

Query 5 iteration 0 took 628.6 ms
Query 5 iteration 1 took 604.9 ms
Query 5 iteration 2 took 619.2 ms
Query 5 iteration 3 took 614.4 ms
Query 5 iteration 4 took 629.2 ms
Query 5 iteration 5 took 617.0 ms
Query 5 iteration 6 took 618.9 ms
Query 5 iteration 7 took 636.1 ms
Query 5 iteration 8 took 639.5 ms
Query 5 iteration 9 took 636.1 ms
Query 5 avg time: 624.39 ms

Master:

Query 5 iteration 0 took 662.2 ms
Query 5 iteration 1 took 715.7 ms
Query 5 iteration 2 took 671.4 ms
Query 5 iteration 3 took 664.9 ms
Query 5 iteration 4 took 645.6 ms
Query 5 iteration 5 took 655.9 ms
Query 5 iteration 6 took 657.7 ms
Query 5 iteration 7 took 634.3 ms
Query 5 iteration 8 took 658.3 ms
Query 5 iteration 9 took 665.6 ms
Query 5 avg time: 663.15 ms

@Dandandan Dandandan changed the title ARROW-11030: [Rust][DataFusion] [WIP] Concat to single batch ARROW-11030: [Rust][DataFusion] [WIP] Concat to single batch in HashJoinExec Jan 1, 2021
@codecov-io

codecov-io commented Jan 1, 2021

Codecov Report

Merging #9070 (33b96f3) into master (5228ede) will increase coverage by 0.01%.
The diff coverage is 85.71%.


@@            Coverage Diff             @@
##           master    #9070      +/-   ##
==========================================
+ Coverage   82.59%   82.60%   +0.01%     
==========================================
  Files         204      204              
  Lines       50169    50197      +28     
==========================================
+ Hits        41436    41465      +29     
+ Misses       8733     8732       -1     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/hash_join.rs 86.08% <85.52%> (-0.24%) ⬇️
...t/datafusion/src/physical_plan/coalesce_batches.rs 88.23% <100.00%> (ø)
rust/arrow/src/array/array_string.rs 89.94% <0.00%> (-0.22%) ⬇️
rust/arrow/src/array/transform/mod.rs 89.01% <0.00%> (+0.72%) ⬆️
rust/arrow/src/ffi.rs 72.22% <0.00%> (+2.22%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Dandandan
Contributor Author

FYI @jorgecarleitao @andygrove

@Dandandan Dandandan changed the title ARROW-11030: [Rust][DataFusion] [WIP] Concat to single batch in HashJoinExec ARROW-11030: [Rust][DataFusion] [WIP] Concatenate left side batches to single batch in HashJoinExec Jan 1, 2021
@andygrove
Member

I'm seeing a 2x improvement in performance for q12 @ SF=100 with batch size 4096 🚀

This should allow us to reduce the default batch size and reduce memory pressure.

@Dandandan Dandandan marked this pull request as ready for review January 3, 2021 04:42
@Dandandan Dandandan changed the title ARROW-11030: [Rust][DataFusion] [WIP] Concatenate left side batches to single batch in HashJoinExec ARROW-11030: [Rust][DataFusion] Concatenate left side batches to single batch in HashJoinExec Jan 3, 2021
use crate::error::{DataFusionError, Result};

use super::{ExecutionPlan, Partitioning, RecordBatchStream, SendableRecordBatchStream};
use crate::physical_plan::coalesce_batches::concat_batches;
Contributor Author

Is there a place to put this?

Member

It should probably go into the arrow crate since it isn't specific to DataFusion.


let mut key = Vec::with_capacity(keys_values.len());

let mut left_indices = UInt64Builder::new(0);
Contributor Author

It seems better for an inner join not to use a builder / allocate a bitmap for nulls? What is currently the most ergonomic way to do this?


for _ in 0..indices.len() {
// on an inner join, left and right indices are present
right_indices.append_value(row as u32)?;
Contributor Author

@Dandandan Dandandan Jan 3, 2021

this could use something like append_n

RecordBatch::try_new(Arc::new(schema.clone()), columns)
}

/// Create a key `Vec<u8>` that is used as key for the hashmap
Contributor Author

Currently it looks like creating this key / hashing is the most expensive part of the queries.

@Dandandan
Contributor Author

Dandandan commented Jan 3, 2021

@andygrove @jorgecarleitao

This is now ready for review.
I added some comments myself, too.

The PR now includes the following:

  • The main change to collect into 1 batch, and use take directly
  • Build info about column indices upfront
  • Speed improvement in create_key (most importantly, avoiding calling array.value(i) twice); this also makes q1 and q5 run a bit faster
  • Misc refactoring

// E.g. [1, 2] -> [(0, 3), (1, 6), (0, 8)] indicates that (column1, column2) = [1, 2] is true
// for rows 3 and 8 from batch 0 and row 6 from batch 1.
type JoinHashMap = HashMap<Vec<u8>, Vec<Index>, RandomState>;
type JoinLeftData = Arc<(JoinHashMap, Vec<RecordBatch>)>;
Member

This change does mean that we can only use this join operator when the data can fit into a single batch. For example, if we are using 32-bit offsets then this limits us to 4GB of string data in a single column.

I don't know if we have any users that care about this scenario or not.

Contributor Author

@Dandandan Dandandan Jan 3, 2021

Yes, but as far as I can tell this only limits the maximum number of elements to 2^32 for each batch/array on the right?
I think the string data per right record batch itself should be able to hold more than 4GB, because the string offsets are stored separately, or am I wrong about that?
If this becomes a problem in the future, we can change it to use u64 for the right indices as well, I think without much extra cost besides doubled allocation/memory usage.

Member

The Arrow spec supports both 32-bit and 64-bit variants for the offset vector, so in the 32-bit case, each batch would be limited to 2^32 bytes of variable-width data per array.

Contributor Author

@Dandandan Dandandan Jan 3, 2021

What I meant is that the type of the index used in the hash join has no effect on the type used for the string array offsets. So the offset values inside the string array might be 64-bit, while here we use 32-bit to index into the string elements, which would allow more than 4GB. Anyway, at some point there is a limit we should keep in mind 👍

Contributor Author

Also FYI, for the left side we are now using 64-bit indices, so the 2^32 limit currently should only apply to the number of elements per right batch, which seems reasonable to me (and that constraint would be easy to get rid of). On the other hand, a future way to save some memory could be to choose 32-bit indices for the left side whenever there are fewer than 2^32 items in the build side.

Contributor Author

Ah I see what you mean now...

Contributor Author

@Dandandan Dandandan Jan 3, 2021

That would be a bit less trivial to get rid of than I thought, but at some point you could make a case either to convert those string arrays to use 64-bit offsets or to split the batches at the 2^32 boundary, or something like that.

Contributor Author

Thanks for the explanation @andygrove 👍

Contributor

FWIW I would love to see a join approach for larger datasets that doesn't require giant buffers (e.g. 64-bit indexes) but a runtime switch over to using sort/merge join or its equivalent. But that is a pipe dream for me at the moment.

Member

@andygrove andygrove left a comment

This LGTM and I re-tested and still see a ~2x speedup for TPC-H q12 @ SF=100 with batch size 4096.

I added a comment noting the restriction when using 32-bit offsets. I think it would be reasonable to address this in the future if it becomes an issue for anyone.

@jorgecarleitao
Member

@andygrove, isn't this going in the direction we were discussing in https://issues.apache.org/jira/browse/ARROW-11058 with respect to the current lack of benefits in having multiple batches (though you just mentioned a benefit: indexing size)?

@andygrove
Member

andygrove commented Jan 3, 2021

Yes @jorgecarleitao reducing the number of batches makes sense here. I am just slightly concerned that always reducing to a single batch may be introducing a constraint that could bite us in the future, although this is realistically only an issue when using 32-bit offsets for variable-width arrays.

@Dandandan
Contributor Author

@jorgecarleitao @andygrove
Do you think this design is good enough for now?
For now it means anyone storing > 4GB of variable-size data in a build-side column should opt to use 64-bit offsets; in the future, one option would be to automatically detect this and apply a cast when needed. I think that would have the fewest performance implications.

@andygrove
Member

@Dandandan Yes, I personally think this is fine for now (hence the approval) and since no-one has objected I think we can go ahead and merge this.

Member

@jorgecarleitao jorgecarleitao left a comment

:shipit:

@Dandandan
Contributor Author

FYI I am planning to work on the first changes for https://issues.apache.org/jira/browse/ARROW-11112 based on this PR/branch later this week, to start getting rid of the hash key creation and to prepare the algorithm for a vectorized implementation.

Contributor

@alamb alamb left a comment

Thanks @Dandandan !
