
Conversation

@Dandandan
Contributor

@Dandandan Dandandan commented Jan 1, 2021

This PR removes the slowdown that grows with the number of left/right batches in the hash join by concatenating all left batches into a single batch, so we can index directly into the array (with take) instead of building an intermediate structure / iterating over all the left arrays. This also simplifies the code, which can enable further speed improvements and reduced memory usage.

The benchmark results look very promising, mostly removing the overhead of small batches and of big datasets with many left-side batches. For a very small batch size of 64, on TPC-H query 12:

This PR:

Query 12 iteration 0 took 1596.1 ms
Query 12 iteration 1 took 1602.6 ms
Query 12 iteration 2 took 1594.3 ms
Query 12 iteration 3 took 1610.6 ms
Query 12 iteration 4 took 1615.9 ms
Query 12 iteration 5 took 1619.8 ms
Query 12 iteration 6 took 1645.5 ms
Query 12 iteration 7 took 1653.7 ms
Query 12 iteration 8 took 1644.0 ms
Query 12 iteration 9 took 1641.9 ms
Query 12 avg time: 1622.43 ms

Master

Query 12 iteration 0 took 10701.6 ms
Query 12 iteration 1 took 10949.0 ms
Query 12 iteration 2 took 11556.2 ms
Query 12 iteration 3 took 11929.1 ms
Query 12 iteration 4 took 11850.4 ms
Query 12 iteration 5 took 11993.6 ms
Query 12 iteration 6 took 12132.9 ms
Query 12 iteration 7 took 12637.8 ms
Query 12 iteration 8 took 12561.6 ms
Query 12 iteration 9 took 12887.9 ms
Query 12 avg time: 11920.01 ms

The PR also includes improvements on queries 1 and 5 (batch size 64k):

This PR:

Query 1 iteration 0 took 571.0 ms
Query 1 iteration 1 took 566.0 ms
Query 1 iteration 2 took 569.0 ms
Query 1 iteration 3 took 565.1 ms
Query 1 iteration 4 took 569.6 ms
Query 1 iteration 5 took 567.7 ms
Query 1 iteration 6 took 567.7 ms
Query 1 iteration 7 took 564.9 ms
Query 1 iteration 8 took 566.2 ms
Query 1 iteration 9 took 567.1 ms
Query 1 avg time: 567.44 ms

Master

Query 1 iteration 0 took 647.8 ms
Query 1 iteration 1 took 632.8 ms
Query 1 iteration 2 took 634.1 ms
Query 1 iteration 3 took 629.0 ms
Query 1 iteration 4 took 631.8 ms
Query 1 iteration 5 took 639.6 ms
Query 1 iteration 6 took 640.1 ms
Query 1 iteration 7 took 628.1 ms
Query 1 iteration 8 took 635.3 ms
Query 1 iteration 9 took 642.9 ms
Query 1 avg time: 636.16 ms

PR

Query 5 iteration 0 took 628.6 ms
Query 5 iteration 1 took 604.9 ms
Query 5 iteration 2 took 619.2 ms
Query 5 iteration 3 took 614.4 ms
Query 5 iteration 4 took 629.2 ms
Query 5 iteration 5 took 617.0 ms
Query 5 iteration 6 took 618.9 ms
Query 5 iteration 7 took 636.1 ms
Query 5 iteration 8 took 639.5 ms
Query 5 iteration 9 took 636.1 ms
Query 5 avg time: 624.39 ms

Master:

Query 5 iteration 0 took 662.2 ms
Query 5 iteration 1 took 715.7 ms
Query 5 iteration 2 took 671.4 ms
Query 5 iteration 3 took 664.9 ms
Query 5 iteration 4 took 645.6 ms
Query 5 iteration 5 took 655.9 ms
Query 5 iteration 6 took 657.7 ms
Query 5 iteration 7 took 634.3 ms
Query 5 iteration 8 took 658.3 ms
Query 5 iteration 9 took 665.6 ms
Query 5 avg time: 663.15 ms

@Dandandan Dandandan changed the title ARROW-11030: [Rust][DataFusion] [WIP] Concat to single batch ARROW-11030: [Rust][DataFusion] [WIP] Concat to single batch in HashJoinExec Jan 1, 2021
@codecov-io

codecov-io commented Jan 1, 2021

Codecov Report

Merging #9070 (33b96f3) into master (5228ede) will increase coverage by 0.01%.
The diff coverage is 85.71%.


@@            Coverage Diff             @@
##           master    #9070      +/-   ##
==========================================
+ Coverage   82.59%   82.60%   +0.01%     
==========================================
  Files         204      204              
  Lines       50169    50197      +28     
==========================================
+ Hits        41436    41465      +29     
+ Misses       8733     8732       -1     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/hash_join.rs 86.08% <85.52%> (-0.24%) ⬇️
...t/datafusion/src/physical_plan/coalesce_batches.rs 88.23% <100.00%> (ø)
rust/arrow/src/array/array_string.rs 89.94% <0.00%> (-0.22%) ⬇️
rust/arrow/src/array/transform/mod.rs 89.01% <0.00%> (+0.72%) ⬆️
rust/arrow/src/ffi.rs 72.22% <0.00%> (+2.22%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Dandandan
Contributor Author

FYI @jorgecarleitao @andygrove

@Dandandan Dandandan changed the title ARROW-11030: [Rust][DataFusion] [WIP] Concat to single batch in HashJoinExec ARROW-11030: [Rust][DataFusion] [WIP] Concatenate left side batches to single batch in HashJoinExec Jan 1, 2021
@andygrove
Member

I'm seeing a 2x improvement in performance for q12 @ SF=100 with batch size 4096 🚀

This should allow us to reduce the default batch size and reduce memory pressure.

@Dandandan Dandandan marked this pull request as ready for review January 3, 2021 04:42
@Dandandan Dandandan changed the title ARROW-11030: [Rust][DataFusion] [WIP] Concatenate left side batches to single batch in HashJoinExec ARROW-11030: [Rust][DataFusion] Concatenate left side batches to single batch in HashJoinExec Jan 3, 2021
use crate::error::{DataFusionError, Result};

use super::{ExecutionPlan, Partitioning, RecordBatchStream, SendableRecordBatchStream};
use crate::physical_plan::coalesce_batches::concat_batches;
Contributor Author

Is there a place to put this?

Member

It should probably go into the arrow crate since it isn't specific to DataFusion.


let mut key = Vec::with_capacity(keys_values.len());

let mut left_indices = UInt64Builder::new(0);
Contributor Author

It seems better for an inner join not to use a builder / allocate a bitmap for nulls? What is currently the most ergonomic way to do this?


for _ in 0..indices.len() {
// on an inner join, left and right indices are present
right_indices.append_value(row as u32)?;
Contributor Author

@Dandandan Dandandan Jan 3, 2021

this could use something like append_n

RecordBatch::try_new(Arc::new(schema.clone()), columns)
}

/// Create a key `Vec<u8>` that is used as key for the hashmap
Contributor Author

Currently it looks like creating this key / hashing is the most expensive part of the queries.

@Dandandan
Contributor Author

Dandandan commented Jan 3, 2021

@andygrove @jorgecarleitao

This is now ready for review.
I added some comments myself, too.

The PR now includes the following:

  • The main change to collect into 1 batch, and use take directly
  • Build info about column indices upfront
  • Speed improvement in create_key (most importantly, avoiding calling array.value(i) twice); this also makes q1 and q5 run a bit faster
  • Misc refactoring

// E.g. [1, 2] -> [(0, 3), (1, 6), (0, 8)] indicates that (column1, column2) = [1, 2] is true
// for rows 3 and 8 from batch 0 and row 6 from batch 1.
type JoinHashMap = HashMap<Vec<u8>, Vec<Index>, RandomState>;
type JoinLeftData = Arc<(JoinHashMap, Vec<RecordBatch>)>;
Member

This change does mean that we can only use this join operator when the data can fit into a single batch. For example, if we are using 32-bit offsets then this limits us to 4GB of string data in a single column.

I don't know if we have any users that care about this scenario or not.

Contributor Author

@Dandandan Dandandan Jan 3, 2021

Yes, but as far as I can tell this only limits the maximum number of elements to 2^32 for each batch/array on the right?
I think the string data per right record batch itself should be able to hold more than 4GB, because the string offsets are stored separately, or am I wrong about that?
If this becomes a problem in the future, we can change it to use u64 for the right indices as well, I think without much extra cost besides doubled allocation/memory usage.

Member

The Arrow spec supports both 32-bit and 64-bit variants for the offset vector, so in the 32-bit case, each batch would be limited to 2^32 bytes of variable-width data per array.

Contributor Author

@Dandandan Dandandan Jan 3, 2021

What I meant is that the type of the index used in the hash join has no effect on the type used for the string array offsets. So the offset values inside the string array might be 64-bit, while here we use 32-bit to index into the string elements, which would allow more than 4GB. Anyway, at some point there is a limit we should keep in mind 👍

Contributor Author

Also FYI, for the left side we are now using 64-bit indices, so the 2^32 limit currently should only apply to the number of elements per right batch, which seems reasonable to me (and that constraint would be easy to get rid of). On the other hand, a future way to save some memory could be to choose 32-bit indices for the left side whenever there are fewer than 2^32 items in the build side.

Contributor Author

Ah I see what you mean now...

Contributor Author

@Dandandan Dandandan Jan 3, 2021

That would be a bit less trivial to get rid of than I thought, but at some point you could make a case either to convert those string arrays to use 64-bit offsets or to split the batches at the 2^32 boundary, or something like that.

Contributor Author

Thanks for the explanation @andygrove 👍

Contributor

FWIW I would love to see a join approach for larger datasets that doesn't require giant buffers (e.g. 64-bit indexes) but a runtime switch over to using sort/merge join or its equivalent. But that is a pipe dream for me at the moment.

Member

@andygrove andygrove left a comment

This LGTM and I re-tested and still see a ~2x speedup for TPC-H q12 @ SF=100 with batch size 4096.

I added a comment noting the restriction when using 32-bit offsets. I think it would be reasonable to address this in the future if it becomes an issue for anyone.

@jorgecarleitao
Member

@andygrove, isn't this going in the direction we were discussing in https://issues.apache.org/jira/browse/ARROW-11058 with respect to the current lack of benefits in having multiple batches (though you just mentioned a benefit: indexing size)?

@andygrove
Member

andygrove commented Jan 3, 2021

Yes @jorgecarleitao reducing the number of batches makes sense here. I am just slightly concerned that always reducing to a single batch may be introducing a constraint that could bite us in the future, although this is realistically only an issue when using 32-bit offsets for variable-width arrays.

@Dandandan
Contributor Author

@jorgecarleitao @andygrove
Do you think this design is good enough for now?
For now it means anyone storing > 4GB of variable-size data in a build-side column should opt to use 64-bit offsets; in the future, one option would be to automatically detect this and apply a cast when needed. I think that would have the fewest performance implications.

@andygrove
Member

@Dandandan Yes, I personally think this is fine for now (hence the approval) and since no-one has objected I think we can go ahead and merge this.

Member

@jorgecarleitao jorgecarleitao left a comment

:shipit:

@Dandandan
Contributor Author

FYI I am planning to work on the first changes for https://issues.apache.org/jira/browse/ARROW-11112 based on this PR/branch later this week, to start getting rid of the hash key creation and to prepare the algorithm for a vectorized implementation.

Contributor

@alamb alamb left a comment

Thanks @Dandandan !
