Aggregator support batch serialize #9777
Signed-off-by: guo-shaoge <[email protected]>
dbms/src/Common/ColumnsHashing.h
Outdated
class KeyStringBatchHandlerBase
{
private:
    size_t batch_row_idx = 0;
Maybe rename this to processed_row_count to make it easier to understand.
ok
dbms/src/Common/ColumnsHashing.h
Outdated
class KeySerializedBatchHandlerBase
{
private:
    size_t batch_row_idx = 0;
ditto
RUNTIME_CHECK(max_batch_size >= 256);
batch_row_idx = start_row;
sort_key_containers.resize(max_batch_size);
batch_rows.reserve(max_batch_size);
This looks a little strange: one container uses resize while the other uses reserve.
batch_rows will be resized each time prepareNextBatchType is called.
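To illustrate the resize/reserve distinction in this exchange, here is a minimal sketch. `BatchBuffers` and `prepareNextBatch` are hypothetical stand-ins, not the TiFlash classes, and the per-batch logic is an assumption based only on the reply above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the batch handler discussed above; not TiFlash code.
struct BatchBuffers
{
    std::vector<std::string> sort_key_containers; // indexed directly, so it needs real elements
    std::vector<std::size_t> batch_rows;          // resized per batch, so capacity is enough up front

    explicit BatchBuffers(std::size_t max_batch_size)
    {
        sort_key_containers.resize(max_batch_size); // size() == max_batch_size
        batch_rows.reserve(max_batch_size);         // size() == 0, capacity() >= max_batch_size
    }

    // Assumed per-batch step: resize to the actual batch size. No reallocation
    // is needed while cur_batch_size stays within the reserved capacity.
    void prepareNextBatch(std::size_t start_row, std::size_t cur_batch_size)
    {
        assert(cur_batch_size <= batch_rows.capacity());
        batch_rows.resize(cur_batch_size);
        for (std::size_t i = 0; i < cur_batch_size; ++i)
            batch_rows[i] = start_row + i;
    }
};
```

Under these assumptions, reserve avoids value-initializing elements that the next resize would overwrite anyway, while resize is required for the container that is indexed immediately.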
void init(size_t start_row, const ColumnRawPtrs & key_columns, const TiDB::TiDBCollators & collators)
{
    batch_row_idx = start_row;
    byte_size.resize_fill_zero(key_columns[0]->size());
Since we already have start_row here, do we need to resize to the total size of key_columns[0]?
countSerializeByteSizeForCmp doesn't have a start_row parameter, so we have to resize byte_size to the total size.
byte_size.resize_fill_zero(key_columns[0]->size());
RUNTIME_CHECK(!byte_size.empty());
for (size_t i = 0; i < key_columns.size(); ++i)
    key_columns[i]->countSerializeByteSizeForCmp(
From the following code, each non-null, non-collation string column adds sizeof(UInt32) to byte_size. With multiple string columns, do we need to add this for each one?
tiflash/dbms/src/Columns/ColumnString.cpp
Line 582 in 94402d5
byte_size[i] += sizeof(UInt32) + sizeAt(i);
I figured it out: we do need to add it for each column, as a kind of separator.
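A small sketch of why a per-column length prefix works as a separator. The helper names below are hypothetical and the layout is only an illustration of the idea, not the actual ColumnString serialization format.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: serialize one row of string key columns, each value
// prefixed by its length as a 4-byte integer.
std::string serializeWithLengthPrefix(const std::vector<std::string> & columns)
{
    std::string out;
    for (const auto & value : columns)
    {
        const auto len = static_cast<std::uint32_t>(value.size());
        char buf[sizeof(len)];
        std::memcpy(buf, &len, sizeof(len));
        out.append(buf, sizeof(len));
        out.append(value);
    }
    return out;
}

// Hypothetical helper: plain concatenation, which loses the column boundaries.
std::string serializeWithoutPrefix(const std::vector<std::string> & columns)
{
    std::string out;
    for (const auto & value : columns)
        out.append(value);
    return out;
}

// Mirrors the accounting in the quoted line: every column contributes
// sizeof(UInt32) plus its payload size.
std::size_t countByteSize(const std::vector<std::string> & columns)
{
    std::size_t total = 0;
    for (const auto & value : columns)
        total += sizeof(std::uint32_t) + value.size();
    return total;
}
```

With plain concatenation, the rows ("ab", "c") and ("a", "bc") both serialize to "abc" and would be treated as the same key. With the prefix on every column, the two rows produce different byte sequences.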
        &sort_key_container);

for (size_t i = 0; i < cur_batch_size; ++i)
    real_byte_size[i] = pos[i] - ori_pos[i];
When will real_byte_size differ from byte_size?
For string columns, the memory is currently pre-allocated based on the maximum space a UTF-8 character can occupy. In practice, many characters need less than that. So byte_size[i] for a string column is the maximum possible size rather than the size actually used, which is what real_byte_size[i] records.
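The pattern described here (reserve an upper bound, write, then measure how far the write pointer moved) can be sketched as follows. This is an illustration under stated assumptions, not the TiFlash implementation: `serializeBatch`, `encodeUtf8`, and the 4-bytes-per-character bound are hypothetical choices made for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption for this sketch: a UTF-8 character needs at most 4 bytes.
constexpr std::size_t max_utf8_bytes_per_char = 4;

// Encode one code point as UTF-8 at pos and return the advanced pointer.
char * encodeUtf8(char32_t cp, char * pos)
{
    if (cp < 0x80)
    {
        *pos++ = static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        *pos++ = static_cast<char>(0xC0 | (cp >> 6));
        *pos++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        *pos++ = static_cast<char>(0xE0 | (cp >> 12));
        *pos++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *pos++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        *pos++ = static_cast<char>(0xF0 | (cp >> 18));
        *pos++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *pos++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *pos++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return pos;
}

struct SerializedSizes
{
    std::vector<std::size_t> byte_size;      // pre-allocated upper bound per row
    std::vector<std::size_t> real_byte_size; // bytes actually written per row
};

SerializedSizes serializeBatch(const std::vector<std::u32string> & rows)
{
    SerializedSizes sizes;
    std::size_t total = 0;
    for (const auto & row : rows)
    {
        sizes.byte_size.push_back(row.size() * max_utf8_bytes_per_char);
        total += sizes.byte_size.back();
    }

    std::vector<char> buffer(total);
    char * slot = buffer.data();
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        char * ori_pos = slot; // start of this row's pre-allocated slot
        char * pos = ori_pos;  // write cursor
        for (char32_t cp : rows[i])
            pos = encodeUtf8(cp, pos);
        sizes.real_byte_size.push_back(static_cast<std::size_t>(pos - ori_pos));
        slot += sizes.byte_size[i];
    }
    return sizes;
}
```

For an ASCII-only row the measured size is a quarter of the reserved size in this sketch, which is the kind of gap between byte_size and real_byte_size the reply describes.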
    aggregates_pool,
    sort_key_containers);

if constexpr (enable_prefetch)
Is it expected that hashvals is not updated when enable_prefetch is false?
Yes. When enable_prefetch is false and batch_get_key_holder is true, emplaceKey(method, state, key_holder) is used to emplace into the HashTable, so there is no need to init hashvals.
hashvals is only used for prefetch, so there is no need to compute hash values in other situations.
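A minimal sketch of the compile-time branch being discussed, assuming C++17. `emplaceBatch` and the use of `std::unordered_set` are hypothetical simplifications; the real code emplaces into TiFlash's own hash table and uses the hash values to issue prefetches.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical simplification: hashvals is filled only when prefetch is
// enabled. In the other instantiation the branch is discarded at compile time.
template <bool enable_prefetch>
std::size_t emplaceBatch(
    const std::vector<std::string> & keys,
    std::unordered_set<std::string> & table,
    std::vector<std::size_t> & hashvals)
{
    if constexpr (enable_prefetch)
    {
        hashvals.resize(keys.size());
        for (std::size_t i = 0; i < keys.size(); ++i)
            hashvals[i] = std::hash<std::string>{}(keys[i]);
        // A prefetching hash table would use hashvals here to touch the
        // target buckets ahead of the actual insert.
    }

    std::size_t inserted = 0;
    for (const auto & key : keys)
        inserted += table.insert(key).second ? 1 : 0;
    return inserted;
}
```

Calling `emplaceBatch<false>` leaves hashvals untouched, which matches the behavior the replies describe as expected.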
What problem does this PR solve?
Issue Number: close #9692
Problem Summary: Reduce virtual function calls for key_serialize and key_string.
Workload: tpch-50g
Queries: same as #9679
Workload: clickbench
Queries: https://github.com/ClickHouse/ClickBench/blob/fdfdb5d94f2a668dce1f63d55498aa34510e4c9c/clickhouse/queries.sql#L11
NOTE: two_keys_num64_strbinpadding. opt key is key_serialized.
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note