[Enhancement][Cherry-pick][Branch-2.5] Optimize memory usage of primary key table large load (#12068) #13744
Merged
Conversation
clang-tidy review says "All clean, LGTM! 👍"
auto-merge was automatically disabled November 25, 2022 06:37 (head branch was pushed to by a user without write access)
chaoyli approved these changes on Nov 25, 2022
run starrocks_be_unittest
chaoyli changed the title to [Enhancement][Cherry-pick][Branch-2.5] Optimize memory usage of primary key table large load (#12068) on Nov 27, 2022
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No coverage information.
What type of PR is this:
Which issues does this PR fix:
Fixes #
Problem Summary (Required):
Currently, RowsetUpdateState::load preloads all segments' primary keys into memory. If the load (rowset) is very large, this consumes a lot of memory during the commit or apply phase.
For a large load (rowset), we no longer preload all segments' primary keys; instead we process segment by segment, which reduces memory usage during apply.
Note that the limit is a soft limit: because an apply is not allowed to fail, memory usage may still exceed it.
Test environment: one BE with two HDDs, loading TPC-DS data via Broker Load into a table with a persistent index. The table was created with:
CREATE TABLE `store_sales` (
  `ss_item_sk` bigint(20) NOT NULL COMMENT "",
  `ss_ticket_number` bigint(20) NOT NULL COMMENT "",
  `ss_sold_date_sk` bigint(20) NULL COMMENT "",
  `ss_sold_time_sk` bigint(20) NULL COMMENT "",
  `ss_customer_sk` bigint(20) NULL COMMENT "",
  `ss_cdemo_sk` bigint(20) NULL COMMENT "",
  `ss_hdemo_sk` bigint(20) NULL COMMENT "",
  `ss_addr_sk` bigint(20) NULL COMMENT "",
  `ss_store_sk` bigint(20) NULL COMMENT "",
  `ss_promo_sk` bigint(20) NULL COMMENT "",
  `ss_quantity` bigint(20) NULL COMMENT "",
  `ss_wholesale_cost` decimal64(7, 2) NULL COMMENT "",
  `ss_list_price` decimal64(7, 2) NULL COMMENT "",
  `ss_sales_price` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_discount_amt` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_sales_price` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_wholesale_cost` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_list_price` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_tax` decimal64(7, 2) NULL COMMENT "",
  `ss_coupon_amt` decimal64(7, 2) NULL COMMENT "",
  `ss_net_paid` decimal64(7, 2) NULL COMMENT "",
  `ss_net_paid_inc_tax` decimal64(7, 2) NULL COMMENT "",
  `ss_net_profit` decimal64(7, 2) NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`ss_item_sk`, `ss_ticket_number`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`ss_item_sk`, `ss_ticket_number`) BUCKETS 2
PROPERTIES (
  "replication_num" = "1",
  "in_memory" = "false",
  "storage_format" = "DEFAULT",
  "enable_persistent_index" = "true",
  "compression" = "LZ4"
);
| PrimaryKey length | RowNum | BucketNum | Load time (s) | Apply time (ms) | Peak memory usage (GB) | Note |
|---|---|---|---|---|---|---|
| 16 bytes | 864001869 | 2 | 7643 | 355200 | 25.03 | branch-opt |
| 16 bytes | 864001869 | 2 | 7591 | 348465 | 46.45 | branch-main |
| 16 bytes | 864001869 | 100 | 7194 | 32705 | 25.11 | branch-opt |
| 16 bytes | 864001869 | 100 | 7104 | 30705 | 43.14 | branch-main |

Note: there are still some scenarios this PR does not resolve:
- In partial updates, the read-column data may be very large; this PR does not address that.
- We still need to load all primary keys into L0 of the persistent index first, which may cause OOM.
Checklist: