
[Enhancement][Cherry-pick][Branch-2.5] Optimize memory usage of primary key table large load (#12068) #13744

Merged

chaoyli merged 2 commits into StarRocks:branch-2.5 on Nov 27, 2022

Conversation

@sevev (Contributor) commented Nov 21, 2022

What type of PR is this:

  - [ ] BugFix
  - [ ] Feature
  - [x] Enhancement
  - [ ] Refactor
  - [ ] UT
  - [ ] Doc
  - [ ] Tool

Which issues does this PR fix:

Fixes #

Problem Summary (Required):

Currently, RowsetUpdateState::load preloads the primary keys of all segments into memory. If the load (rowset) is very large, this consumes a lot of memory during the commit or apply phase.

For a large load (rowset), we no longer preload every segment's primary keys; instead we process the rowset segment by segment, which reduces memory usage during apply.

Note that the limit is a soft limit: an apply is never allowed to fail, so memory usage may still exceed the limit.
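
A minimal C++ sketch of the idea, not the actual RowsetUpdateState code: Segment, kLargeLoadThreshold, and all helper names here are hypothetical.

#include <cstddef>
#include <vector>

// Hypothetical illustration only; none of these names come from the
// StarRocks code base.
struct Segment {
    size_t num_rows;
};

// Assumed soft limit above which a load counts as "large" (1 GiB here).
constexpr size_t kLargeLoadThreshold = 1UL << 30;

void load_segment_pks(const Segment&) { /* read this segment's PK column */ }
void apply_segment(const Segment&) { /* update indexes for this segment */ }
void release_segment_pks(const Segment&) { /* free the PK column */ }

// Rough estimate of the memory needed to hold every segment's primary keys.
size_t estimate_pk_bytes(const std::vector<Segment>& segments, size_t key_len) {
    size_t total = 0;
    for (const auto& seg : segments) total += seg.num_rows * key_len;
    return total;
}

void apply_rowset(const std::vector<Segment>& segments, size_t key_len) {
    if (estimate_pk_bytes(segments, key_len) <= kLargeLoadThreshold) {
        // Small load: preload every segment's primary keys up front,
        // as before this change.
        for (const auto& seg : segments) load_segment_pks(seg);
        for (const auto& seg : segments) apply_segment(seg);
    } else {
        // Large load: handle one segment at a time, so peak memory is
        // bounded by one segment's keys rather than the whole rowset.
        // The bound is soft: the segment in flight can still overshoot,
        // because the apply must never fail.
        for (const auto& seg : segments) {
            load_segment_pks(seg);
            apply_segment(seg);
            release_segment_pks(seg);
        }
    }
}

The point of the sketch is only that peak memory drops from the sum over all segments to roughly one segment's keys, at the cost of loading keys once per segment.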

Test environment: one BE with two HDDs, loading TPC-DS data via Broker Load into a table with a persistent index.

Create table SQL:

CREATE TABLE `store_sales` (
  `ss_item_sk` bigint(20) NOT NULL COMMENT "",
  `ss_ticket_number` bigint(20) NOT NULL COMMENT "",
  `ss_sold_date_sk` bigint(20) NULL COMMENT "",
  `ss_sold_time_sk` bigint(20) NULL COMMENT "",
  `ss_customer_sk` bigint(20) NULL COMMENT "",
  `ss_cdemo_sk` bigint(20) NULL COMMENT "",
  `ss_hdemo_sk` bigint(20) NULL COMMENT "",
  `ss_addr_sk` bigint(20) NULL COMMENT "",
  `ss_store_sk` bigint(20) NULL COMMENT "",
  `ss_promo_sk` bigint(20) NULL COMMENT "",
  `ss_quantity` bigint(20) NULL COMMENT "",
  `ss_wholesale_cost` decimal64(7, 2) NULL COMMENT "",
  `ss_list_price` decimal64(7, 2) NULL COMMENT "",
  `ss_sales_price` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_discount_amt` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_sales_price` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_wholesale_cost` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_list_price` decimal64(7, 2) NULL COMMENT "",
  `ss_ext_tax` decimal64(7, 2) NULL COMMENT "",
  `ss_coupon_amt` decimal64(7, 2) NULL COMMENT "",
  `ss_net_paid` decimal64(7, 2) NULL COMMENT "",
  `ss_net_paid_inc_tax` decimal64(7, 2) NULL COMMENT "",
  `ss_net_profit` decimal64(7, 2) NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`ss_item_sk`, `ss_ticket_number`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`ss_item_sk`, `ss_ticket_number`) BUCKETS 2
PROPERTIES (
"replication_num" = "1",
"in_memory" = "false",
"storage_format" = "DEFAULT",
"enable_persistent_index" = "true",
"compression" = "LZ4"
);
PrimaryKey Length	RowNum	BucketNum	Load time(s)	Apply time(ms)	Peak Memory usage(GB)	Note
16 Bytes	864001869	2	7643	355200	25.03	branch-opt
16 Bytes	864001869	2	7591	348465	46.45	branch-main
16 Bytes	864001869	100	7194	32705	25.11	branch-opt
16 Bytes	864001869	100	7104	30705	43.14	branch-main

Note: there are still some scenarios this PR does not resolve:

  • In a partial update, the column data that must be read back may be very large; this PR does not address that.
  • We still need to load all primary keys into L0 of the persistent index first, which may cause OOM (a rough size estimate follows below).
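
To put the L0 concern in perspective, here is a back-of-the-envelope estimate in C++ using the row count and key length from the table above. The 8-byte per-entry value size is my assumption for illustration, not a figure from this PR.

#include <cstdio>

int main() {
    // Figures from the test above: 864,001,869 rows, 16-byte primary keys.
    const double num_rows = 864001869.0;
    const double key_len = 16.0;
    // Assumption: each index entry also stores an 8-byte value
    // (e.g. a packed rowset/row id); illustrative only.
    const double value_len = 8.0;

    const double l0_bytes = num_rows * (key_len + value_len);
    printf("estimated L0 footprint: %.1f GiB\n",
           l0_bytes / (1024.0 * 1024.0 * 1024.0));  // ~19.3 GiB
    return 0;
}

An in-memory structure of that size grows with the total key count regardless of segment-by-segment apply, which is why preloading every key into L0 can still OOM.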

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr will affect users' behaviors
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

@github-actions

clang-tidy review says "All clean, LGTM! 👍"

auto-merge was automatically disabled November 25, 2022 06:37

Head branch was pushed to by a user without write access

@github-actions github-actions bot removed the be-build label Nov 25, 2022
@sevev
Contributor Author

sevev commented Nov 25, 2022

run starrocks_be_unittest

@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@chaoyli chaoyli changed the title [Enhancement][Cherry-pick][Branch-2.5] Optimize memory usage of primary key table large load (… [Enhancement][Cherry-pick][Branch-2.5] Optimize memory usage of primary key table large load (#12068) Nov 27, 2022
@chaoyli chaoyli merged commit fcc14b2 into StarRocks:branch-2.5 Nov 27, 2022
@github-actions github-actions bot removed the be-build label Nov 27, 2022
@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@sonarcloud

sonarcloud bot commented Nov 27, 2022

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (0 Code Smells)

No Coverage information
0.7% Duplication
