One TiKV instance keeps OOM when using "lightning tidb backend" to import 2T data #22964

Closed
cosven opened this issue Feb 26, 2021 · 13 comments

cosven commented Feb 26, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. deploy a TiDB cluster with 3 TiDB / 3 PD / 5 TiKV nodes
  2. generate CSV data with go-tpc
  3. import the CSV with Lightning using the 'tidb' backend

2. What did you expect to see? (Required)

No OOM.

3. What did you see instead (Required)

One TiKV instance OOMs again and again.

4. What is your TiDB version? (Required)

lightning: v5.0.0-rc

TiKV:

root@726fbd54d142:/disk1/deploy/tikv-20160# bin/tikv-server -V
TiKV
Release Version:   5.0.0-rc.x
Edition:           Community
Git Commit Hash:   695d143c2bf68f13e2dd5da7dd1e0d8acd2448c7
Git Commit Branch: master
UTC Build Time:    2021-02-24 09:52:52
Rust Version:      rustc 1.51.0-nightly (1d0d76f8d 2021-01-24)
Enable Features:   jemalloc mem-profiling portable sse protobuf-codec test-engines-rocksdb cloud-aws cloud-gcp
Profile:           dist_release

TiDB:

Release Version: v4.0.0-beta.2-2187-g1970a917c
Edition: Community
Git Commit Hash: 1970a917c175665c3510ea57a1ea1d417e34f4ee
Git Branch: master
UTC Build Time: 2021-02-24 13:06:30
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
cosven added the type/bug (The issue is confirmed as a bug.) and severity/major labels on Feb 26, 2021

cosven commented Feb 26, 2021

/severity critical

jebter commented Feb 26, 2021

/cc @zhangjinpeng1987 @innerr

overvenus commented:

What's your lightning version?

cosven commented Mar 1, 2021

> What's your lightning version?

v5.0.0-rc.

cosven commented Mar 2, 2021

It can be reproduced with the following steps:

  1. start a cluster with nightly or release-5.0-nightly
  2. insert data into TiDB with go-tpc -H xxx -P 3306 -D tpcc tpcc prepare -T 16 --warehouses 30000
  3. wait for several minutes
  4. one of the five TiKV servers keeps OOMing

hicqu commented Mar 3, 2021

With jeprof we got an allocation map: a.svg.zip.

We can see that, besides the 4 GB block cache, about 5 GB of memory is allocated in fetch_entries_to and its inner calls. I guess those entries are stored in EntryCache.

In this case there are about 40k entries per TiKV instance, so besides the block cache, EntryCache is the main reason the OOM killer is triggered.

hicqu commented Mar 4, 2021

It's not actually the case that the entries are staying in EntryCache: from a.svg.zip we can see that those entries are created in gen_light_ready, which means they are committed entries. Committed entries won't be put into EntryCache again; they can only be sent to the apply workers' input channel.

The relevant code is roughly:

peers = fetch_fsm();     // at most 256 peer FSMs per batch
for fsm in peers {
    fsm.handle_normal(); // generates a Ready for every FSM
}
peers.end();             // handles all Readys and sends committed entries to the apply workers

So if a peer has about 40 MB of committed raft logs, the total memory held by committed_entries across a 256-peer batch can reach about 10 GB (256 × 40 MB), which can easily cause an OOM.

hicqu commented Mar 4, 2021

I suggest merging tikv/raft-rs#356 and introducing that feature into 5.0 to avoid the OOM.
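
For reference, a minimal sketch of how the committed-entries pagination from tikv/raft-rs#356 might be enabled. This assumes the raft-rs Config field is named max_committed_size_per_ready; the sizes below are illustrative, not TiKV's defaults.

use raft::Config;

// Sketch only: cap how many bytes of committed entries a single Ready may carry,
// so a large backlog of committed raft logs is drained in smaller chunks.
fn raft_config_with_pagination(peer_id: u64) -> Config {
    let mut cfg = Config::new(peer_id);
    cfg.max_size_per_msg = 1024 * 1024;                  // cap on appended entries per message
    cfg.max_committed_size_per_ready = 16 * 1024 * 1024; // assumed field from raft-rs#356
    cfg
}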

jebter added this to the v5.0.0 GA milestone on Mar 21, 2021

NingLin-P commented:

tikv/raft-rs#356 may not solve the problem completely, because it only limits the committed entries within each Ready. If TiKV does not apply logs fast enough while the store thread runs fast and keeps producing Readys without any limit, it can still produce multiple Readys covering all committed entries and cause an OOM like the one in this issue.
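
In other words, some global budget over all in-flight committed entries is also needed, not just a per-Ready cap. A purely illustrative sketch of that idea (not TiKV's implementation; the function names and the 1 GiB limit are made up for this example):

use std::sync::atomic::{AtomicUsize, Ordering};

// Total bytes of committed entries handed to the apply workers but not yet applied.
static PENDING_APPLY_BYTES: AtomicUsize = AtomicUsize::new(0);
const PENDING_APPLY_LIMIT: usize = 1 << 30; // hypothetical 1 GiB budget

// Store thread: only forward more committed entries while under the budget,
// otherwise back off and let the apply workers catch up first.
fn try_send_to_apply(entries_bytes: usize) -> bool {
    if PENDING_APPLY_BYTES.load(Ordering::Relaxed) + entries_bytes > PENDING_APPLY_LIMIT {
        return false;
    }
    PENDING_APPLY_BYTES.fetch_add(entries_bytes, Ordering::Relaxed);
    true
}

// Apply worker: release the budget once the entries have been applied and dropped.
fn on_applied(entries_bytes: usize) {
    PENDING_APPLY_BYTES.fetch_sub(entries_bytes, Ordering::Relaxed);
}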

cosven commented Apr 2, 2021

I reproduced the problem with the following workload and topology; there is no panic before the OOM.

workload

go-tpc -H 10.0.2.27 -P 4000 -D tpcc10k tpcc prepare -T 100 --warehouses 10000

topology

  • each TiDB: 16c32G
  • each PD: 8c16G
  • each TiKV: 8c16G
server_configs:
  tidb:
    enable-forwarding: true
  tikv: {}
  pd: {}
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
tidb_servers:
- host: 10.0.2.27
  ssh_port: 22
  port: 4000
  status_port: 10080
tikv_servers:
- host: 10.0.2.21
  ssh_port: 22
  port: 20160
  status_port: 20180
- host: 10.0.2.23
  ssh_port: 22
  port: 20160
  status_port: 20180
- host: 10.0.2.28
  ssh_port: 22
  port: 20160
  status_port: 20180
tiflash_servers: []
pd_servers:
- host: 10.0.2.25
  ssh_port: 22
  name: ""
  client_port: 2379
  peer_port: 2380
- host: 10.0.2.19
  ssh_port: 22
  name: ""
  client_port: 2379
  peer_port: 2380
- host: 10.0.2.26
  ssh_port: 22
  name: ""
  client_port: 2379
  peer_port: 2380

The 10.0.2.23 TiKV instance keeps OOMing. Some metrics screenshots:

[metrics screenshots omitted]

cosven commented Apr 6, 2021

Reproduced with the following workload:

go-tpc -H $tidbHost -P $tidbPort -D tpcc10k tpcc prepare \
  -T 200 --warehouses 10000


Topology

  • 1 TiDB: 8c16G
  • 1 PD: 8c16G
  • 5 TiKV: 8c16G

hicqu commented Nov 22, 2021

Fixed by tikv/tikv#10334.

github-actions (bot) commented:

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.
