
sql: cluster falls over trying to create 100.000 tables in db #40629

Closed
ParkJungsuk opened this issue Sep 10, 2019 · 8 comments
Labels
A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors

Comments

@ParkJungsuk

Describe the problem

I tried to create more than 100,000 tables in a single database using some scripts, but the cluster crashed.

To Reproduce


  1. I deployed a cluster with 3 nodes on K8s, following the instructions in https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes-insecure.html:
helm install --name bench stable/cockroachdb
  2. After the cluster was available, I created a database and then tried to create a large number of empty tables. The script is something like:

go.sh:

#!/bin/bash

concurrency=10
total_tables=1000000
table_per_concurrency=$(( $total_tables / $concurrency ))

# run processes and store pids in array
for (( i=0; i<$concurrency; i++ ))
do
    first=$(( i * table_per_concurrency ))
    last=$(( (i + 1) * table_per_concurrency - 1 ))
    ./create_table.sh $first $last 1>/dev/null 2>&1 &
    pids[$i]=$!
done

# wait for all pids
for pid in "${pids[@]}"; do
    wait $pid
done

create_table.sh:

#!/bin/bash

first=$1
last=$2

for (( i=$first; i<=$last; i++ ))
do
     ./cockroach sql --insecure --host $HOST -e "create table bench.table$i(a int);"
done
  3. After several hours, I could see about 10,000 tables created, and then the cluster became unavailable:
# kubectl get pod
NAME                                    READY   STATUS      RESTARTS   AGE
bench-cockroachdb-0                     0/1     Running     43         31h
bench-cockroachdb-1                     0/1     Running     0          31h
bench-cockroachdb-2                     0/1     Running     2          31h
bench-cockroachdb-init-n74ht            0/1     Completed   0          31h

There's also an error message from the UI:

(screenshot: UI error message)

I can see some warnings in the logs, but I have no idea what happened:

W190909 07:29:42.241800 122 gossip/gossip.go:1499  [n3] first range unavailable; resolvers exhausted
I190909 07:29:42.242778 1573060 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n3] circuitbreaker: gossip [::]:26257->bench-cockroachdb-0.bench-cockroachdb.pks.svc.cluster.local:26257 event: BreakerReset
I190909 07:29:42.242803 1573060 gossip/client.go:128  [n3] started gossip client to bench-cockroachdb-0.bench-cockroachdb.pks.svc.cluster.local:26257
I190909 07:29:43.841894 1573046 gossip/client.go:128  [n3] started gossip client to bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257
I190909 07:29:43.842898 1573046 gossip/client.go:133  [n3] closing client to n2 (bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257): stopping outgoing client to n2 (bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257); already have incoming
I190909 07:29:44.443128 2878 server/status/runtime.go:500  [n3] runtime stats: 4.3 GiB RSS, 509 goroutines, 1.3 GiB/1.1 GiB/2.9 GiB GO alloc/idle/total, 1.0 GiB/1.2 GiB CGO alloc/total, 2968.3 CGO/sec, 224.8/26.6 %(u/s)time, 0.0 %gc (3x), 777 KiB/19 MiB (r/w)net
I190909 07:29:44.842976 1572370 gossip/client.go:128  [n3] started gossip client to bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257
I190909 07:29:44.943996 1572370 gossip/client.go:133  [n3] closing client to n2 (bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257): stopping outgoing client to n2 (bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257); already have incoming
W190909 07:29:44.982325 9256 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {bench-cockroachdb-0.bench-cockroachdb.pks.svc.cluster.local:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...

Environment:

  • CockroachDB version: v19.1.3
  • Server OS: CentOS Linux release 7.6.1810 (Core)
  • Client app: cockroach sql

Additional context
What was the impact?
We may need a large number of databases and tables; this was just a test for that.
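For reference, the create_table.sh loop above starts a new `cockroach sql` process for every table. A batched variant could send several semicolon-separated CREATE TABLE statements per client invocation; the sketch below is only an illustration (the `run_sql` wrapper, the batch size of 50, and the executable check are assumptions, not part of the original report):

```shell
#!/bin/bash
# Sketch: create batch_size tables per client invocation instead of one.

run_sql() {
    # Assumed wrapper around the real client call; `cockroach sql -e`
    # accepts multiple semicolon-separated statements in one flag.
    ./cockroach sql --insecure --host "$HOST" -e "$1"
}

create_batched() {
    local first=$1 last=$2 batch_size=$3 i j stmts
    i=$first
    while (( i <= last )); do
        stmts=""
        for (( j = i; j < i + batch_size && j <= last; j++ )); do
            stmts+="CREATE TABLE bench.table$j (a INT); "
        done
        run_sql "$stmts"          # one process per batch, not per table
        (( i += batch_size ))
    done
}

# Only run against a real cluster if the binary is present.
if [[ -x ./cockroach ]]; then
    create_batched "${1:-0}" "${2:-99}" 50
fi
```

This keeps the same table names as the original script, so the two approaches are interchangeable for the repro.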

@ricardocrdb ricardocrdb added the O-community Originated from the community label Sep 10, 2019
@ricardocrdb ricardocrdb self-assigned this Sep 24, 2019
@ricardocrdb

Hey @ParkJungsuk

Just a few questions to start out. I understand that you are running a k8s cluster, but what type of hardware are these pods running on? How are hardware resources allocated to these pods?

Also, have you tried dropping the concurrency in your script to 1 and making sure the task completes at that rate before ramping up? Please let me know whether this has been tested.
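One way to run that ramp-up test (a sketch, not from the thread; the command-line parameters are an assumption) is to take the concurrency from the command line so the original go.sh can start at 1 worker and be re-run with higher values:

```shell
#!/bin/bash
# Sketch: parameterize go.sh so concurrency can be ramped between runs.
concurrency=${1:-1}       # start with a single worker
total_tables=${2:-1000}
table_per_concurrency=$(( total_tables / concurrency ))
echo "plan: $concurrency worker(s) x $table_per_concurrency tables"
```

The rest of the original script (spawning `create_table.sh` workers and waiting on their PIDs) stays unchanged.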

@knz
Contributor

knz commented Sep 26, 2019

@dt @bdarnell this is another data point for what we were discussing the other day. You said 8000 was fine, here we have evidence that 100.000 is not.

@knz knz added A-schema-descriptors Relating to SQL table/db descriptor handling. C-investigation Further steps needed to qualify. C-label will change. labels Sep 26, 2019
@ricardocrdb

Hey @ParkJungsuk, just wanted to follow up: could you provide the information we requested about the resources and the concurrency? Let me know if you have any questions.

@ricardocrdb

Closing due to inactivity. If you are still having the issue, please feel free to respond to this thread. We want to help!

@knz
Contributor

knz commented Nov 21, 2019

@ricardocrdb I'm going to re-open this but also take it out of your hands. We had an eng discussion yesterday about this very topic and we need an open issue to track.

@knz knz reopened this Nov 21, 2019
@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-investigation Further steps needed to qualify. C-label will change. labels Nov 21, 2019
@knz knz changed the title CockroachDB stopped serving when trying to create large number of tables sql: cluster falls over trying to create 100.000 tables in db Nov 21, 2019
@knz knz added the S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors label Nov 21, 2019
@github-actions

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz
Contributor

knz commented Jun 4, 2021

still current

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity labels Jun 4, 2021
@ajwerner
Contributor

ajwerner commented Jun 8, 2021

This is known and we are working to eliminate this bottleneck. I'm closing this issue because this class of problem is tracked better in subsequent issues like #63206.

@ajwerner ajwerner closed this as completed Jun 8, 2021