
sql: cluster falls over trying to create 100.000 tables in db #40629

Closed
ParkJungsuk opened this issue Sep 10, 2019 · 8 comments
Labels
A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors

Comments

@ParkJungsuk

Describe the problem

I tried to create more than 100,000 tables in a single database using some scripts, but the cluster crashed.

To Reproduce


  1. I deployed a cluster with 3 nodes on K8s, following the instructions in https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes-insecure.html:
helm install --name bench stable/cockroachdb
  2. After the cluster was available, I created a database and then tried to create a large number of empty tables. The script is something like:

go.sh:

#!/bin/bash

concurrency=10
total_tables=1000000
table_per_concurrency=$(( $total_tables / $concurrency ))

# run processes and store pids in array
for (( i=0; i<$concurrency; i++ ))
do
    first=$(( i * table_per_concurrency ))
    last=$(( (i + 1) * table_per_concurrency - 1 ))
    ./create_table.sh $first $last 1>/dev/null 2>&1 &
    pids[$i]=$!
done

# wait for all pids
for pid in "${pids[@]}"; do
    wait $pid
done

create_table.sh:

#!/bin/bash

first=$1
last=$2

for (( i=$first; i<=$last; i++ ))
do
     ./cockroach sql --insecure --host $HOST -e "create table bench.table$i(a int);"
done
  3. After several hours, I could see about 10,000 tables created, and then the cluster became unavailable:
# kubectl get pod
NAME                                    READY   STATUS      RESTARTS   AGE
bench-cockroachdb-0                     0/1     Running     43         31h
bench-cockroachdb-1                     0/1     Running     0          31h
bench-cockroachdb-2                     0/1     Running     2          31h
bench-cockroachdb-init-n74ht            0/1     Completed   0          31h

There's also an error message from the UI:

(screenshot: UI error message)

I can see some warnings in the logs, but I have no idea what happened:

W190909 07:29:42.241800 122 gossip/gossip.go:1499  [n3] first range unavailable; resolvers exhausted
I190909 07:29:42.242778 1573060 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n3] circuitbreaker: gossip [::]:26257->bench-cockroachdb-0.bench-cockroachdb.pks.svc.cluster.local:26257 event: BreakerReset
I190909 07:29:42.242803 1573060 gossip/client.go:128  [n3] started gossip client to bench-cockroachdb-0.bench-cockroachdb.pks.svc.cluster.local:26257
I190909 07:29:43.841894 1573046 gossip/client.go:128  [n3] started gossip client to bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257
I190909 07:29:43.842898 1573046 gossip/client.go:133  [n3] closing client to n2 (bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257): stopping outgoing client to n2 (bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257); already have incoming
I190909 07:29:44.443128 2878 server/status/runtime.go:500  [n3] runtime stats: 4.3 GiB RSS, 509 goroutines, 1.3 GiB/1.1 GiB/2.9 GiB GO alloc/idle/total, 1.0 GiB/1.2 GiB CGO alloc/total, 2968.3 CGO/sec, 224.8/26.6 %(u/s)time, 0.0 %gc (3x), 777 KiB/19 MiB (r/w)net
I190909 07:29:44.842976 1572370 gossip/client.go:128  [n3] started gossip client to bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257
I190909 07:29:44.943996 1572370 gossip/client.go:133  [n3] closing client to n2 (bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257): stopping outgoing client to n2 (bench-cockroachdb-2.bench-cockroachdb.pks.svc.cluster.local:26257); already have incoming
W190909 07:29:44.982325 9256 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {bench-cockroachdb-0.bench-cockroachdb.pks.svc.cluster.local:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...

Environment:

  • CockroachDB version: v19.1.3
  • Server OS: CentOS Linux release 7.6.1810 (Core)
  • Client app: cockroach sql

Additional context
What was the impact?
We may need a large number of databases and tables; this was just a test for that.
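For reference, the create_table.sh loop above starts a new `cockroach sql` process for every table. A batched variant could send several semicolon-separated CREATE TABLE statements per client invocation; the sketch below is only an illustration (the `run_sql` wrapper, the batch size of 50, and the executable check are assumptions, not part of the original report):

```shell
#!/bin/bash
# Sketch: create batch_size tables per client invocation instead of one.

run_sql() {
    # Assumed wrapper around the real client call; `cockroach sql -e`
    # accepts multiple semicolon-separated statements in one flag.
    ./cockroach sql --insecure --host "$HOST" -e "$1"
}

create_batched() {
    local first=$1 last=$2 batch_size=$3 i j stmts
    i=$first
    while (( i <= last )); do
        stmts=""
        for (( j = i; j < i + batch_size && j <= last; j++ )); do
            stmts+="CREATE TABLE bench.table$j (a INT); "
        done
        run_sql "$stmts"          # one process per batch, not per table
        (( i += batch_size ))
    done
}

# Only run against a real cluster if the binary is present.
if [[ -x ./cockroach ]]; then
    create_batched "${1:-0}" "${2:-99}" 50
fi
```

This keeps the same table names as the original script, so the two approaches are interchangeable for the repro.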

@ricardocrdb ricardocrdb added the O-community Originated from the community label Sep 10, 2019
@ricardocrdb ricardocrdb self-assigned this Sep 24, 2019
@ricardocrdb

Hey @ParkJungsuk

Just a few questions to start out. I understand that you are running a k8s cluster, but what type of hardware are these pods running on? How are hardware resources allocated to these pods?

Also, have you tried dropping the concurrency in your script to 1 and making sure the task completes at that rate before ramping up? Please let me know whether this has been tested.
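One way to run that ramp-up test (a sketch, not from the thread; the command-line parameters are an assumption) is to take the concurrency from the command line so the original go.sh can start at 1 worker and be re-run with higher values:

```shell
#!/bin/bash
# Sketch: parameterize go.sh so concurrency can be ramped between runs.
concurrency=${1:-1}       # start with a single worker
total_tables=${2:-1000}
table_per_concurrency=$(( total_tables / concurrency ))
echo "plan: $concurrency worker(s) x $table_per_concurrency tables"
```

The rest of the original script (spawning `create_table.sh` workers and waiting on their PIDs) stays unchanged.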

@knz
Contributor

knz commented Sep 26, 2019

@dt @bdarnell this is another data point for what we were discussing the other day. You said 8000 was fine, here we have evidence that 100.000 is not.

@knz knz added A-schema-descriptors Relating to SQL table/db descriptor handling. C-investigation Further steps needed to qualify. C-label will change. labels Sep 26, 2019
@ricardocrdb

Hey @ParkJungsuk, just wanted to follow up: could you provide the information we requested about the resources and the concurrency? Let me know if you have any questions.

@ricardocrdb

Closing due to inactivity. If you are still having the issue, please feel free to respond to this thread. We want to help!

@knz
Contributor

knz commented Nov 21, 2019

@ricardocrdb I'm going to re-open this but also take it out of your hands. We had an eng discussion yesterday about this very topic and we need an open issue to track.

@knz knz reopened this Nov 21, 2019
@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-investigation Further steps needed to qualify. C-label will change. labels Nov 21, 2019
@knz knz changed the title CockroachDB stopped serving when trying to create large number of tables sql: cluster falls over trying to create 100.000 tables in db Nov 21, 2019
@knz knz added the S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors label Nov 21, 2019
@github-actions

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz
Contributor

knz commented Jun 4, 2021

still current

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity labels Jun 4, 2021
@ajwerner
Contributor

ajwerner commented Jun 8, 2021

This is known and we are working to eliminate this bottleneck. I'm closing this issue because this class of problem is tracked better in subsequent issues like #63206.

@ajwerner ajwerner closed this as completed Jun 8, 2021