
Polynomial latency growth for database isolated ("Bridge") multi-tenant models #63206

Closed
aeneasr opened this issue Apr 7, 2021 · 18 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) X-blathers-triaged blathers was able to find an owner

Comments

aeneasr commented Apr 7, 2021

Describe the problem

We are running a multi-tenant system which isolates tenants on a per-database level ("Bridge model"). The documentation recommends using schemas ("Limit the number of databases you create. If you need to create multiple tables with the same name in your cluster, do so in different user-defined schemas, in the same database.").
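
For reference, the schema-based layout the documentation recommends would look roughly like this (a minimal sketch; the identities table is illustrative, not our actual migration):

-- Schema-per-tenant (what the docs recommend): one database, one user-defined schema per tenant.
CREATE DATABASE app;
USE app;
CREATE SCHEMA tenant_1;
CREATE TABLE tenant_1.identities (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    traits JSONB NOT NULL
);

-- Database-per-tenant ("Bridge"), which is what we use:
CREATE DATABASE tenant_1;
CREATE TABLE tenant_1.public.identities (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    traits JSONB NOT NULL
);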

In #60724 and our Cockroach Cloud Ticket 7926 we were told to rely on database isolation instead. Unfortunately, here too we find CockroachDB struggling with this multi-tenant model.

CockroachDB does not work well when stressful workloads and schema changes run in parallel - this is documented, and we have observed it in production as well. Effects of this issue are:

  1. Creating new tenants (databases + tables + indices; no data) takes increasingly more time the more tenants there are.
  2. Memory consumption increases over time - all of our Cockroach Cloud cluster nodes are at 85% memory usage.
  3. Queries become extremely slow with execution times of up to 60s for simple SELECT * FROM foo; (~500 rows) statements while tenant workloads (create database, tables, indices; no data) are running.
  4. The effect is observable on both multi- and single-node.

To Reproduce

To reproduce this, we clone one of our projects (you can find the migrations here), create a single-node instance of CRDB with 4GB RAM and 2 vCPUs, and then create 1000 tenants by creating 1000 databases and executing a bunch of CREATE TABLE, DROP, ALTER, ADD INDEX, ... statements:

git clone https://github.com/ory/kratos.git
docker rm -f kratos_test_database_cockroach || true
docker run -m 4GB --cpus=2 --name kratos_test_database_cockroach -p 3446:26257 -p 3447:8080 -d cockroachdb/cockroach:v20.2.7 start-single-node --insecure

rm migrate.log
export LOG_LEVEL=debug
export DSN=cockroach://root@127.0.0.1:3446

cd kratos
go build -tags sqlite -o kratos .

# Create 1000 databases and run the SQL schema changes for all 1000 databases
for i in {1..1000}
do
    docker exec kratos_test_database_cockroach cockroach sql --insecure --execute="CREATE DATABASE tenant_$i"
    ./kratos migrate sql --yes "$DSN/tenant_$i?sslmode=disable" >> migrate.log 2>&1
done 

I was running this workload overnight (8+ hours) and we are now at 526 databases, so the workload has not completed yet.

Expected behavior

Given that the databases are empty

root@:26257/defaultdb> use tenant_68;
SET

Time: 8ms total (execution 8ms / network 0ms)

root@:26257/tenant_68> show tables;
  schema_name |              table_name               | type  | owner | estimated_row_count
--------------+---------------------------------------+-------+-------+----------------------
  public      | continuity_containers                 | table | root  |                   0
  public      | courier_messages                      | table | root  |                   0
  public      | identities                            | table | root  |                   0
  public      | identity_credential_identifiers       | table | root  |                   0
  public      | identity_credential_types             | table | root  |                   0
  public      | identity_credentials                  | table | root  |                   0
  public      | identity_recovery_addresses           | table | root  |                   0
  public      | identity_recovery_tokens              | table | root  |                   0
  public      | identity_verifiable_addresses         | table | root  |                   0
  public      | identity_verification_tokens          | table | root  |                   0
  public      | schema_migration                      | table | root  |                 111
  public      | selfservice_errors                    | table | root  |                   0
  public      | selfservice_login_flow_methods        | table | root  |                   0
  public      | selfservice_login_flows               | table | root  |                   0
  public      | selfservice_recovery_flow_methods     | table | root  |                   0
  public      | selfservice_recovery_flows            | table | root  |                   0
  public      | selfservice_registration_flow_methods | table | root  |                   0
  public      | selfservice_registration_flows        | table | root  |                   0
  public      | selfservice_settings_flow_methods     | table | root  |                   0
  public      | selfservice_settings_flows            | table | root  |                   0
  public      | selfservice_verification_flow_methods | table | root  |                   0
  public      | selfservice_verification_flows        | table | root  |                   0
  public      | sessions                              | table | root  |                   0
(23 rows)

Time: 2.109s total (execution 2.109s / network 0.001s)

we would not expect such a significant increase in latency.

Additional data / screenshots

You can parse the log file and output it as CSV using

echo "migration,execution" > out.csv
cat migrate.log | grep seconds | awk '{print NR,$3}' | sed -e 's/msg=//' | sed -e 's/ /,/' >> out.csv

which in our case shows polynomial (2nd degree) / exponential growth with R^2 of almost 1. I suspect we'll get R^2 > 1 at some point:

[Chart: execution time vs. migration number]

All other metrics point towards the same issue - I have therefore included screenshots from all metrics dashboards:

[Screenshots of the Usage (530MB), Overview, Hardware, Runtime, SQL, Storage, Replication, Distributed, Queues, Slow Requests, and Changefeeds dashboards]

Environment:

  • CockroachCloud 20.2.7
  • CockroachDB 20.2.7
  • Server OS: macOS / Docker, CockroachCloud
  • Client app: Any

Additional context

We are heavily impacted by this issue and it has been a month-long effort to somehow get multi-tenancy to work in CRDB. The current effects are:

  • Our production system, which was scheduled to launch last month, could not launch due to this issue and #60724 (Slow (>50s) and failing schema change transactions).
  • CockroachDB does not offer Row Level Security (see AWS blog post) making shared-table ("Pool") multi-tenant models impossible to implement, especially for security-sensitive applications like ours.
  • On CockroachCloud ("production") we observe worse and worse query performance (up to 60s for a simple SELECT) with memory usage capping out at around 600 databases.
  • Running a multi-node setup (which we do in Cockroach Cloud) makes it worse.
  • If the time it takes to spin up a tenant increases with the number of tenants in the system, we cannot make release announcements on e.g. Hacker News (which regularly features Ory on the front page), as the traffic spikes would completely kill the system.

This currently ruins our economic model where we would like to offer a "free tier" with strong usage limits that developers can use to try out the system, as a $1200 p/mo CockroachCloud cluster can only handle around 300-400 tenants without them having any data!

We were also planning on requesting a multi-region cluster setup in CockroachCloud, which we expect to become even more expensive. All of this leads us to conclude that CockroachDB is currently not able to support multi-tenant systems in an economical fashion.

While we are big fans of this technology and want to introduce it to our 30k+ deployments, especially with the new releases of https://github.com/ory/hydra, https://github.com/ory/keto, and https://github.com/ory/kratos, we are left with a rather frustrating experience of going back and forth: being told to try out different things and, in the end, hitting a roadblock very quickly.

I have also tried to create a "debug zip" but unfortunately it cannot be generated due to query timeouts:

23 tables found
requesting table details for tenant_103.public.schema_migration... writing: debug/schema/tenant_103/public_schema_migration.json.err.txt
  ^- resulted in operation "requesting table details for tenant_103.public.schema_migration" timed out after 10s
requesting table details for tenant_103.public.identities... writing: debug/schema/tenant_103/public_identities.json.err.txt
  ^- resulted in operation "requesting table details for tenant_103.public.identities" timed out after 10s
requesting table details for tenant_103.public.identity_credential_types... writing: debug/schema/tenant_103/public_identity_credential_types.json.err.txt
  ^- resulted in operation "requesting table details for tenant_103.public.identity_credential_types" timed out after 10s
requesting table details for tenant_103.public.identity_credentials... writing:

...

I then killed all workloads and re-ran debug.zip, but it was taking painfully long to collect the table details (one database with 23 tables takes about 30 seconds), so I aborted the process.

Comparison to PostgreSQL

To compare, with PostgreSQL

docker rm -f kratos_test_database_postgres || true
docker run -m 4GB --cpus=2  --name kratos_test_database_postgres -p 3445:5432 -e POSTGRES_PASSWORD=secret -e POSTGRES_DB=postgres -d postgres:13.2 postgres

sleep 5

rm migrate-psql.log
export LOG_LEVEL=debug
export DSN=postgres://postgres:secret@127.0.0.1:3445
for i in {1..1000}
do
    docker exec kratos_test_database_postgres psql -U postgres -c "CREATE DATABASE tenant_$i"
    ./kratos migrate sql --yes "$DSN/tenant_$i?sslmode=disable" >> migrate-psql.log 2>&1
done

the container stays below 20% CPU usage

[Screenshot: container CPU usage]

requires less than 1/10th of the time to execute the statements:

cat migrate-psql.log | grep seconds | awk '{print NR,$3}' | sed -e 's/msg=//' | sed -e 's/ /,/'
1,0.8378
2,0.7610
3,0.7952
4,0.7446
5,0.8967
6,0.7752
7,0.8294
8,0.7808
9,0.9094

and does not show any significant change in execution time (it looks like O(1)); the spikes correlate with me closing browser tabs and using Google Sheets in Safari:

[Chart: execution time vs. migration number (PostgreSQL)]

Jira issue: CRDB-6472

@aeneasr aeneasr added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 7, 2021
blathers-crl bot commented Apr 7, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Apr 7, 2021
aeneasr (Author) commented Apr 7, 2021

I ran the workload to completion (in ~1000 seconds) on PostgreSQL, which is now sitting at ~2GB RAM and ~11% CPU when idling with 1000 databases, with O(1) complexity:

[Chart: execution time vs. migration number (PostgreSQL, full run)]

aeneasr (Author) commented Apr 7, 2021

I tried out CockroachDB v19.2.12 to see if it behaves differently from 20.2 but it appears to have the same issue:

[Chart: execution time vs. migration number (CockroachDB v19.2.12)]

hand-crdb (Contributor) commented:

...in our case shows polynomial (2nd degree) / exponential growth with R^2 of almost 1. I suspect we'll get R^2 > 1 at some point

Nice data analysis. R^2 refers to the goodness-of-fit rather than the growth factor. Being close to 1 means that the trend line is a good fit to the data. As a googled reference says, "Essentially, an R-Squared value of 0.9 would indicate that 90% of the variance of the dependent variable being studied is explained by the variance of the independent variable." So R-squared can never be > 1
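
For reference, the usual definition (for an ordinary least-squares fit with an intercept) is

    R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

and since the residual sum of squares in the numerator can never exceed the total sum of squares in the denominator for such a fit, R^2 always lands between 0 and 1.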

ajwerner (Contributor) commented Apr 7, 2021

At its core, this relates to the zone config infrastructure: any change to anything in the schema is an O(descriptors) operation on behalf of the writer, and it then triggers an update on each range which is O(ln(descriptors)). We have a string of scattered conversations about this on GitHub, but the fact of the matter is that this is reasonably well understood and has long been deferred in favor of other work. We've tended to optimize to make things scale in the context of few tables with lots of data. Obviously that won't fly forever and I think this is part of a collective kick I deeply appreciate. The fact that the execution time increases slightly super-linearly is also not so surprising.
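
As a rough way to watch that O(descriptors) factor grow while the reproduction above runs (a sketch; system tables are not a stable API and may differ between versions):

-- Every database, schema, and table adds a row here:
SELECT count(*) FROM system.descriptor;
-- The zone configurations the SQL layer has to propagate to the KV layer:
SHOW ALL ZONE CONFIGURATIONS;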

Likely the most relevant conversation on GitHub is #47150 (comment).

I also recently advocated in this direction by posting #62128.

At the end of the day, you're butting up against a very real limitation of the architecture by which the SQL layer propagates changes which may impact configuration to the KV layer. It's far from a fundamental limitation but it's one that will take some engineering to get out of. I cannot promise you when we're going to do something about it but we very much want to do something about it.

@aeneasr aeneasr changed the title from "Polynomial / exponential latency growth with R^2 ~0.97 for database isolated ("Bridge") multi-tenant models" to "Polynomial / exponential latency growth for database isolated ("Bridge") multi-tenant models" on Apr 7, 2021
ajwerner (Contributor) commented Apr 7, 2021

I just want to say, I appreciate all the time and effort you put into this analysis and integration! It's inspiring and is the sort of engagement that motivates me at the start of the day. Sorry we've let you down thus far. I look forward to finding paths forward. This thread is making me wonder whether we should be looking for short-term, pragmatic gains here that have tradeoffs we've thus far been unhappy accepting in order to unblock use-cases which we'll fix more generally in the longer term.

aeneasr (Author) commented Apr 7, 2021

Nice data analysis. R^2 refers to the goodness-of-fit rather than the growth factor. Being close to 1 means that the trend line is a good fit to the data. As a googled reference says, "Essentially, an R-Squared value of 0.9 would indicate that 90% of the variance of the dependent variable being studied is explained by the variance of the independent variable." So R-squared can never be > 1

My bad, I thought it meant something else but it makes sense in this context. I've updated the title accordingly! At least we know that the exponential model fits very well now 😅

At the end of the day, you're butting up against a very real limitation of the architecture by which the SQL layer propagates changes which may impact configuration to the KV layer. It's far from a fundamental limitation but it's one that will take some engineering to get out of. I cannot promise you when we're going to do something about it but we very much want to do something about it.

Thank you @ajwerner, always appreciate your responses! It has sort of settled on me that this is looking like an architectural limitation, as I tried out more and more setups and repro cases. We went on a call today with Daniel, and while there might be some ideas to hotfix / work around things by defining different range sizes, it sounded like this would only treat the symptoms and stretch out the time a bit until we hit a wall. I understand that restructuring this layer needs a lot of hard work, but reading between the lines I believe this is the direction you want to go in anyway, as it will improve your economics as well.

I just want to say, I appreciate all the time and effort you put into this analysis and integration! It's inspiring and is the sort of engagement that motivates me at the start of the day. Sorry we've let you down thus far. I look forward to finding paths forward. This thread is making me wonder whether we should be looking for short-term, pragmatic gains here that have tradeoffs we've thus far been unhappy accepting in order to unblock use-cases which we'll fix more generally in the longer term.

Thank you - this is the vibe I've been getting from our interactions and why I really want to make this work with our system and (sometimes ;) ) enjoy spending some late nights. I think you're building something great here because so many things are so much ahead of the competition that it would be almost a crime not to use it. I think it would be tremendously helpful to find something to make it work somehow, even if it is in a controlled way (e.g. no surprise 10.000k new tenants in 24 hours but instead maybe max 10.000k larger tenants per cluster).

As far as I understood you are working on a multi-tenant system for CockroachCloud Free so I was wondering if there could be some alignment as to which approach you're taking, so that we can prepare and align there also. Generally, I think the "bridge" model we chose has both advantages and disadvantages but at scale it appears that industry consensus ("Google") favors it over Pool models. Much of the architecture design we chose is from e.g. Google Zanzibar paper which heavily uses Cloud Spanner and - for example - advocates for namespace / table-prefix type of isolation. From a data privacy, ACL, and operations model it kind of makes sense to me.

I'm wondering if you're on that avenue also for Cockroach Cloud Free, or if you're choosing a different model (e.g. Pool / Row Level Isolation), and if so, I would like to understand the reasons why.

ajwerner (Contributor) commented Apr 7, 2021

As far as I understood you are working on a multi-tenant system for CockroachCloud Free so I was wondering if there could be some alignment as to which approach you're taking, so that we can prepare and align there also. Generally, I think the "bridge" model we chose has both advantages and disadvantages but at scale it appears that industry consensus ("Google") favors it over Pool models.

Indeed there is deep alignment between the multi-tenant design and this problem. We have chosen, in a sense, the same fundamental "bridge" architecture. Each tenant represents a portion of the keyspace under a prefix. The way we made that work, in a sense, is by cheating. In our current multi-tenant system, we make the KV layer blind to anything going on within a tenant. This means we give up entirely on the ability to configure placement, size, constraints, GC TTL, etc. within a tenant. This is a big problem for our long term vision on lots of fronts. You should find these problems outlined in the RFC PR attached above.

A different source of scalability problems will be the number of ranges. At some number of tenants (and thus ranges) we would have a problem because things like processing queues etc would take too long. However, the root of the problem we're seeing here isn't the number of ranges. Very little in cockroach requires O(ranges) work. Even if it did, nothing latency sensitive and O(ranges) should be synchronous. My sense is that the bottleneck in that world would likely be things like placement decisions and those sorts of subsystems. I think other systems have proven that that sort of work can scale to at least millions. The more pressing problem is doing O(N) work in response to any user action, be it synchronously or asynchronously.
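
For a rough sense of that scale, the current range count can be checked with something like the following (a sketch; crdb_internal is not a stable interface):

SELECT count(*) FROM crdb_internal.ranges;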

I'm not opposed to row-level security and, if I'm being honest, it feels like a lower lift in terms of implementation than fixing this architectural problem, but it's not a fundamental solution. Say you did partition each table by tenant and then had an independent configuration and RLS per tenant, because you'd like to place each partition of each table in a different region. In that case we're back in the same pickle but using even less efficient abstractions (our subzones are just nested protos inside a proto, so even if we did optimize per-element changes, we'd now be propagating big changes to sets of subzones). My feeling is we should do both things.
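
For illustration, the partition-by-tenant layout described above would look roughly like this (a sketch; the table, tenant, and region names are made up, and PARTITION BY plus per-partition zone configs require an enterprise license):

CREATE TABLE identities (
    tenant_id STRING NOT NULL,
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    traits JSONB,
    PRIMARY KEY (tenant_id, id)
) PARTITION BY LIST (tenant_id) (
    PARTITION tenant_1 VALUES IN ('tenant_1'),
    PARTITION tenant_2 VALUES IN ('tenant_2')
);

-- Giving each tenant's partition its own placement is exactly what creates one subzone per tenant:
ALTER PARTITION tenant_1 OF TABLE identities
    CONFIGURE ZONE USING constraints = '[+region=us-east1]';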


Much of the architecture design we chose is from e.g. Google Zanzibar paper which heavily uses Cloud Spanner and - for example - advocates for namespace / table-prefix type of isolation. From a data privacy, ACL, and operations model it kind of makes sense to me.

Complete aside, I really like that paper. That being said, it's a pretty advanced use case. Propagating causality tokens to objects in order to optimize authn latency for abstract policies is sweet, but feels like a heavy lift for a lot of application developers. The concept of latency-optimized bounded staleness has come up a number of times in recent months. If you have feelings about what would be a good interface here, I'd love to hear about it. There was a bit of chatter publicly about this here: #45243 (comment)

zepatrik commented Apr 7, 2021

I like your summary:

I suspect clients generally would prefer to indicate their largest amount of tolerated staleness and then have the system provide the most up-to-date view which can be provided with a local read. This functionality is a Spanner feature and is key for unlocking both correctness and low latency for Zanzibar. This interface would likely prove both more valuable to end users and more possible to implement with reasonable guarantees.

From my understanding of Zanzibar, this is not really a Spanner feature, but rather the GPS time sync within Google's data centers. It allows them to use the current time and compare it to a timestamp stored alongside the object as well as the relation tuple. Spanner just allows follower reads, and the caches in the Zanzibar nodes also store these timestamps. When evaluating a request, Zanzibar can check whether it has all the necessary data locally/regionally; otherwise it has to handle the request accordingly.

I guess the only other alternative to guarantee such consistency is using some kind of logical timestamp, as we really only care about the "older than" relation rather than the actual time. The reason Google is using actual timestamps is that they have them available anyway, but it is not strictly required to use them.
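
For what it's worth, the closest existing primitive in CockroachDB today is a follower read, roughly like this (a sketch; relation_tuples and its columns are a hypothetical Zanzibar-style table, and follower reads currently require an enterprise license):

SELECT object, relation, subject
FROM relation_tuples AS OF SYSTEM TIME follower_read_timestamp()
WHERE object = 'doc:readme';

This gives a consistent-but-stale snapshot that a nearby replica can serve; it is not yet the "most recent data within a staleness bound" interface described above.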

Also, see my thoughts in this issue ory/keto#517

knz (Contributor) commented Apr 9, 2021

There are two things I was surprised to not see discussed here:

  • why is the user not leveraging our multi-tenant infrastructure in the first place? (I mean on their own deployment, not CockroachCloud) I acknowledge it's not documented yet, but this is more or less exactly the problem it is meant to solve.

  • wouldn't it make sense to offer these developers accounts on CockroachCloud free tier instead?

knz (Contributor) commented Apr 9, 2021

@ajwerner probably a PM or two need to see this (but I don't know which yet)

@ajwerner ajwerner changed the title from "Polynomial / exponential latency growth for database isolated ("Bridge") multi-tenant models" to "Polynomial latency growth for database isolated ("Bridge") multi-tenant models" on Apr 21, 2021
@jlinder jlinder added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 16, 2021
mukuljp commented Jul 5, 2021

@aeneasr @ajwerner We are also facing similar issues. In our multi-tenant system we have 200+ databases.
We started with v19.1.5 and performance was better until we migrated to v20.1.17.
A query which usually takes half a second on v19.1.5 is taking minutes to complete on v20.1.17.

Issues

  • High latency
  • Slow queries, taking minutes to complete (for example, a "show tables;" query in any database takes a minute or more)
  • Memory always above 80 or 90 percent on the nodes
  • The DBeaver client takes a significant amount of time to load data when connected to the cluster

Is there any hotfix or configuration change that we could apply to make this better?
Thanks

ajwerner (Contributor) commented Mar 8, 2022

This should now be fixed in 22.1 🎉

@ajwerner ajwerner closed this as completed Mar 8, 2022
aeneasr (Author) commented Mar 8, 2022

:O that's awesome!

matdehaast commented:

@aeneasr I'm curious if you tested this with the new version of cockroach?

aeneasr (Author) commented May 12, 2022

We abandoned this approach and unfortunately have no way to reproduce the findings easily from back then :/

ajwerner (Contributor) commented:

We've tested it and can confirm that the latency no longer grows linearly (or superlinearly) with the number of objects.

Fontinalis commented:

We abandoned this approach and unfortunately have no way to reproduce the findings easily from back then :/

@aeneasr Can I ask what approach you decided to implement and use that had the required performance? We're facing a similar issue and we're in the design phase, so I would love to hear your input on this matter.

@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023