
Async clickhouse migrations #5433

Open
srikanthccv opened this issue Jul 5, 2024 · 1 comment

Assignees: srikanthccv

srikanthccv (Member) commented Jul 5, 2024

Any release that involves a schema migration that mutates existing data, such as dropping a column or an index, frequently causes the migrations to fail. This leaves a dirty version behind that requires manual intervention. That does not scale when there are hundreds of tenants, and it is not good for our OSS users, who don't know how to address the issue. The collector does not get upgraded when the migrator fails.
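
For reference, the problematic steps are mutating ALTERs of roughly the shape below. This sketch is illustrative only, not the actual migration code: it assumes the clickhouse-go v2 client and hypothetical table/index names (signoz_logs.logs, body_idx).

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	// Hypothetical connection details; not the real deployment config.
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr:        []string{"127.0.0.1:9000"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// DROP INDEX and DROP COLUMN are implemented as mutations in ClickHouse,
	// so on a large table the migrator can sit on these statements long
	// enough to hit its timeout and leave the schema half-applied.
	statements := []string{
		`ALTER TABLE signoz_logs.logs DROP INDEX body_idx`,
		`ALTER TABLE signoz_logs.logs ADD INDEX body_idx lower(body) TYPE ngrambf_v1(4, 60000, 5, 0) GRANULARITY 1`,
	}
	for _, stmt := range statements {
		if err := conn.Exec(ctx, stmt); err != nil {
			// A failure or timeout here is what leaves schema_migrations
			// dirty while the mutation itself stays queued on the server.
			log.Fatalf("migration step failed: %v", err)
		}
	}
}
```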

  • Collector insertions fail because the half-finished schema is not compatible with the old collector version.
  • The migrator continues to fail because of the dirty version.
  • When we drop the schema_migrations table to clear the dirty version, re-running the migrations triggers all of the past mutations again. For example, the 10th logs migration drops the original tokenbf index and creates a new ngram index. Now, say some migration after the 10th fails; the way we resolve this today is to drop the schema_migrations table and run everything again, so the 10th migration runs once more, dropping the (now ngram) index and recreating it. The migrator can fail at this step itself if dropping the index takes longer than 180 seconds.
  • The triggered mutations block even DDL queries as simple as CREATE DATABASE.
  • We then intervene to kill the mutations, but they are not guaranteed to be killed immediately, so ingestion is invariably affected in the meantime (the manual recovery is sketched after this list).
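
The manual recovery we fall back on today looks roughly like the sketch below: inspect the stuck mutations, kill them, and then clear the dirty version before re-running the migrator. This is only a sketch assuming clickhouse-go v2 and illustrative database/table names; how the dirty flag is actually cleared depends on the migrator's schema_migrations table, so that step is left as a comment.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{"127.0.0.1:9000"}})
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// 1. See which mutations are still running and why they keep failing.
	rows, err := conn.Query(ctx, `
		SELECT database, table, mutation_id, command, parts_to_do, latest_fail_reason
		FROM system.mutations
		WHERE is_done = 0`)
	if err != nil {
		log.Fatal(err)
	}
	for rows.Next() {
		var db, table, id, cmd, reason string
		var partsToDo int64
		if err := rows.Scan(&db, &table, &id, &cmd, &partsToDo, &reason); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s.%s %s (%d parts left): %s | %s\n", db, table, id, partsToDo, cmd, reason)
	}
	rows.Close()

	// 2. Kill the stuck mutations. This is itself asynchronous: ClickHouse
	// only marks them as killed, which is why ingestion stays affected for a
	// while even after we intervene.
	if err := conn.Exec(ctx, `KILL MUTATION WHERE database = 'signoz_logs' AND table = 'logs'`); err != nil {
		log.Fatal(err)
	}

	// 3. The dirty version in the migrator's schema_migrations table still
	// has to be cleared (or the table dropped) before re-running migrations;
	// how exactly depends on the migration tool and table engine, so it is
	// not shown here.
}
```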

Our internal failures are known from the recent 0.49.1, but the same happened with the 0.47 traces migration too, and there are past instances of community users getting affected by this as well.
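
For what the title is asking for, one possible shape of an async migration step is sketched below: submit the mutating ALTER without waiting on it (mutations_sync = 0, the ClickHouse default) and check system.mutations for completion out of band, so the migrator records the version and exits instead of blocking the collector upgrade. This is only a sketch of the idea, not a committed design; the clickhouse-go v2 usage and table/index names are assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2"
	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// waitForMutations polls system.mutations until nothing is pending for the
// given table. In the async model this runs out of band (or at the start of
// the next mutating migration), not inside the migration step itself.
func waitForMutations(ctx context.Context, conn driver.Conn, database, table string) error {
	for {
		var pending uint64
		row := conn.QueryRow(ctx, `
			SELECT count()
			FROM system.mutations
			WHERE database = ? AND table = ? AND is_done = 0`, database, table)
		if err := row.Scan(&pending); err != nil {
			return err
		}
		if pending == 0 {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
}

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{"127.0.0.1:9000"}})
	if err != nil {
		log.Fatal(err)
	}

	// mutations_sync = 0 (the ClickHouse default) makes the ALTER return as
	// soon as the mutation is queued, so the migrator can mark the version as
	// applied and exit instead of holding up the collector upgrade.
	ctx := clickhouse.Context(context.Background(),
		clickhouse.WithSettings(clickhouse.Settings{"mutations_sync": 0}))
	if err := conn.Exec(ctx, `ALTER TABLE signoz_logs.logs DROP INDEX body_idx`); err != nil {
		log.Fatal(err)
	}

	// Shown inline here for completeness; the point is that this wait happens
	// outside the migrator's critical path.
	if err := waitForMutations(ctx, conn, "signoz_logs", "logs"); err != nil {
		log.Fatal(err)
	}
}
```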


request-info bot commented Jul 5, 2024

We would appreciate it if you could provide us with more info about this issue/pr!

srikanthccv self-assigned this Jul 5, 2024