
Roadmap 2022 (discussion) #32513

Closed · alexey-milovidov opened this issue Dec 10, 2021 · 169 comments

@alexey-milovidov (Member) commented Dec 10, 2021

This is the ClickHouse open-source roadmap for 2022.
Descriptions and links are to be filled in.

This roadmap does not cover the tasks related to infrastructure, orchestration, documentation, marketing, integrations, SaaS, drivers, etc.

See also:

Roadmap 2021: #17623
Roadmap 2020: in Russian

Main Tasks

✔️ Make clickhouse-keeper Production Ready

✔️ It is already feature-complete and being used in production.
✔️ Update documentation to replace ZooKeeper with clickhouse-keeper everywhere.

✔️ Support for Backup and Restore

✔️ Backup of tables, databases, servers and clusters.
✔️ Incremental backups. Support for partial restore.
✔️ Support for pluggable backup storage options.

✔️ Semistructured Data

✔️ JSON data type with automatic type inference and dynamic subcolumns (see the sketch after this list).
✔️ Sparse column format and optimization of functions for sparse columns. #22535
Dynamic selection of column format - full, const, sparse, low cardinality.
✔️ Hybrid wide/compact data part format for huge number of columns.
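
For the JSON data type above, a minimal sketch of the usage (assuming a version where the type is still experimental; the table, column names, and sample document are illustrative):

```sql
-- Enable the experimental JSON/Object type.
SET allow_experimental_object_type = 1;

CREATE TABLE events (data JSON) ENGINE = MergeTree ORDER BY tuple();

INSERT INTO events VALUES ('{"user": {"id": 42}, "tags": ["a", "b"]}');

-- Subcolumn types are inferred on insert, and dynamic subcolumns
-- are addressed with dot notation.
SELECT data.user.id FROM events;
```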

✔️ Type Inference for Data Import

✔️ Allow to skip column names and types if data format already contains schema (e.g. Parquet, Avro).
✔️ Allow to infer types for text formats (e.g. CSV, TSV, JSONEachRow); see the sketch after this section.

#32455
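
A minimal sketch of both cases (file names are hypothetical): for formats with an embedded schema the structure argument can be omitted entirely, and for text formats the column types are inferred from the data.

```sql
-- Parquet carries its own schema, so no column list is needed.
SELECT * FROM file('hits.parquet') LIMIT 10;

-- For text formats, types are inferred from the values;
-- DESCRIBE shows what was inferred.
DESCRIBE file('hits.csv', CSVWithNames);
```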

Support for Transactions

Atomic insert of more than one block or to more than one partition into MergeTree and ReplicatedMergeTree tables (a sketch of the syntax follows this list).
Atomic insert into table and dependent materialized views. Atomic insert into multiple tables.
Multiple SELECTs from one consistent snapshot.
Atomic insert into distributed table.
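
A sketch of the experimental transaction syntax (it requires enabling experimental transaction support in the server configuration, and the exact set of atomic operations evolved over the year; the table name is hypothetical):

```sql
BEGIN TRANSACTION;

-- Both inserts become visible atomically on commit.
INSERT INTO t VALUES (1), (2);
INSERT INTO t VALUES (3);

-- SELECTs inside the transaction read from one consistent snapshot.
SELECT count() FROM t;

COMMIT; -- or ROLLBACK to discard the changes
```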

✔️ Lightweight DELETE

✔️ Make mutations more lightweight by using delete masks.
✔️ It won't enable frequent UPDATE/DELETE like in OLTP databases, but it will bring ClickHouse closer to that (see the sketch below).
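
A minimal sketch of the resulting syntax (assuming a version where the feature is gated behind an experimental setting; the table name is hypothetical):

```sql
SET allow_experimental_lightweight_delete = 1;

-- Rows are masked as deleted immediately and physically removed by later merges.
DELETE FROM hits WHERE CounterID = 42;
```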

SQL Compatibility Improvements

✔️ Untangle name resolution and query analysis.
Initial support for correlated subqueries.
✔️ Allow using window functions inside expressions.
✔️ Add compatibility aliases for some window functions, etc.
✔️ Support for GROUPING SETS (sketched below).
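
For GROUPING SETS, a minimal sketch (table and column names are hypothetical):

```sql
-- One pass over the data produces per-region subtotals, per-year subtotals,
-- and a grand total (the empty grouping set).
SELECT region, year, sum(amount) AS total
FROM sales
GROUP BY GROUPING SETS ((region), (year), ());
```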

JOIN Improvements

Support for join reordering.
Extend the cases when condition pushdown is applicable.
Convert anti-join to NOT IN.
✔️ Use table sorting for DISTINCT optimization.
✔️ Use table sorting for merge JOIN.
✔️ Grace hash join algorithm (see the sketch after this list).
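
The grace hash algorithm is chosen per query via a setting; a minimal sketch (table names are hypothetical):

```sql
-- Grace hash join partitions both sides into buckets and spills them to disk,
-- so joins larger than memory can complete at the cost of extra I/O.
SET join_algorithm = 'grace_hash';

SELECT count()
FROM big_left AS l
INNER JOIN big_right AS r ON l.key = r.key;
```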

Resource Management

✔️ Memory overcommit (soft and hard memory limits); see the sketch after this list.
Enable external GROUP BY and ORDER BY by default.
✔️ IO operations scheduler with priorities.
✔️ Make scalar subqueries accountable.
CPU and network priorities.
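
For memory overcommit, a sketch of the relevant knobs (setting names as in the documentation; the values are illustrative):

```sql
-- With overcommit, a query may exceed its soft limit while the server has
-- memory to spare; the "denominator" settings act as the soft limits used
-- to pick a victim query when memory has to be reclaimed.
SET memory_overcommit_ratio_denominator = 1073741824;          -- global soft limit, 1 GiB
SET memory_overcommit_ratio_denominator_for_user = 1073741824; -- per-user soft limit, 1 GiB
```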

Separation of Storage and Compute

✔️ Parallel reading from replicas (see the sketch after this list).
✔️ Dynamic cluster configuration with service discovery.
✔️ Caching of data from object storage.
Simplification of ReplicatedMergeTree.
✔️ Shared metadata storage.
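
Parallel reading from replicas is enabled per query with settings along these lines (experimental at the time; the table name is hypothetical):

```sql
-- Each replica processes a distinct share of the data for the same query.
SET allow_experimental_parallel_reading_from_replicas = 1;
SET max_parallel_replicas = 3;

SELECT count() FROM distributed_table;
```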

Experimental and Intern Tasks

Streaming Queries

Fix POPULATE for materialized views.
Unification of materialized views, live views and window views.
Allow to set up subscriptions on top of all tables including Merge, Distributed.
✔️ Normalization of Kafka tables with storing offsets in ClickHouse.
Support for exactly once consumption from Kafka, non-consuming reads and multiple consumers.
Streaming queries with GROUP BY, ORDER BY with windowing criteria.
Persistent queues on top of ClickHouse tables.

Integration with ML/AI

🗑️ Integration with Tensorflow
🗑️ Integration with MADLib

GPU Support

🗑️ Compile expressions to GPU

Unique Key Constraint

User-Defined Data Types

Incremental aggregation in memory

Key-value data marts

Text Classification

Graph Processing

Foreign SQL Dialects in ClickHouse

🗑️ Support for MySQL dialect or Apache Calcite as an option.

✔️ Batch Jobs and Refreshable Materialized Views

✔️ Embedded ClickHouse Engine

Data Hub

Build And Testing Improvements

Testing

✔️ Add tests for AArch64 builds.
✔️ Automated tests for backward compatibility.
Server-side query fuzzer for all kinds of tests.
✔️ Fuzzing of query settings in functional tests.
SQL function-based fuzzer.
Fuzzer of data formats.
Integrate with SQLogicTest.
Import obfuscated queries from Yandex Metrica.

Builds

✔️ Docker images for AArch64.
✔️ Enable missing libraries for AArch64 builds.
✔️ Add and explore Musl builds.
Build all libraries with our own CMake files.
Embed root certificates to the binary.
Embed DNS resolver to the binary.
Add ClickHouse to Snap, so people will not install obsolete versions by accident.

@ramazanpolat (Contributor)

Pls don't mind me here.
I'm just reserving my spot for update notifications.

@Slach (Contributor) commented Dec 10, 2021

@ramazanpolat you can do it via the "Subscribe" button:
[screenshot: the Subscribe button]

@alanpaulkwan

What would the embedded ClickHouse engine look like? Would it involve self-contained DB files and instances like DuckDB? That would be pretty great; it would make ClickHouse a good choice for one-off, self-contained projects.

@alexey-milovidov (Member, Author) commented Dec 10, 2021

@alanpaulkwan

Yes, something like clickhouse-local but embedded in a Python module and with some additional support for dataframes. Pretty similar to DuckDB :) It should also leverage "Type Inference for Data Import".

PS. clickhouse-local already does most of this. With recent ClickHouse versions, if I need to check some queries quickly, I just type clickhouse-local, create tables, and run queries interactively.
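
For illustration, a sketch of that interactive workflow (the file name is hypothetical): start clickhouse-local with no arguments, then:

```sql
-- Load a local file into an in-memory table; the schema is inferred.
CREATE TABLE t ENGINE = Memory AS
SELECT * FROM file('sample.csv', CSVWithNames);

SELECT count() FROM t;
```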

@alanpaulkwan

@alexey-milovidov thanks! I'm aware of clickhouse-local. One advantage DuckDB has over ClickHouse for me is working with Parquet files: ad hoc queries are impractical when the schema must be specified up front, so I'm also excited for type inference.

Will the embedded module also be designed to work well with R? Don't discriminate against us R users please :(

@alexey-milovidov (Member, Author)

Yes, Python first and R next. This task is in the "experimental" category, so it will be implemented by a developer outside of the main team (by @LGrishin). For tasks in that category we usually have a prototype available in summer.

@yiguolei (Contributor)

Hi @alexey-milovidov,
I see there was a plan for workload management to deal with concurrency issues in 2021, but it has disappeared in 2022. Why not finish it?

@alexey-milovidov (Member, Author) commented Dec 13, 2021

@yiguolei It is named Resource Management in the roadmap.

Yes, it was also on the 2021 roadmap, but we have only started implementing it:

  • (done) an interface for IO schedulers;
  • (done) removing DataStreams in favor of Processors;
  • (in progress) memory overcommit with soft/hard limits;

So, most of the work is expected in 2022.

@yiguolei (Contributor) commented Dec 13, 2021

@alexey-milovidov What about concurrency management? I think too many threads are created when there are many concurrent queries. Any progress on this?

@alexey-milovidov (Member, Author)

It is going to be solved by one of the subtasks, a common data processing pipeline for the server; the task is being implemented by @KochetovNicolai.

@Zhile commented Dec 13, 2021

How about user-defined aggregate functions? Or user-defined table functions like Snowflake's: https://docs.snowflake.com/en/developer-guide/udf/java/udf-java-tabular-functions.html
These can help users process blocks of data and output a single-row result.

@Zhile commented Dec 13, 2021

Also this big task
#23194

@alexey-milovidov (Member, Author) commented Dec 13, 2021

@Zhile

> How about user-defined aggregate functions? Or user-defined table functions

We already have user defined table functions, since version 21.10.
They allow custom data generation, transformation, aggregation and even joining with user-defined programs.
See https://presentations.clickhouse.com/meetup56/new_features/
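
As an illustration, a sketch of such a user-defined table function via the executable mechanism (the script name and schema are hypothetical; the script is expected to live in the configured user_scripts directory and write rows to stdout in the declared format):

```sql
-- Runs user_script.py and parses its stdout as TSV with the given structure.
SELECT * FROM executable('user_script.py', TSV, 'id UInt64, value String');
```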

For user-defined aggregate functions it's more difficult; we'll see...

> Also this big task #23194

This is №1 in:

SQL Compatibility Improvements

  • Untangle name resolution and query analysis.

@Zhile commented Dec 13, 2021

@alexey-milovidov Thanks for the explanation. I'm looking forward to these new changes in ClickHouse and hope it keeps getting better!

@cmsxbc (Contributor) commented Dec 13, 2021

@alexey-milovidov

> SQL Compatibility Improvements
> Untangle name resolution and query analysis

Is this the limitation that currently prevents using recursive UDFs?

@javisantana (Contributor)

> Streaming queries with GROUP BY, ORDER BY with windowing criteria.

Wondering if this is somehow related to how KSQL does things, e.g.:

```sql
SELECT ...
FROM orders o
    INNER JOIN payments p WITHIN 1 HOURS ON p.id = o.id
```

I've found it quite hard to work with streaming-like data, especially streams that need to be joined with themselves. Materialized views are one way to do it, but they don't support self-joins. You can do it with Null table hacks, but the materialized view insertion order (or rather, the lack of one) is still a problem.

Not sure if there is a better way to do this that you are thinking about.

@UnamedRus (Contributor)

@javisantana

You can look into: #8331

It's already merged.

@alexey-milovidov (Member, Author)

@cmsxbc Yes, recursive SQL UDFs are difficult; most likely we will not be able to support them in the coming months.

@bputt commented Dec 14, 2021

Would Unique Key Constraint allow us to no longer worry about preventing duplicates from being inserted?

@kiwimg commented Dec 14, 2021

MaterializedMySQL should support filtering which tables to replicate, instead of taking all tables in the database.

@Vxider (Contributor) commented Dec 15, 2021

Maybe we can implement Batch Jobs and Refreshable Materialized Views using time window functions? I think the difference between a batch job in materialized views and a streaming job in window views is whether the intermediate state is calculated and stored. We could implement the batch job by removing the calculation of intermediate state in Window View, and just use processing time with the time window function to trigger windows (a sketch of such a window follows).
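
For reference, a sketch of a processing-time tumbling window in the experimental window view (names are hypothetical):

```sql
SET allow_experimental_window_view = 1;

-- Counts events per 10-second processing-time window; each window's result
-- is written to the target table when the window fires.
CREATE WINDOW VIEW wv TO results AS
SELECT count(id) AS cnt, tumbleStart(w) AS window_start
FROM events
GROUP BY tumble(now(), INTERVAL '10' SECOND) AS w;
```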

@bkuschel (Contributor)

Could these be feasible?

Array Join: #8687
Full text search: #19970

@simpl1g (Contributor) commented Dec 21, 2021

Any work on Subpartition/Dynamic Partition planned?

#8089
#13826
#16565
#18695

@javisantana (Contributor)

@UnamedRus thanks, do you have a "real world" example? The PR lacks documentation right now, so it'd be nice to see how you are using it.

@Vxider (Contributor) commented Dec 22, 2021

@javisantana The WindowView documentation has been added here.

@javisantana (Contributor) commented Dec 22, 2021

@Vxider thanks. Referring to my original comment, I don't see how this window MV solves the problem of generating an MV from stream data that needs to be joined with itself (or other streams) to get the previous state of incoming entities (more along these lines: https://calcite.apache.org/docs/stream.html#joining-streams-to-streams).

@XuJia0210 (Contributor)

#40419

That sounds good. I'm looking forward to the release.

@shadowDy commented Oct 9, 2022

@alexey-milovidov Hi, is there any design document or PR about parallel reading from replicas?

@vkingnew commented Nov 5, 2022

1. SQL functions for compatibility with MySQL: how is the progress, and is it fully compatible?
2. Has the latest ClickHouse version been tested with TPC-DS?

@alexey-milovidov (Member, Author)

@shadowDy Here it is: #26748
Unfortunately, the implementation has failed; it simply does not work.

@alexey-milovidov (Member, Author)

@vkingnew

> SQL functions for compatibility with MySQL: how is the progress, and is it fully compatible?

No, we don't have any schedule for full compatibility.

> Has the latest ClickHouse version been tested with TPC-DS?

ClickHouse is not yet able to run the full TPC-DS suite.

@nvartolomei (Contributor) commented Nov 6, 2022

@alexey-milovidov Can you share a tl;dr of why parallel replica reading does not work? Any lessons learned?

@alexey-milovidov (Member, Author)

@nvartolomei It is unclear to me; we need to make a setup and debug it.

@vkingnew commented Nov 8, 2022

Is lightweight UPDATE supported, like lightweight DELETE?
https://clickhouse.com/docs/en/sql-reference/statements/delete

@alexey-milovidov (Member, Author)

@vkingnew we will consider this task only after finishing the implementation of Lightweight DELETE.

@XuJia0210 (Contributor) commented Nov 28, 2022

@alexey-milovidov
Could you share an update on Shared metadata storage? Is there an estimated release date? We are looking forward to using it, even if it's experimental.

If this feature's priority is relatively low for you, i.e. it won't be implemented in the near future, our team is willing to do it. In fact, we planned to do it in 2022 Q4 and stopped after seeing that the ClickHouse community would do it.

@alexey-milovidov (Member, Author)

@ramseyxu

SharedMergeTree and Shared metadata storage are in development by @alesapin and @davenger. They are also listed in the roadmap for 2023: #44767

These features are very deeply integrated with ClickHouse Cloud, and the development happens in the private repository.

@alexey-milovidov (Member, Author)

2023 Roadmap: #44767

@alexey-milovidov (Member, Author)

@nvartolomei Here is a follow-up for parallel replicas: #43772
The initial implementation did not work well due to the uneven distribution of work between replicas.

@alexey-milovidov (Member, Author)

@vkingnew Currently, ClickHouse can run 92 of 99 TPC-DS queries with modifications.

@ucasfl (Collaborator) commented Jan 1, 2023

> @ramseyxu
>
> SharedMergeTree and Shared metadata storage are in development by @alesapin and @davenger. They are also listed in the roadmap for 2023: #44767
>
> These features are very deeply integrated with ClickHouse Cloud, and the development happens in the private repository.

@alexey-milovidov So, will this feature be open-sourced in the future?

@alexey-milovidov (Member, Author)

@ucasfl I can say that it is currently being developed in the private repo, and it is developed using the specifics of ClickHouse Cloud. There are chances we will simply merge it to the main repository, but there are also chances it will remain isolated.

@UnamedRus (Contributor)

@ramseyxu

An implementation of shared storage (HDFS & S3) and shared metadata (FDB) exists in the ClickHouse fork by ByteDance: https://github.com/ByConity/ByConity

@auxten (Member) commented Apr 18, 2023

> @alanpaulkwan
>
> Yes, something like clickhouse-local but embedded in a Python module and with some additional support for dataframes. Pretty similar to DuckDB :) It should also leverage "Type Inference for Data Import".
>
> PS. clickhouse-local already does most of this. With recent ClickHouse versions, if I need to check some queries quickly, I just type clickhouse-local, create tables, and run queries interactively.

Inspired by @alexey-milovidov, I have implemented a Python-embedded ClickHouse engine.
Please check out chDB.

@alexey-milovidov (Member, Author)

@auxten, this is great! We can incorporate the required changes into the build scripts and automate them in CI.
It would also be interesting to see comparisons from a usability perspective: how much can be done with this module, and how well it integrates with Pandas, NumPy, etc.

@alexey-milovidov (Member, Author)

@nvartolomei At this moment, parallel replicas are very close to production readiness.
They work well for both small and big queries, and scale to over 100 replicas and over 1 Tbit/sec of data read from S3.
There are still some things we want to improve before declaring it production-ready, e.g. the ability to continue if a replica goes down during the query, automatically choosing the cluster from a Distributed table, and choosing the method of work depending on the query type.

@Kaspiman commented Jun 7, 2023

> @rafael81 MaterializedMySQL is implemented for ~90%, not 100%, and it is still experimental. There are many complaints about this feature, and today no one is working on it. (Previous contributors have vanished.) The status is unclear. We will try to find a way to revive it, but there is no guarantee that it will happen.

> My team has several engineers working on MaterializedMySQL full time; we have made substantial performance and robustness improvements and intend to contribute most of it by the end of the year.

@stigsb Hello! I am glad to read that active work is underway! Were you able to release your work? Can we expect this feature to be production-ready?

@stigsb (Contributor) commented Jun 7, 2023

> @stigsb Hello! I am glad to read that active work is underway! Were you able to release your work? Can we expect this feature to be production-ready?

We are just finishing upgrading our fork to 23.3, which makes it possible for us to start breaking out PRs. We intend to release all of it during the summer.

@aadant commented Jun 8, 2023

You can also try this out. We use it in production with MySQL. The same software works with Postgres, as it uses Debezium.

https://github.com/Altinity/clickhouse-sink-connector

It supports most DDL and large databases, and it is tested with tools like gh-ost (MySQL).
Postgres replication is currently limited to data replication (a limitation of Postgres logical replication).

@sbalasa commented Feb 26, 2024

@alexey-milovidov Hi, is there a permanent setting to disable memory overcommit, both hard and soft? I tried memory_overcommit_ratio_denominator_for_user=0 in the query, but the limit is still hit.
