# Roadmap 2022 (discussion) #32513

## Comments
Please don't mind me here.
@ramazanpolat You can do it via the "Subscribe" button.
What would the embedded ClickHouse engine look like? Would it involve self-contained DB files and instances like DuckDB? That would be pretty great; it would make ClickHouse a good choice for one-off, self-contained projects.
Yes, something like clickhouse-local, but embedded in a Python module and with some additional support for dataframes. Pretty similar to DuckDB :) It should also leverage "Type Inference for Data Import".
@alexey-milovidov Thanks! I'm aware of clickhouse-local. One advantage of DuckDB over ClickHouse for me is working with Parquet files: ad hoc queries are impractical when a schema must be specified up front, so I'm also excited for type inference. Will the embedded module also be designed to work well with R? Don't discriminate against us R users, please :(
Yes: Python first, and R is next. This task is in the "experimental" category, so it will be implemented by a developer outside the main team (@LGrishin). For tasks in that category we usually have a prototype available in the summer.
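As a sketch of what such ad hoc querying looks like once schema inference is in place, clickhouse-local can query a Parquet file without any declared schema; `trips.parquet` and `fare_amount` here are hypothetical names, not from this thread:

```sql
-- Format and column types are inferred from the Parquet file itself;
-- 'trips.parquet' and 'fare_amount' are placeholder names.
SELECT count(), avg(fare_amount)
FROM file('trips.parquet')
```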
Hi @alexey-milovidov
@yiguolei It is named Resource Management in the roadmap. Yes, it was also on the 2021 roadmap, but we have only started implementing it, so most of the work is expected in 2022.
@alexey-milovidov What about concurrency management? I think there are too many threads when many queries run concurrently. Any progress on this?
It is going to be solved by one of the subtasks (a common data-processing pipeline for the server); the task is being implemented by @KochetovNicolai.
How about user-defined aggregate functions? Or user-defined table functions like Snowflake's: https://docs.snowflake.com/en/developer-guide/udf/java/udf-java-tabular-functions.html
Also this big task
We already have user-defined table functions, since version 21.10. User-defined aggregate functions are more difficult; we will see...
This is №1 in: SQL Compatibility Improvements.
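For reference, the SQL-based UDFs shipped in 21.10 are created with a lambda expression; this is a minimal illustrative example (`plus_one` is a name invented here):

```sql
-- A SQL user-defined function via a lambda (available since ClickHouse 21.10):
CREATE FUNCTION plus_one AS (x) -> x + 1;

SELECT plus_one(41);  -- returns 42
```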
@alexey-milovidov Thanks for your explanation. I'm looking forward to these new ClickHouse features and hope it keeps getting better!
Is this the limitation that prevents using recursive UDFs for now?
Wondering if this is somehow related to how KSQL does things, e.g.:
I found it quite hard to work with streaming-like data, especially streams that need to be joined with themselves. Materialized views are one way to do it, but they don't support self-joins. You can do it with Null-table hacks, but the lack of a guaranteed materialized-view insertion order is still a problem. Not sure if there is a better way to do this that you are thinking about.
You can look into #8331. It's already merged.
@cmsxbc Yes, recursive SQL UDFs are difficult; most likely we will not be able to support them in the coming months.
Would
MaterializedMySQL supports table filtering instead of replicating all tables in the database.
Maybe we can implement
@UnamedRus Thanks, do you have a "real world" example? The PR lacks documentation right now, so it'd be nice to see how you are using it.
@javisantana The WindowView documentation has been added here.
@Vxider Thanks. Referring to my original comment, I don't see how this window MV solves the problem of generating a materialized view from stream data that needs to be joined with itself (or other streams) to get the previous state of incoming entities. (More along these lines: https://calcite.apache.org/docs/stream.html#joining-streams-to-streams)
That sounds good. I'm looking forward to the release.
@alexey-milovidov Hi, is there any design document or PR about parallel reading from replicas?
Also, regarding "SQL functions for compatibility with MySQL": how is the progress? Will it be fully compatible?
No, we don't have any schedule for full compatibility.
ClickHouse is not yet able to run the full TPC-DS benchmark.
@alexey-milovidov Can you share a tl;dr of why parallel replica reading does not work? Any lessons learned?
@nvartolomei It is unclear to me, we need to make a setup and debug it.
Will lightweight UPDATE be supported, like lightweight DELETE?
@vkingnew We will consider this task only after finishing the implementation of Lightweight DELETE.
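For context, the lightweight DELETE discussed here marks rows with a delete mask instead of immediately rewriting data parts; a sketch using a hypothetical `orders` table:

```sql
-- Lightweight DELETE: rows are masked right away and purged by later merges.
DELETE FROM orders WHERE status = 'cancelled';

-- The classic heavyweight mutation, by contrast, rewrites whole data parts:
ALTER TABLE orders DELETE WHERE status = 'cancelled';
```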
@alexey-milovidov So this feature's priority is relatively low for you, i.e. it won't be implemented in the near future?
2023 Roadmap: #44767
@nvartolomei Here is a follow-up for parallel replicas: #43772
@vkingnew Currently, ClickHouse can run 92 of the 99 TPC-DS queries, with modifications.
@alexey-milovidov So, will this feature be open source in the future?
@ucasfl I can say that it is currently being developed in the private repo, and it is built around the specifics of ClickHouse Cloud. There are chances we will simply merge it into the main repository, but there are also chances it will remain isolated.
@ramseyxu There is an implementation of shared storage (HDFS & S3) and metadata (FoundationDB) in the ClickHouse fork by ByteDance: https://github.com/ByConity/ByConity
Inspired by @alexey-milovidov, I have implemented a Python-embedded ClickHouse engine.
@auxten, this is great! We can incorporate the required changes to the build scripts and automate them in CI.
@nvartolomei At this moment in time, parallel replicas are very close to production readiness.
@stigsb Hello! I am glad to read that active work is underway! Were you able to release your work? Can we expect this feature to be production-ready?
We are just finishing upgrading our fork to 23.3, which makes it possible for us to start breaking out PRs. We intend to release all of it during the summer.
You can also try this out. We use it in production with MySQL, and the same software works with Postgres since it uses Debezium: https://github.com/Altinity/clickhouse-sink-connector. It supports most DDL and large databases, and it is tested with tools like gh-ost (MySQL).
@alexey-milovidov Hi, is there any permanent setting to disable memory overcommit, both hard and soft? I tried this in the query.
This is the ClickHouse open-source roadmap for 2022.
Descriptions and links are to be filled in.
This roadmap does not cover tasks related to infrastructure, orchestration, documentation, marketing, integrations, SaaS, drivers, etc.
See also:
- Roadmap 2021: #17623
- Roadmap 2020: in Russian
## Main Tasks

### ✔️ Make clickhouse-keeper Production Ready
- ✔️ It is already feature-complete and used in production.
- ✔️ Update documentation to replace ZooKeeper with clickhouse-keeper everywhere.

### ✔️ Support for Backup and Restore
- ✔️ Backup of tables, databases, servers and clusters.
- ✔️ Incremental backups. Support for partial restore.
- ✔️ Support for pluggable backup storage options.

### ✔️ Semistructured Data
- ✔️ JSON data type with automatic type inference and dynamic subcolumns.
- ✔️ Sparse column format and optimization of functions for sparse columns. #22535
- Dynamic selection of column format: full, const, sparse, low cardinality.
- ✔️ Hybrid wide/compact data part format for a huge number of columns.

### ✔️ Type Inference for Data Import
- ✔️ Allow skipping column names and types if the data format already contains a schema (e.g. Parquet, Avro).
- ✔️ Allow inferring types for text formats (e.g. CSV, TSV, JSONEachRow). #32455
### Support for Transactions
- Atomic insert of more than one block, or into more than one partition, of MergeTree and ReplicatedMergeTree tables.
- Atomic insert into a table and its dependent materialized views. Atomic insert into multiple tables.
- Multiple SELECTs from one consistent snapshot.
- Atomic insert into a distributed table.
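The atomic-insert goals above can be sketched with ClickHouse's experimental transaction syntax; this is an illustration only (the syntax is experimental and may change, and `t` is a hypothetical table):

```sql
-- Experimental transactions: both inserts become visible atomically, or neither does.
BEGIN TRANSACTION;
INSERT INTO t VALUES (1), (2);
INSERT INTO t VALUES (3);
COMMIT;
```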
### ✔️ Lightweight DELETE
- ✔️ Make mutations more lightweight by using delete masks.
- ✔️ It won't enable frequent UPDATE/DELETE like in OLTP databases, but it will come closer.

### SQL Compatibility Improvements
- ✔️ Untangle name resolution and query analysis.
- Initial support for correlated subqueries.
- ✔️ Allow using window functions inside expressions.
- ✔️ Add compatibility aliases for some window functions, etc.
- ✔️ Support for GROUPING SETS.
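The GROUPING SETS support listed just above computes several groupings in one pass; a short illustration with a hypothetical `sales` table (`region`, `product`, `amount` are invented columns):

```sql
-- One scan yields per-region totals, per-product totals, and a grand total:
SELECT region, product, sum(amount)
FROM sales
GROUP BY GROUPING SETS ((region), (product), ())
```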
### JOIN Improvements
- Support for join reordering.
- Extend the cases where condition pushdown is applicable.
- Convert anti-join to NOT IN.
- ✔️ Use table sorting for DISTINCT optimization.
- ✔️ Use table sorting for merge JOIN.
- ✔️ Grace hash join algorithm.
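The grace hash join above is opt-in via the `join_algorithm` setting; a sketch with hypothetical tables `big_left` and `big_right`:

```sql
-- Grace hash join partitions the hash table into buckets that can spill
-- to disk, allowing joins larger than available memory:
SET join_algorithm = 'grace_hash';
SELECT * FROM big_left AS l INNER JOIN big_right AS r ON l.key = r.key;
```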
### Resource Management
- ✔️ Memory overcommit (soft and hard memory limits).
- Enable external GROUP BY and ORDER BY by default.
- ✔️ IO operations scheduler with priorities.
- ✔️ Make scalar subqueries accountable.
- CPU and network priorities.
### Separation of Storage and Compute
- ✔️ Parallel reading from replicas.
- ✔️ Dynamic cluster configuration with service discovery.
- ✔️ Caching of data from object storage.
- Simplification of ReplicatedMergeTree.
- ✔️ Shared metadata storage.
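Parallel reading from replicas, listed above, is gated behind settings; the names below are assumptions based on recent releases and may differ by version:

```sql
-- Let each replica process a distinct subset of the data for one query
-- (setting names are assumptions; check your server version's docs):
SET allow_experimental_parallel_reading_from_replicas = 1;
SET max_parallel_replicas = 3;
```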
## Experimental and Intern Tasks

### Streaming Queries
- Fix POPULATE for materialized views.
- Unification of materialized views, live views and window views.
- Allow setting up subscriptions on top of all tables, including Merge and Distributed.
- ✔️ Normalization of Kafka tables with storing offsets in ClickHouse.
- Support for exactly-once consumption from Kafka, non-consuming reads and multiple consumers.
- Streaming queries with GROUP BY and ORDER BY with windowing criteria.
- Persistent queues on top of ClickHouse tables.

### Integration with ML/AI
- 🗑️ Integration with TensorFlow
- 🗑️ Integration with MADlib

### GPU Support
- 🗑️ Compile expressions to GPU

- Unique Key Constraint
- User-Defined Data Types
- Incremental aggregation in memory
- Key-value data marts
- Text Classification
- Graph Processing
- Foreign SQL Dialects in ClickHouse: 🗑️ support for the MySQL dialect or Apache Calcite as an option.
- ✔️ Batch Jobs and Refreshable Materialized Views
- ✔️ Embedded ClickHouse Engine
- Data Hub
## Build and Testing Improvements

### Testing
- ✔️ Add tests for AArch64 builds.
- ✔️ Automated tests for backward compatibility.
- Server-side query fuzzer for all kinds of tests.
- ✔️ Fuzzing of query settings in functional tests.
- SQL function-based fuzzer.
- Fuzzer of data formats.
- Integrate with SQLLogicTest.
- Import obfuscated queries from Yandex Metrica.

### Builds
- ✔️ Docker images for AArch64.
- ✔️ Enable missing libraries for AArch64 builds.
- ✔️ Add and explore Musl builds.
- Build all libraries with our own CMake files.
- Embed root certificates into the binary.
- Embed a DNS resolver into the binary.
- Add ClickHouse to Snap, so people will not accidentally install obsolete versions.