Use case
clickhouse-local is probably the most powerful log analysis tool I've ever come across, and it seems to obviate most of the pain of running a full ClickHouse installation (ops work, ETL) without sacrificing much (if any?) performance for offline batch tasks.
For use cases where large amounts of logs are stored in object stores like GCS or S3, the ability to stream-read from the object store and process the data with clickhouse-local is very desirable: no large, expensive VMs running a permanent database with a copy of the original data.
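For illustration, a minimal sketch of what that looks like with the s3 table function (the bucket, path, and column names here are made up; GCS can be read the same way through its S3-compatible endpoint):

```sh
# Hypothetical example: scan gzipped JSON logs directly from S3 with clickhouse-local.
# Bucket, path, and schema are placeholders; compression is inferred from the .gz extension.
clickhouse-local --query "
    SELECT status, count() AS hits
    FROM s3('https://my-logs-bucket.s3.amazonaws.com/access/*.json.gz',
            'JSONEachRow', 'ts DateTime, status UInt16, path String')
    GROUP BY status
    ORDER BY hits DESC"
```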
This raises the possibility: can ClickHouse be made totally serverless? It is very cheap to run short S3/GCS scan jobs from Lambda or Cloud Functions. Many organizations already use this pattern, but they usually write custom code to run in Lambda, whereas ClickHouse already covers many of these use cases in a much nicer form. However, the current ClickHouse binary is far too large to fit in a Lambda ZIP file (current limit: 50 MB, vs. 300 MB+ for the current official binaries).
Describe the solution you'd like
Potentially a custom build, or simply some documented steps (a PGO build?), that slims clickhouse-local down to reading only from S3/GCS/URLs, with enough functionality disabled that it fits within 50 MB. Is that possible? I see lots of template-heavy C++; perhaps that is why the binary is so large, and it won't change :)
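As a rough sketch of the kind of build I mean (the CMake option names below are from memory and may not match the current tree exactly; treat this as an assumption, not a recipe):

```sh
# Assumed/approximate CMake options -- exact names may differ between versions.
# The idea: build only the `local` tool, drop optional external libraries and the
# LLVM JIT, selectively re-enable whatever the S3/URL readers need, strip symbols,
# and then check whether the result fits under the 50 MB Lambda ZIP limit.
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DENABLE_CLICKHOUSE_ALL=OFF \
    -DENABLE_CLICKHOUSE_LOCAL=ON \
    -DENABLE_LIBRARIES=OFF \
    -DENABLE_EMBEDDED_COMPILER=OFF
ninja clickhouse
strip programs/clickhouse
ls -lh programs/clickhouse   # hoping for < 50 MB
```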
Describe alternatives you've considered
The obvious alternative is spinning up a container or spot instance to run each job, but that requires inventing some ops framework for managing the VMs and containers. Lambda functions can be fed e.g. by SNS and auto-scaled according to the queue of user queries, with zero management.
Additional context
None. Low priority, just an idea, but one I've already tried because it makes so much sense over here.