Possibility to use Parquet for easy S3 backup? #27
@ruslandoga (original post):

👋

This is just a question, not a feature request or an issue report of any sort :)

I'm just starting to learn DuckDB, and I wonder whether it would be a lot of work to make PhoenixAnalytics work with Parquet files. The upside is that Parquet files can be easily backed up to S3 (they are immutable), are quite storage-efficient (about 40% smaller than DuckDB's custom storage format), and are almost as fast to query (about 60% slower), both figures according to https://benchmark.clickhouse.com/; they also have zero load time. The downside is that they might require compaction, but that in turn enables TTL.

Or maybe there are other ways to stream / back up DuckDB to S3, like in https://motherduck.com/blog/differential-storage-building-block-for-data-warehouse/ or https://litestream.io?
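(A minimal sketch of the Parquet-backed layout this question has in mind, assuming DuckDB's read_parquet over a directory of immutable parts; the paths, schema, and the 30-day TTL are made up for illustration:)

```sql
-- Query a directory of immutable Parquet parts as one logical table.
SELECT count(*)
FROM read_parquet('parts/phoenix_analytics_part_*.parquet');

-- Compaction doubling as TTL: merge the small parts into one file,
-- keeping only recent rows; the old parts are deleted out of band.
COPY (
  SELECT *
  FROM read_parquet('parts/phoenix_analytics_part_*.parquet')
  WHERE inserted_at > now() - INTERVAL 30 DAY
) TO 'parts/compacted.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
```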
Maintainer:

Hi @ruslandoga, yes, I have this in my plans, and I already made a proof of concept that worked quite well, but I'm not sure yet how to make it efficient for a high volume of writes. As for backups, the simplest approach would be to set up a cron job that dumps the DuckDB file to Parquet and uploads the Parquet file to S3.
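(A minimal sketch of such a dump-and-upload step, assuming DuckDB's httpfs extension; the bucket, table name, and credential handling are placeholders, and recent DuckDB versions can manage credentials with CREATE SECRET instead of SET:)

```sql
-- Run from cron, e.g.: duckdb analytics.duckdb < backup.sql
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = '...';
SET s3_secret_access_key = '...';

-- Export the table straight to Parquet on S3 in a single COPY.
COPY requests
TO 's3://my-bucket/backups/requests.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
```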
@ruslandoga:

I'm curious what specific efficiency concerns you're facing. Do you still have the PoC code available? I'd love to take a look and experiment with it!

Regarding DuckDB, I'm not entirely sure it supports streaming writes like this:

COPY (stdin) TO 'phoenix_analytics_part_{YYYYMMDD}_{UUID}.parquet.lz4' (FORMAT PARQUET, COMPRESSION LZ4_RAW)

However, since Parquet supports writing in row groups, it should technically be feasible, as it would resemble how ClickHouse handles INSERT operations. If DuckDB doesn't support it directly, there might be other options, such as libraries like https://docs.rs/parquet/latest/parquet/ or https://github.com/jorgecarleitao/parquet2, or even a native Elixir implementation :) Another possibility could be writing to a CSV buffer or a temporary table, and then using DuckDB's COPY to convert it to Parquet.
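(A rough sketch of the temporary-table variant mentioned above; the table, columns, and part file name are hypothetical:)

```sql
-- Buffer the many small writes in a temporary table...
CREATE TEMP TABLE requests_buffer (
  path VARCHAR,
  status SMALLINT,
  duration_ms DOUBLE,
  inserted_at TIMESTAMP DEFAULT current_timestamp
);

INSERT INTO requests_buffer (path, status, duration_ms)
VALUES ('/home', 200, 1.7);

-- ...then flush each batch as one immutable Parquet part.
COPY requests_buffer
TO 'parts/phoenix_analytics_part_20240101_0001.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
DELETE FROM requests_buffer;
```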
Maintainer:

🙌🏻 @ruslandoga, there is DuckDB support for this, and there's more you can research on it. Let me know which ideas you're going to try, and we can work together on a solution! Feel free to reach me over email or on X / Telegram.
@ruslandoga:

I think I'd like more control over catalogs and uploads. That's what I'd like to try :)
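(One sketch of what "more control over catalogs and uploads" could look like, not what the thread settled on: the application uploads parts itself and keeps the catalog as a plain DuckDB view over the bucket; all names are hypothetical:)

```sql
-- A hand-rolled catalog: one view over whatever Parquet parts
-- have been uploaded to the bucket so far.
INSTALL httpfs;
LOAD httpfs;

CREATE OR REPLACE VIEW requests AS
SELECT *
FROM read_parquet('s3://my-bucket/parts/*.parquet');
```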