
Possibility to use Parquet for easy S3 backup? #27

Open
ruslandoga opened this issue Oct 22, 2024 · 4 comments

ruslandoga commented Oct 22, 2024

👋

This is just a question, not a feature request or issue or anything of the sort :)

I'm just starting to learn DuckDB, and I wonder: would it be a lot of work to make PhoenixAnalytics work with Parquet files? The upside is that Parquet files can be easily backed up to S3 (they are immutable), are quite storage-efficient (about 40% smaller according to https://benchmark.clickhouse.com/), are almost as fast as DuckDB's custom storage format (about 60% slower, per the same benchmark), and have zero load time. The downside is that they might require compaction, but that same property enables TTL.

(Screenshot: ClickBench results comparing DuckDB's native storage with Parquet, 2024-10-22.)

Or maybe there are other ways to stream or back up DuckDB to S3, like in https://motherduck.com/blog/differential-storage-building-block-for-data-warehouse/ or https://litestream.io?

ruslandoga changed the title from "Possibility to use Parquet?" to "Possibility to use Parquet for easy S3 backup?" on Oct 22, 2024
lalabuy948 (Owner) commented

Hi @ruslandoga, yes, I have this in my plans and I've already made a proof of concept which worked quite well, but I'm not sure yet how to make it efficient for a large number of writes.

As for backups, the simplest approach would be to set up a cron job that dumps the DuckDB file and uploads the resulting Parquet file to S3.
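
For illustration, here is a minimal sketch of that cron-style dump in Elixir, assuming the duckdbex bindings and ExAws.S3; the Backup module, the requests table, and the bucket name are placeholders, not PhoenixAnalytics internals:

# Dump an analytics table to a Parquet file via DuckDB, then upload it to S3.
# Assumes duckdbex and ex_aws_s3; table and bucket names are illustrative.
defmodule Backup do
  def run(db_path, bucket) do
    file = "phoenix_analytics_#{Date.to_iso8601(Date.utc_today(), :basic)}.parquet"

    {:ok, db} = Duckdbex.open(db_path)
    {:ok, conn} = Duckdbex.connection(db)
    {:ok, _} = Duckdbex.query(conn, "COPY requests TO '#{file}' (FORMAT PARQUET)")

    # A system cron job (or an in-app timer) would call Backup.run/2 on a schedule.
    ExAws.S3.put_object(bucket, file, File.read!(file)) |> ExAws.request!()
  end
end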


ruslandoga commented Oct 23, 2024

👋 @lalabuy948

I'm not sure yet how to make it efficient for a large number of writes.

I’m curious about what specific efficiency concerns you’re facing. Do you still have the PoC code available? I’d love to take a look and experiment with it!

Regarding DuckDB, I’m not entirely sure if it supports streaming writes like this:

COPY (stdin) TO 'phoenix_analytics_part_{YYYYMMDD}_{UUID}.parquet.lz4' (FORMAT PARQUET, COMPRESSION LZ4_RAW)

However, since Parquet supports writing in row groups, it should technically be feasible, as it would resemble how ClickHouse handles INSERT operations. If DuckDB doesn't support it directly, there might be other options, such as using libraries like https://docs.rs/parquet/latest/parquet/ or https://github.com/jorgecarleitao/parquet2 or even a native Elixir implementation :)

Another possibility could be writing to a CSV buffer or a temporary table, and then using DuckDB’s COPY to convert it to Parquet.
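
As a rough sketch of the temporary-table variant, again assuming duckdbex (the buffer table and its columns are made up):

# Buffer writes in a DuckDB temp table, then flush the whole batch to Parquet.
# Assumes duckdbex; table and column names are illustrative only.
{:ok, db} = Duckdbex.open("analytics.duckdb")
{:ok, conn} = Duckdbex.connection(db)

{:ok, _} = Duckdbex.query(conn, "CREATE TEMP TABLE buffer (path TEXT, status INT, at TIMESTAMP)")
{:ok, _} = Duckdbex.query(conn, "INSERT INTO buffer VALUES ('/home', 200, now())")

# Each flush writes one Parquet file, i.e. at least one full row group.
{:ok, _} = Duckdbex.query(conn, "COPY buffer TO 'part_0001.parquet' (FORMAT PARQUET)")
{:ok, _} = Duckdbex.query(conn, "DELETE FROM buffer")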

lalabuy948 self-assigned this on Oct 23, 2024
lalabuy948 (Owner) commented

🙌🏻 @ruslandoga, DuckDB has httpfs support, you can take a look; some sketches are saved in priv/repo/duck_s3.exs.
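
For readers who haven't seen it, this is roughly what the httpfs flow looks like (not the actual contents of duck_s3.exs; the bucket, region, and credentials below are placeholders):

# Load DuckDB's httpfs extension and copy a table straight to S3.
# Assumes duckdbex; credentials and bucket are placeholders.
{:ok, db} = Duckdbex.open("analytics.duckdb")
{:ok, conn} = Duckdbex.connection(db)

{:ok, _} = Duckdbex.query(conn, "INSTALL httpfs")
{:ok, _} = Duckdbex.query(conn, "LOAD httpfs")
{:ok, _} = Duckdbex.query(conn, "SET s3_region = 'us-east-1'")
{:ok, _} = Duckdbex.query(conn, "SET s3_access_key_id = '...'")
{:ok, _} = Duckdbex.query(conn, "SET s3_secret_access_key = '...'")
{:ok, _} = Duckdbex.query(conn, "COPY requests TO 's3://my-bucket/requests.parquet' (FORMAT PARQUET)")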

There is more you can research here.

Let me know which ideas you're going to try and we can work together on a solution! Feel free to reach me over email or on X / Telegram.

lalabuy948 added the "enhancement" (New feature or request) label on Oct 23, 2024

ruslandoga commented Oct 23, 2024

Let me know which ideas you're going to try and we can work together on a solution!

I think I'd like more control over catalogs and uploads. DuckDB's httpfs seems to do a ListObjectsV2 on each read_parquet from S3 when the path contains wildcards, and that can get expensive. Hive-style partitioning might also be suboptimal (in my limited experience).

So I would probably try writing Parquet to disk (possibly not through DuckDB) and uploading it to S3 from Elixir, e.g. every ~20 minutes, which stays around the free-tier number of PUTs (one upload every 20 minutes is roughly 2,160 PUTs a month, against the 2,000 in the AWS free tier). A single process can do this. Since that process would have the full view of which files are available and where (S3 or disk), it could also act as a table catalog with some support for predicate pushdown: e.g. if the user wants to see analytics for a site for the last month, the catalog gives DuckDB only the last month's worth of that site's Parquet files. The same process could perform compaction (e.g. compacting the previous day's Parquet files into a single one), since it can hide incomplete parts from DuckDB. A sketch is below.

That's what I'd like to try :)
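
A minimal sketch of that single uploader/catalog process, assuming Parquet parts already written to disk plus ExAws.S3; every module, path, and naming scheme here is hypothetical:

defmodule ParquetCatalog do
  # Hypothetical single process: uploads local Parquet parts to S3 on an
  # interval, and tracks which files exist and what date they cover, so it
  # can hand DuckDB a pruned file list (a crude date-based predicate pushdown).
  use GenServer

  @interval :timer.minutes(20)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Returns the Parquet files covering the requested Date.range/2.
  def files_for(range), do: GenServer.call(__MODULE__, {:files_for, range})

  @impl true
  def init(opts) do
    Process.send_after(self(), :upload, @interval)
    {:ok, %{bucket: Keyword.fetch!(opts, :bucket), files: []}}
  end

  @impl true
  def handle_call({:files_for, range}, _from, state) do
    {:reply, for({path, date} <- state.files, date in range, do: path), state}
  end

  @impl true
  def handle_info(:upload, state) do
    state = Enum.reduce(Path.wildcard("parts/*.parquet"), state, &upload_part/2)
    Process.send_after(self(), :upload, @interval)
    {:noreply, state}
  end

  defp upload_part(path, state) do
    name = Path.basename(path)
    ExAws.S3.put_object(state.bucket, name, File.read!(path)) |> ExAws.request!()

    # Hypothetical naming scheme: parts/part_YYYYMMDD_<uuid>.parquet
    <<y::binary-size(4), m::binary-size(2), d::binary-size(2)>> = String.slice(name, 5, 8)
    date = Date.new!(String.to_integer(y), String.to_integer(m), String.to_integer(d))

    # Drop the local copy once uploaded so it isn't re-uploaded next tick.
    File.rm!(path)
    %{state | files: [{"s3://#{state.bucket}/#{name}", date} | state.files]}
  end
end

Compaction could then be another periodic message in the same process: rewrite the previous day's parts into one file, upload it, and swap the catalog entries so DuckDB never sees the incomplete halves.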
