
Possibility to use Parquet for easy S3 backup? #27

Open
ruslandoga opened this issue Oct 22, 2024 · 4 comments

ruslandoga commented Oct 22, 2024

👋

This is just a question, not a feature request or issue or anything of the sort :)

I'm just starting to learn DuckDB, and I wonder: would it be a lot of work to make PhoenixAnalytics work with Parquet files? The upside is that Parquet files can be easily backed up to S3 (they are immutable), are quite storage-efficient (about 40% smaller according to https://benchmark.clickhouse.com/), are almost as fast as DuckDB's custom storage format (about 60% slower, per the same benchmark), and have zero load time. The downside is that they might require compaction, but that same property enables TTL.

(Screenshot: ClickBench results comparing DuckDB's native storage with Parquet, 2024-10-22.)

Or maybe there are other ways to stream or back up DuckDB to S3, like in https://motherduck.com/blog/differential-storage-building-block-for-data-warehouse/ or https://litestream.io?

ruslandoga changed the title from "Possibility to use Parquet?" to "Possibility to use Parquet for easy S3 backup?" on Oct 22, 2024
lalabuy948 (Owner) commented

Hi @ruslandoga, yes, I have this in my plans and I've already made a proof of concept which worked quite well, but I'm not sure yet how to make it efficient for a large number of writes.

As for backups, the simplest approach would be to set up a cron job that dumps the DuckDB file and uploads the resulting Parquet file to S3.
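
For illustration, here is a minimal sketch of that cron-style dump in Elixir, assuming the duckdbex bindings and ExAws.S3; the Backup module, the requests table, and the bucket name are placeholders, not PhoenixAnalytics internals:

# Dump an analytics table to a Parquet file via DuckDB, then upload it to S3.
# Assumes duckdbex and ex_aws_s3; table and bucket names are illustrative.
defmodule Backup do
  def run(db_path, bucket) do
    file = "phoenix_analytics_#{Date.to_iso8601(Date.utc_today(), :basic)}.parquet"

    {:ok, db} = Duckdbex.open(db_path)
    {:ok, conn} = Duckdbex.connection(db)
    {:ok, _} = Duckdbex.query(conn, "COPY requests TO '#{file}' (FORMAT PARQUET)")

    # A system cron job (or an in-app timer) would call Backup.run/2 on a schedule.
    ExAws.S3.put_object(bucket, file, File.read!(file)) |> ExAws.request!()
  end
end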


ruslandoga commented Oct 23, 2024

👋 @lalabuy948

I'm not sure yet how to make it efficient for a large number of writes.

I’m curious about what specific efficiency concerns you’re facing. Do you still have the PoC code available? I’d love to take a look and experiment with it!

Regarding DuckDB, I’m not entirely sure if it supports streaming writes like this:

COPY (stdin) TO 'phoenix_analytics_part_{YYYYMMDD}_{UUID}.parquet.lz4' (FORMAT PARQUET, COMPRESSION LZ4_RAW)

However, since Parquet supports writing in row groups, it should technically be feasible, as it would resemble how ClickHouse handles INSERT operations. If DuckDB doesn't support it directly, there might be other options, such as using libraries like https://docs.rs/parquet/latest/parquet/ or https://github.com/jorgecarleitao/parquet2 or even a native Elixir implementation :)

Another possibility could be writing to a CSV buffer or a temporary table, and then using DuckDB’s COPY to convert it to Parquet.
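
As a rough sketch of the temporary-table variant, again assuming duckdbex (the buffer table and its columns are made up):

# Buffer writes in a DuckDB temp table, then flush the whole batch to Parquet.
# Assumes duckdbex; table and column names are illustrative only.
{:ok, db} = Duckdbex.open("analytics.duckdb")
{:ok, conn} = Duckdbex.connection(db)

{:ok, _} = Duckdbex.query(conn, "CREATE TEMP TABLE buffer (path TEXT, status INT, at TIMESTAMP)")
{:ok, _} = Duckdbex.query(conn, "INSERT INTO buffer VALUES ('/home', 200, now())")

# Each flush writes one Parquet file, i.e. at least one full row group.
{:ok, _} = Duckdbex.query(conn, "COPY buffer TO 'part_0001.parquet' (FORMAT PARQUET)")
{:ok, _} = Duckdbex.query(conn, "DELETE FROM buffer")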

lalabuy948 self-assigned this on Oct 23, 2024
lalabuy948 (Owner) commented

🙌🏻 @ruslandoga, DuckDB has httpfs support, you can take a look; some sketches are saved in priv/repo/duck_s3.exs.
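
For readers who haven't seen it, this is roughly what the httpfs flow looks like (not the actual contents of duck_s3.exs; the bucket, region, and credentials below are placeholders):

# Load DuckDB's httpfs extension and copy a table straight to S3.
# Assumes duckdbex; credentials and bucket are placeholders.
{:ok, db} = Duckdbex.open("analytics.duckdb")
{:ok, conn} = Duckdbex.connection(db)

{:ok, _} = Duckdbex.query(conn, "INSTALL httpfs")
{:ok, _} = Duckdbex.query(conn, "LOAD httpfs")
{:ok, _} = Duckdbex.query(conn, "SET s3_region = 'us-east-1'")
{:ok, _} = Duckdbex.query(conn, "SET s3_access_key_id = '...'")
{:ok, _} = Duckdbex.query(conn, "SET s3_secret_access_key = '...'")
{:ok, _} = Duckdbex.query(conn, "COPY requests TO 's3://my-bucket/requests.parquet' (FORMAT PARQUET)")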

There is more you can research here.

Let me know which ideas you're going to try and we can work together on a solution! Feel free to reach me over email or on X / Telegram.

lalabuy948 added the "enhancement" (New feature or request) label on Oct 23, 2024

ruslandoga commented Oct 23, 2024

Let me know which ideas you're going to try and we can work together on a solution!

I think I'd like more control over catalogs and uploads. DuckDB's httpfs seems to do a ListObjectsV2 on each read_parquet from S3 when the path contains wildcards, and that can get expensive. Hive-style partitioning might also be suboptimal (in my limited experience).

So I would probably try writing Parquet to disk (possibly not through DuckDB) and uploading it to S3 from Elixir, e.g. every ~20 minutes, which stays around the free-tier number of PUTs (one upload every 20 minutes is roughly 2,160 PUTs a month, against the 2,000 in the AWS free tier). A single process can do this. Since that process would have the full view of which files are available and where (S3 or disk), it could also act as a table catalog with some support for predicate pushdown: e.g. if the user wants to see analytics for a site for the last month, the catalog gives DuckDB only the last month's worth of that site's Parquet files. The same process could perform compaction (e.g. compacting the previous day's Parquet files into a single one), since it can hide incomplete parts from DuckDB. A sketch is below.

That's what I'd like to try :)
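
A minimal sketch of that single uploader/catalog process, assuming Parquet parts already written to disk plus ExAws.S3; every module, path, and naming scheme here is hypothetical:

defmodule ParquetCatalog do
  # Hypothetical single process: uploads local Parquet parts to S3 on an
  # interval, and tracks which files exist and what date they cover, so it
  # can hand DuckDB a pruned file list (a crude date-based predicate pushdown).
  use GenServer

  @interval :timer.minutes(20)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Returns the Parquet files covering the requested Date.range/2.
  def files_for(range), do: GenServer.call(__MODULE__, {:files_for, range})

  @impl true
  def init(opts) do
    Process.send_after(self(), :upload, @interval)
    {:ok, %{bucket: Keyword.fetch!(opts, :bucket), files: []}}
  end

  @impl true
  def handle_call({:files_for, range}, _from, state) do
    {:reply, for({path, date} <- state.files, date in range, do: path), state}
  end

  @impl true
  def handle_info(:upload, state) do
    state = Enum.reduce(Path.wildcard("parts/*.parquet"), state, &upload_part/2)
    Process.send_after(self(), :upload, @interval)
    {:noreply, state}
  end

  defp upload_part(path, state) do
    name = Path.basename(path)
    ExAws.S3.put_object(state.bucket, name, File.read!(path)) |> ExAws.request!()

    # Hypothetical naming scheme: parts/part_YYYYMMDD_<uuid>.parquet
    <<y::binary-size(4), m::binary-size(2), d::binary-size(2)>> = String.slice(name, 5, 8)
    date = Date.new!(String.to_integer(y), String.to_integer(m), String.to_integer(d))

    # Drop the local copy once uploaded so it isn't re-uploaded next tick.
    File.rm!(path)
    %{state | files: [{"s3://#{state.bucket}/#{name}", date} | state.files]}
  end
end

Compaction could then be another periodic message in the same process: rewrite the previous day's parts into one file, upload it, and swap the catalog entries so DuckDB never sees the incomplete halves.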
