Polars Based Compute Engine #7067
Hey @asheeshgarg Thanks for raising this! Integrating with Polars would be awesome. Currently, we can load data into an Arrow table and convert that into any Arrow-based backend (including Polars). As a short-term fix, we could add a Polars conversion next to the existing Arrow one. On the longer term: unfortunately, we don't yet support pushing down the predicate from a PyArrow dataset all the way down to Iceberg. Once that is done, we can also easily add this to Polars. (Disclaimer: I'm more familiar with PyArrow at the moment, hence the PyArrow concepts.) I do think Polars and Iceberg would be an awesome combination, as Polars is also lazy by design, and you would only open the Parquet files that you actually need for your query.
@Fokko thanks for the input. For now, what I am doing is using the PyIceberg API to create a scan and pass down as much of the expression as possible to reduce the set of data files, then creating a lazy DataFrame with Polars and running the remaining filters lazily via DataFrame expressions.
@asheeshgarg Ah nice, that works, but it has some caveats that you need to be aware of. Iceberg tracks columns by IDs instead of names. For example, if you rename a column, we do this on the table schema; when we read in the files and encounter a file that still has the old column name, we update the name based on the ID of the column. The same goes for things like deletes. This makes it quite an effort to implement Iceberg for engines like Polars (mostly because there is no Rust implementation yet). With the upcoming 0.4.0 version we'll get even more performance, because we'll also have metrics evaluation (skipping Parquet files based on the upper and lower bounds) and positional deletes. I suggested creating a Polars dataframe from an Arrow table because then you'll get things like the projection and deletes for free :)
@Fokko sure, schema evolution is definitely a challenge. Interesting to know that 0.4.0 supports partition pruning using metadata. Is there a PR where we can see this?
Hi all, Polars contributor here. I did the integration for DeltaIO recently :) However, it would be great to support lazy scans as well. Once this addition is in place, I can open a PR to support Iceberg on the Polars side.
Thanks @chitralverma for chiming in here.
That sounds like a great first step. The important part is that we push down the predicate from Polars into PyIceberg. Iceberg is designed to work with large tables, and not being able to prune files would result in very poor performance.
I fully agree. I think that would be a great second step, but it would probably be a bit more complex. We don't yet integrate with Arrow in the way that would be ideal, but we're working on this (it will probably take some time). When an action is performed on a dataset, it would need to call PyIceberg to do the planning (and all the Iceberg optimizations). I'm happy to help, but I'm less familiar with Polars, so it would be awesome if you could work on the integration on that side 🚀
Sure, sounds good! Let me get a little familiar with the Iceberg codebase and then get back on this. Once the Iceberg <=> Arrow link works as expected, adding this on the Polars side will be easy.
@bitsondatadev mentioned Polars integration. Looking at it a second time, I think we can implement this by just constructing a LazyFrame, similar to: https://github.com/pola-rs/polars/blob/main/py-polars/polars/io/pyarrow_dataset/anonymous_scan.py
@Fokko FYI, I also asked on an issue I found on the Polars GitHub whether they prefer a Rust or a Python implementation.
Took a stab at it, and it seems to work fine: https://www.loom.com/share/2b2dfbbada6e4fac88d0d0070e31f99f |
@Fokko saw the video and the approach; I had checked this out before. The problem is the behaviour linked here. I guess the changes need to be done on the pyiceberg side to return a pyarrow dataset. If you look, Delta also has the same.
I agree with you there, but that happens after the filtering, so PyIceberg will already have pruned the unrelated files and filtered out the unrelated data. With Iceberg's hidden partitioning, we don't have to filter on the physical partition columns explicitly.
Explicit partition filtering is very user-unfriendly, because if you don't pass the partition, it will also cause Polars to read too much data.
Another thing that currently seems to be a bottleneck: once you materialize to Arrow, the size explodes considerably, because the dictionary encoding of Parquet is not preserved: apache/arrow#20110
Polars has support for Iceberg: pola-rs/polars#10375 🚀 |
Feature Request / Improvement
@nazq @rdblue
Polars provides good support for lazy and DataFrame operations on top of columnar datasets. It would be great to have a Polars-based compute engine natively supported on top of Iceberg.
I have tested that, for an Iceberg store backed by Parquet files, we can directly load a Polars DataFrame using the PyIceberg/Java Iceberg APIs.
Query engine
None