Add Read support for Apache Iceberg #6227

Closed
nazq opened this issue Jan 14, 2023 · 14 comments · Fixed by #10375
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@nazq

nazq commented Jan 14, 2023

Problem description

Would like to see read support for Apache Iceberg similar to the support for delta. Cc @asheeshgarg

@nazq nazq added the enhancement New feature or an improvement of an existing feature label Jan 14, 2023
@nazq
Author

nazq commented Jan 14, 2023

Discussion on a Rust Iceberg SDK: apache/iceberg#5122

@Guillem96

Is this feature already in the Polars team roadmap?

@luancaarvalho

Could someone please advise where we can access the roadmap and verify if iceberg has been included in the Polars roadmap?

@tomaszdudek7

I'd love to have it!

@alexander-beedie
Collaborator

Would like to see read support for Apache Iceberg similar to the support for delta.

FYI: I'm looking at further generalising our database support in Python (ref: #10121). Are you intending to use directly from Rust, or the usual Python API? (If the latter, which driver do you usually use?)

@bitsondatadev

@alexander-beedie, I noticed Delta is implemented in Python + Arrow.

Iceberg committers are working on improving Arrow compatibility, but I wonder if you all prefer Rust support? That is almost complete as well. What's the best way forward in your opinion, given that both options are available?

Brian (DevRel Iceberg)

@bitsondatadev

Also related: apache/iceberg#7067

@luancaarvalho

In my case, Python, @alexander-beedie

@chitralverma
Contributor

My 2 cents on the topic: we can mimic the way we implemented things for Delta and keep this on the Python side of things via scan_ds.

@universalmind303
Collaborator

universalmind303 commented Aug 8, 2023

IMO, it would be preferable to do this on the Rust side. That way we can support it in SQL, Python, and so on.

A DataFusion-based project, GlareDB, recently added Iceberg support. It looks like it may be easy to port their DataFusion logic over to Polars. The dependencies also seem pretty lightweight: only iceberg, which in turn depends only on apache-avro (no dependency on apache-arrow!).

Fokko added a commit to Fokko/polars that referenced this issue Aug 8, 2023
Fokko added a commit to Fokko/polars that referenced this issue Aug 8, 2023
Fokko added a commit to Fokko/polars that referenced this issue Aug 8, 2023
@Fokko
Contributor

Fokko commented Aug 8, 2023

I got a working version using PyIceberg here: #10375

A DataFusion-based project, GlareDB, recently added Iceberg support. It looks like it may be easy to port their DataFusion logic over to Polars.

Unfortunately, a lot of the Rust implementations out there are far from complete. Looking at the implementation at GlareDB, a couple of things are missing:

  • No field-id schema resolution, and therefore it cannot read tables with evolved schemas (or it causes correctness issues!)
  • Pruning of partitions
  • Pruning of files, by using the metrics stored in the manifest files
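To illustrate the first point, here is a minimal pure-Python sketch of what field-id schema resolution means. It is not PyIceberg or GlareDB code; `Field` and `resolve_by_field_id` are hypothetical names made up for this example, and the column names are invented. The idea is that columns are matched on a stable numeric ID rather than on their name, so a renamed column is still found and a column added later reads as null in older files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int  # stable identifier that survives renames
    name: str      # current column name, which may change over time


def resolve_by_field_id(table_schema, file_schema, row):
    """Project a row written with `file_schema` onto `table_schema`.

    Columns are matched on field_id, not on name. Columns that did not
    exist when the file was written are filled with None.
    """
    file_name_by_id = {f.field_id: f.name for f in file_schema}
    resolved = {}
    for field in table_schema:
        old_name = file_name_by_id.get(field.field_id)
        resolved[field.name] = row.get(old_name) if old_name is not None else None
    return resolved


# A data file written before "fare" was renamed to "fare_amount"
# and before the "tip" column was added (all names are illustrative).
file_schema = [Field(1, "dt"), Field(2, "fare")]
table_schema = [Field(1, "dt"), Field(2, "fare_amount"), Field(3, "tip")]

row = {"dt": "2023-03-01", "fare": 12.5}
resolved = resolve_by_field_id(table_schema, file_schema, row)
print(resolved)
```

A reader that matched on names alone would look for "fare_amount" in the old file, find nothing, and lose the value, which is the kind of correctness issue mentioned above.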

The implementation only does file discovery, which is a pity since Iceberg has so much to offer.
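As a rough sketch of what partition and file pruning buy you, the snippet below filters a list of data files before any of them is opened. It is plain Python under simplifying assumptions, not the Iceberg manifest format: `DataFile`, `plan_files`, and the per-file fields are hypothetical stand-ins for the partition values and column bounds that the manifest files record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition_month: str      # hypothetical partition value, e.g. "2023-03"
    min_passenger_count: int  # lower bound recorded when the file was written
    max_passenger_count: int  # upper bound recorded when the file was written


def plan_files(files, months, min_passengers):
    """Return the files that might contain rows with
    partition_month in `months` and passenger_count > min_passengers."""
    selected = []
    for f in files:
        if f.partition_month not in months:
            continue  # partition pruning: the whole partition is out of range
        if f.max_passenger_count <= min_passengers:
            continue  # file pruning: no row in this file can match
        selected.append(f)
    return selected


files = [
    DataFile("a.parquet", "2022-12", 1, 6),
    DataFile("b.parquet", "2023-02", 1, 4),
    DataFile("c.parquet", "2023-03", 2, 8),
]
to_read = plan_files(
    files,
    months={"2023-01", "2023-02", "2023-03"},
    min_passengers=4,
)
print([f.path for f in to_read])
```

With file discovery alone, all three files would be read and filtered row by row; here only one of them has to be opened.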

Fokko added a commit to Fokko/polars that referenced this issue Aug 8, 2023
Fokko added a commit to Fokko/polars that referenced this issue Aug 8, 2023
Fokko added a commit to Fokko/polars that referenced this issue Aug 8, 2023
@alexander-beedie
Collaborator

alexander-beedie commented Aug 9, 2023

Iceberg committers are working on improving Arrow compatibility, but I wonder if you all prefer Rust support?

Apologies, and thanks for reaching out to us! This seems to have slipped under my radar; looks like @Fokko is well underway building a scan_ based approach in the Rust layer already, which looks exciting. For my own part I was looking at generalising our SQL interop so that we can handle user-instantiated connections. I'm primarily looking into Python-side SQL connectivity (I think there is also a SQL interface to Iceberg? I could be conflating DuckDB interacting with an underlying Iceberg store for native SQL though ;)

@Fokko
Contributor

Fokko commented Aug 9, 2023

Hey @alexander-beedie, thanks for jumping in here. What kind of SQL interface are you thinking of?

We're supporting SQL-like syntax for the expressions:

large_rides_in_march = tbl.scan().filter("dt >= '2023-01-01' and dt < '2023-04-01' and passenger_count > 4").to_arrow()

We could also accept this kind of expression in Polars. Also, DuckDB has recently opened up their Iceberg support; however, it is still in an experimental state, and many of the features that make Iceberg shine are still missing.

@alexander-beedie
Collaborator

alexander-beedie commented Aug 9, 2023

Hey @alexander-beedie, thanks for jumping in here. What kind of SQL interface are you thinking of? // ...

Great, I see it now - thanks for the clarification. I think I was conflating a quickly-scanned article involving DuckDB & Iceberg with there actually being a fully-fledged SQL query interface (and therefore also Python-side DBAPI -or equivalent- drivers), in which case things would "just work" once we start accepting user-created connection objects and their associated queries.
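For context, "accepting user-created connection objects and their associated queries" could look roughly like the sketch below. It is not the Polars API; `read_query` is a hypothetical helper, and sqlite3 is used only because it is a DBAPI 2.0 driver that ships with Python. Any driver exposing the same cursor interface would work the same way, which is why a store with a DBAPI driver would "just work".

```python
import sqlite3


def read_query(connection, query):
    """Run `query` on a user-created DBAPI 2.0 connection and return
    the result as a dict mapping column name to a list of values."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        # cursor.description holds one 7-item sequence per result column;
        # the first item is the column name.
        names = [d[0] for d in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return {name: [r[i] for r in rows] for i, name in enumerate(names)}


# The caller owns the connection; the helper only needs the DBAPI surface.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (dt TEXT, passenger_count INTEGER)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("2023-03-01", 5), ("2023-03-02", 2)],
)
columns = read_query(
    conn, "SELECT dt, passenger_count FROM rides WHERE passenger_count > 4"
)
print(columns)
conn.close()
```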

Looks to me like the scan_iceberg PR you're working on is the way to go here as it'll be able to take advantage of a fuller range of features; I will defer on the implementation details review/comments to those with more Rust expertise than myself (@ritchie46, @orlp, @universalmind303), as I am merely a casual dilettante in that arena ;)

@stinodego stinodego added the accepted Ready for implementation label Sep 12, 2023
10 participants