feat(python): Add support for Iceberg #10375

Merged
merged 40 commits into from
Sep 19, 2023

Conversation

Fokko
Contributor

@Fokko Fokko commented Aug 8, 2023

Resolves #6227

Using PyIceberg:

  • Implements field-id-based schema resolution
  • Integrates with catalogs, including Hive, Glue, DynamoDB, REST, and SqlCatalog.
  • Partition-based and metrics-based data file pruning: only files relevant to the expression are loaded, which reduces both the data being read and the number of calls to the object store.
  • Support for AWS, Azure, and GCS is in the next release.
  • Using the PyArrow backend.
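
To illustrate the metrics-based pruning mentioned above, here is a minimal sketch. It is not PyIceberg's implementation: `DataFileStats` and `prune_for_greater_than` are hypothetical names, and the example assumes each data file records lower and upper bounds for a single integer column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileStats:
    """Hypothetical per-file metrics: min and max of one integer column."""

    path: str
    lower_bound: int
    upper_bound: int


def prune_for_greater_than(files, literal):
    """Keep only files that may contain rows where the column > literal.

    A file whose upper bound is <= literal cannot match, so it is skipped
    without ever being opened.
    """
    return [f for f in files if f.upper_bound > literal]


files = [
    DataFileStats("a.parquet", 0, 99),
    DataFileStats("b.parquet", 100, 199),
    DataFileStats("c.parquet", 200, 299),
]

# Only files that can satisfy `column > 150` survive the pruning step.
kept = prune_for_greater_than(files, 150)
```

The surviving files may still contain non-matching rows, so the filter still has to be applied after the files are read; pruning only avoids reading files that cannot match.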

@github-actions github-actions bot added the enhancement and python labels Aug 8, 2023
@Fokko Fokko force-pushed the fd-add-pyiceberg-support branch 2 times, most recently from a53eef2 to 92f10e2 Compare August 8, 2023 19:02
@stinodego
Member

stinodego commented Aug 14, 2023

Hi @Fokko, looks like you've done quite a lot of work here! Could you please make sure the CI passes? Then we'll take a closer look.

Let us know if you need any help. I'll put this on draft for now.

@stinodego stinodego marked this pull request as draft August 14, 2023 11:27
@Fokko
Contributor Author

Fokko commented Aug 14, 2023

@stinodego Thanks!

> Could you please make sure the CI passes? Then we'll take a closer look.

Yes, I'm on top of it. PyIceberg still requires Pydantic <2. I have a PR ready on the PyIceberg side to bump it to Pydantic 2; after that we need to do a release, and then the CI here should pass.

@Fokko Fokko force-pushed the fd-add-pyiceberg-support branch 10 times, most recently from 2f06a24 to 6a82d0d Compare September 7, 2023 09:55
py-polars/pyproject.toml (outdated, resolved)
return pl.LazyFrame._scan_python_function(arrow_schema, func, pyarrow=True)


def _scan_pyarrow_dataset_impl(
Member

Why do we need a second _scan_pyarrow_dataset_impl? Can we not reuse the existing implementation?

Member

@Fokko Could you get back to me on this one?

Contributor Author

Of course. This function does something different from the original one. In the original, Python's eval function turns the predicate string, which contains Python code, into an actual Python object, and that object is passed into the delta library:

    if predicate:
        # imports are used by inline python evaluated by `eval`
        from polars.datatypes import Date, Datetime, Duration  # noqa: F401
        from polars.utils.convert import (
            _to_python_datetime,  # noqa: F401
            _to_python_time,  # noqa: F401
            _to_python_timedelta,  # noqa: F401
        )
    
        _filter = eval(predicate)

Here, we instead parse the string into an abstract syntax tree and traverse that tree to build a PyIceberg expression. I did it this way because the PyArrow expression doesn't expose any Python methods for traversing it (the same goes for the Polars expression; otherwise I could just traverse that one directly). I've added this to the docstring as well 👍🏻
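
A rough sketch of the idea, using only the standard-library `ast` module. This is not the code from the PR: `predicate_to_tree` and the tuple-based node names are hypothetical stand-ins for PyIceberg expression classes, and the shape of the predicate string (`pa.compute.field(...)` comparisons joined with `&` or `|`) is an assumption.

```python
import ast

# Hypothetical names standing in for PyIceberg expression classes.
_COMPARE_OPS = {
    ast.Gt: "GreaterThan",
    ast.GtE: "GreaterThanOrEqual",
    ast.Lt: "LessThan",
    ast.LtE: "LessThanOrEqual",
    ast.Eq: "EqualTo",
    ast.NotEq: "NotEqualTo",
}
_BOOL_OPS = {ast.BitAnd: "And", ast.BitOr: "Or"}


def _convert(node):
    """Recursively turn an AST node into a nested tuple expression."""
    if isinstance(node, ast.Expression):
        return _convert(node.body)
    if isinstance(node, ast.BinOp) and type(node.op) in _BOOL_OPS:
        # `a & b` and `a | b` combine two sub-predicates.
        return (_BOOL_OPS[type(node.op)], _convert(node.left), _convert(node.right))
    if (
        isinstance(node, ast.Compare)
        and len(node.ops) == 1
        and type(node.ops[0]) in _COMPARE_OPS
    ):
        op_name = _COMPARE_OPS[type(node.ops[0])]
        return (op_name, _convert(node.left), _convert(node.comparators[0]))
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "field"
        and len(node.args) == 1
    ):
        # Assumed form: pa.compute.field('column_name')
        return ("Reference", _convert(node.args[0]))
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"Unsupported expression: {ast.dump(node)}")


def predicate_to_tree(predicate: str):
    """Parse the predicate string without evaluating it, then convert it."""
    return _convert(ast.parse(predicate, mode="eval"))


tree = predicate_to_tree(
    "(pa.compute.field('id') > 10) & (pa.compute.field('name') == 'x')"
)
```

Unlike the `eval` route, nothing in the string is executed; anything the converter does not recognise raises a `ValueError` instead.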

Member

@stinodego stinodego left a comment

Good stuff. I'm on holiday next week. Feel free to merge once pyiceberg 0.5.0 has been released and the PR has been updated accordingly.

@Fokko
Contributor Author

Fokko commented Sep 18, 2023

@stinodego @alexander-beedie PyIceberg 0.5.0 has been released, and the PR is good to go :)

@alexander-beedie
Collaborator

Alright... here we go ;)

@ritchie46 ritchie46 merged commit d8d005d into pola-rs:main Sep 19, 2023
12 checks passed
@Fokko Fokko deleted the fd-add-pyiceberg-support branch September 19, 2023 08:08
@Fokko
Contributor Author

Fokko commented Sep 19, 2023

Awesome, thanks all! 🚀

@alexander-beedie
Collaborator

> Awesome, thanks all! 🚀

It's a great addition; looking forward to seeing it in action 👍

@Fokko
Contributor Author

Fokko commented Sep 19, 2023

I did a small demo at PyData Amsterdam last week. Once it is released, I'll work on a blog post! 👍🏻

@luancaarvalho

@Fokko Is there a recording of your demo at PyData Amsterdam?

@Fokko
Contributor Author

Fokko commented Sep 19, 2023

@luancaarvalho The demo is recorded, but not yet online. I think they still have to do some final editing.

Labels
enhancement (New feature or an improvement of an existing feature), highlight (Highlight this PR in the changelog), python (Related to Python Polars)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Read support for Apache Iceberg
5 participants