Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: deprecate register api #8863

Merged
merged 2 commits into from
May 29, 2024

Conversation

ncclementi
Copy link
Contributor

@ncclementi ncclementi commented Apr 2, 2024

Looks like I messed something up in the rebase, I'll fix it soon.

@ncclementi ncclementi changed the title chore: deprecate register api [WIP] refactor: deprecate register api Apr 4, 2024
@ncclementi ncclementi requested a review from gforsyth April 4, 2024 18:15
Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @ncclementi !

The deprecation implementation is good -- I think some of the tests can be updated now and some should probably wait until we have a standardized read_in_memory

ibis/backends/datafusion/tests/test_register.py Outdated Show resolved Hide resolved
ibis/backends/duckdb/tests/test_client.py Outdated Show resolved Hide resolved
ibis/backends/duckdb/tests/test_register.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_register.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_register.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_register.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_register.py Outdated Show resolved Hide resolved
Comment on lines 298 to 305
with pytest.warns(FutureWarning, match="v9.0"):
t = con.register(df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in-memory tests we can probably leave alone for now, but we should add a comment that these should be updated to work with the upcoming read_in_memory and we should also add a polars dataframe to the mix

ibis/backends/tests/test_register.py Outdated Show resolved Hide resolved
@ncclementi
Copy link
Contributor Author

ncclementi commented Apr 9, 2024

I modified the tests in question to use read_csv or read_parquet I wasn't sure what the name of the test should be or if we need them given that there are separate read_csv and read_parquet, though the register tests seem to be more complete. Happy to re-visit.

@ncclementi ncclementi force-pushed the deprecate-register-api branch 2 times, most recently from 7097b5d to 5b02581 Compare April 10, 2024 20:06
@ncclementi
Copy link
Contributor Author

Update regarding adapting test_register_csv and similar to use con.read_csv

Turns out that doing this would require to modify the tests quite a bit given that now this will trigger some XPASS failures, and the tests will become way more convoluted than what they are at the moment.

After chatting with @cpcloud, the suggestion was to leave the tests as they are now, and discuss a decision for when we remove the register api completely.

@gforsyth
Copy link
Member

Ok, I'm good with the changes here, but I think we need a consensus from a few folks on whether to merge now or not.
If we want to add this in Ibis 9.0, I think we also need to handle standardizing read_in_memory xref #6593 before we release, because it seems irresponsible to deprecate functionality without offering a replacement.

@cpcloud
Copy link
Member

cpcloud commented Apr 12, 2024

@gforsyth Brings up a good point. Let's hold off on merging to 9.0 for now to avoid delaying its release. We can merge this soon after 9.0 is cut (not immediately, just to wait for the bug reports to come in and cut a 9.0.1 or 9.1 bug fix release).

@ncclementi
Copy link
Contributor Author

I just rebased, I think when CI finishes we can get this one in since 9.0 is out

@ncclementi
Copy link
Contributor Author

@gforsyth should we get this one in, or are there still concerns merging this deprecation without having an alternative read_in_memory api?

@gforsyth
Copy link
Member

should we get this one in, or are there still concerns merging this deprecation without having an alternative read_in_memory api?

My preference is to have a replacement available -- I lost track of that work, but let me see if I can pick it back up.
In the interim, we'll want to update all the deprecated as of messages to be 9.1 instead of 9.0

@ncclementi
Copy link
Contributor Author

Ready, waiting for #9251 to be merged.

gforsyth added a commit that referenced this pull request May 29, 2024
This PR adds/codifies support for passing in-memory data to
`create_table`.

The default behavior for most backends is to first create a `memtable`
with
whatever `obj` is passed to `create_table`, then we create a table based
on that
`memtable` -- because of this, semantics around `temp` tables and
`catalog.database` locations are handled correctly. 

After the new table (that the user has provided a name for) is created,
we
drop the intermediate `memtable` so we don't add two tables for every
in-memory
object passed to `create_table`.

Currently most backends fail when passed `RecordBatchReaders`, or a
single
`RecordBatch`, or a `pyarrow.Dataset` -- if we add support for these to
`memtable`, all of those backends would start working, so I've marked
those
xfails as `notimpl` for now.

A few backends _don't_ work this way:

`polars` reads in the table directly using their fast-path local-memory
reading stuff.

`datafusion` uses a fast-path read, then creates a table from the table
that is
created by the fast-path -- this is because the `datafusion` dataframe
API has
no way to specify things like `overwrite`, or table location, but the
CTAS from
already present tables is very quick (and _possibly_ zero-copy?) so no
issue
there.

`duckdb` has a refactored `read_in_memory` (which we should deprecate),
but it
isn't entirely hooked up inside of `create_table` yet, so some paths may
go via
`memtable` creation, but `memtable` creation on DuckDB is especially
fast, so
I'm all for fixing this up eventually.

`pyspark` works with the intermediate `memtable` -- there are possibly
fast-paths available, but they aren't currently implemented.

`pandas` and `dask` have a custom `_convert_object` path


TODO:
* ~[ ] Flink~  Flink can't create tables from in-memory data?
* [x] Impala
* [x] BigQuery 
* [x] Remove `read_in_memory` from datafusion and polars

Resolves #6593 
xref #8863


Signed-off-by: Gil Forsyth <[email protected]>
- refactor(duckdb): add polars df as option, move test to backend suite
- feat(polars): enable passing in-memory data to create_table
- feat(datafusion): enable passing in-memory data to create_table
- feat(datafusion): use info_schema for list_tables
- feat(duckdb): enable passing in-memory data to create_table
- feat(postgres): allow passing in-memory data to create_table
- feat(trino): allow passing in-memory date to create_table
- feat(mysql): allow passing in-memory data to create_table
- feat(mssql): allow passing in-memory data to create_table
- feat(exasol): allow passing in-memory data to create_table
- feat(risingwave): allow passing in-memory data to create_table
- feat(sqlite): allow passing in-memory data to create_table
- feat(clickhouse): enable passing in-memory data to create_table
- feat(oracle): enable passing in-memory data to create_table
- feat(snowflake): allow passing in-memory data to create_table
- feat(pyspark): enable passing in-memory data to create_table
- feat(pandas,dask): allow passing in-memory data to create_table

---------

Signed-off-by: Gil Forsyth <[email protected]>
Co-authored-by: Phillip Cloud <[email protected]>
@cpcloud cpcloud merged commit 7a39bd3 into ibis-project:main May 29, 2024
74 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

chore: deprecate register functionality
3 participants