feat: improve API for choosing a backend on memtable.cache() #10942
Comments
I think a 5th alternative, and my preference, and also probably the most difficult to implement, is to make `.cache()` fully lazy. I don't think 1-4 really align well with the API we want to present to users. With a fully lazy cache operation, the UX doesn't really change.
Being lazy does sound like the best UX. Consider the line
I think this will have negative repercussions for anyone not using a local backend.
One complication this adds to the UX is that the meaning of the current cache context manager becomes unclear. Maybe there's a clear semantics for it, but it doesn't immediately jump out at me.

For example, if a computation under the context manager never runs:

```python
with foo.cache() as expr:  # what happens in __exit__?
    ...
```
Oof, yeah, that does sound hard to pin down the semantics for. If I had done an `ibis.set_backend(my_conn)` at the beginning, it would work, but telling people they need to do this doesn't seem like a good solution. If we are too clever and start choosing the backend on the user's behalf, it seems quite possible we would choose the wrong one, which could have quite bad performance characteristics. Maybe we just want to give a more helpful error: actually keep track of the fact that a cached table was derived from a memtable, and then in the error message point users to an FAQ page on the website. Maybe we also add an option to automatically cross between backends via pyarrow, but that should probably be opt-in.
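For reference, a minimal sketch of the `ibis.set_backend` workaround mentioned above; the duckdb connection here is just a stand-in for whichever non-default backend the memtable needs to land on:

```python
import ibis

my_conn = ibis.duckdb.connect()  # stand-in for the backend you actually want
ibis.set_backend(my_conn)        # make it the default so .cache() materializes there

mt = ibis.memtable({"x": [1, 2, 3]})
cached = mt.cache()              # cached on my_conn instead of whatever the default was
```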
Is your feature request related to a problem?
Consider the following script:
With the original memtable, there is no backend. But as soon as you `.cache()` it, it ends up in the default backend. This is a problem when I am trying to make the memtable interact with a backend that is not the default.
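The original script isn't reproduced here; the following is a minimal sketch of the failure mode described above, with the table contents, column names, and the choice of duckdb for the non-default backend all being illustrative assumptions:

```python
import ibis

ibis.set_backend("duckdb")        # whatever the default backend happens to be
other = ibis.duckdb.connect()     # stands in for "a backend that is not the default"
t = other.create_table("t", ibis.memtable({"key": [1, 2, 3]}))

mt = ibis.memtable({"key": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Fine: a plain memtable has no backend, so `other` can happily execute it.
mt.join(t, "key").to_pyarrow()

# Problem: .cache() materializes the memtable on the *default* backend, so the
# cached expression is now pinned there and can no longer be mixed with `other`.
cached = mt.cache()
cached.join(t, "key").to_pyarrow()  # error: expression involves multiple backends
```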
I have actually been sporadically encountering this issue for literally two years, and only now am I finally realizing what the root cause is. It was so hard to figure out because it's such spooky action at a distance: adding the `.cache()` to some distant line of code made the error pop up only much later in the script.
What is the motivation behind your request?
The workaround is to never .cache() any memtables. But some of the intermediate computations I am doing are expensive, so I really do want to be able to cache them in the middle of the computation chain.
Describe the solution you'd like
A few ideas, none of which I love:
1. Optional backend param to memtable
Add a `backend=None` param to `ibis.memtable`. Then, whenever a subsequent `.cache()` happens, the expression uses this backend. This unfortunately means that if you do `mt = ibis.memtable(..., backend=conn)`, then `mt` is forever only compatible with `conn`; you can't use it with `conn2`. That is a little counterintuitive to my mental model of a memtable, which is an ibis table that is in-memory and thus works with any backend.
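A rough sketch of what option 1 could look like; the `backend=` parameter on `ibis.memtable` is hypothetical and does not exist today:

```python
import ibis

conn = ibis.duckdb.connect()    # the backend that should own the cache
conn2 = ibis.duckdb.connect()   # some other backend

# Hypothetical API: the memtable remembers which backend .cache() should use.
mt = ibis.memtable({"x": [1, 2, 3]}, backend=conn)

cached = mt.mutate(y=mt.x + 1).cache()  # would materialize on conn, not the default

# The downside described above: mt is now tied to conn, so using it with conn2
# would presumably be an error, even though a memtable is just in-memory data.
```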
2. Optional backend param to .cache()
Really, the time at which we need to decide on a backend for a memtable is only when we start running computations. We ideally shouldn't need to specify the backend at expression creation time. So perhaps the API is `Table.cache(backend=None)`. But this is awkward, because you could do `backend1.table("t").cache(backend=backend2)`, which should be illegal. It would be better if our API made this impossible to do.
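A rough sketch of option 2; the `backend=` parameter on `Table.cache()` is hypothetical:

```python
import ibis

conn = ibis.duckdb.connect()
backend1 = ibis.duckdb.connect()
backend2 = ibis.duckdb.connect()

mt = ibis.memtable({"x": [1, 2, 3]})
backend1.create_table("t", mt)  # a table that physically lives on backend1

# The intended use: pick the backend only when caching.
cached = mt.cache(backend=conn)  # hypothetical parameter

# The awkward case: "t" already belongs to backend1, so asking backend2 to
# cache it should be illegal, but this signature can't prevent it.
backend1.table("t").cache(backend=backend2)  # hypothetical, and nonsensical
```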
3. Add Backend.cache() method
This keeps the API of `ibis.memtable` and `Table.cache` from needing the backend param, which is nice.
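And a rough sketch of option 3; a user-facing `Backend.cache()` method is likewise hypothetical:

```python
import ibis

conn = ibis.duckdb.connect()
mt = ibis.memtable({"x": [1, 2, 3]})

# Hypothetical API: the backend, not the expression, decides where the cache lives.
cached = conn.cache(mt.mutate(y=mt.x + 1))

# ibis.memtable(...) and Table.cache() keep their current signatures, and an
# expression bound to some other backend simply never gets passed to conn.cache().
```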
4. On memtable.cache(), you get another backend-agnostic memtable
It would use the default backend (usually duckdb) to compute the result, but then return a special Op that, when required, actually goes to that backend, calls e.g. `.to_pyarrow()` on the result, and hands you back a new `ibis.memtable` from that; e.g. equivalent to `ibis.memtable(t.to_pyarrow(), schema=t.schema())`. Ideally we could make this lazy, so it only materializes the memtable on demand when crossing the boundary between two backends. This is the most complex solution, but the best user API.
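For comparison, here is a sketch of the eager, manual version of what option 4 describes, using only APIs that exist today; the lazy, on-demand variant is the part that would need a new Op:

```python
import ibis

con_a = ibis.duckdb.connect()   # backend that computes the expensive result
con_b = ibis.duckdb.connect()   # backend that wants to consume it

t = con_a.create_table("t", ibis.memtable({"x": [1, 2, 3]}))
expensive = t.mutate(y=t.x * 10)

# Manual, eager version of option 4: materialize through pyarrow and wrap the
# result in a fresh, backend-agnostic memtable.
portable = ibis.memtable(expensive.to_pyarrow(), schema=expensive.schema())

# portable can now be used with con_b (or any backend); the proposal is to do
# this lazily, only when an expression actually crosses a backend boundary.
```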
Out of all these, I really want to avoid the middle two options, because I want to make my `expensive_computation` function backend-agnostic. I don't want to have to worry about what backend the given table is in, and both of those options require the user to choose a backend at cache time.
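For context, a sketch of the kind of backend-agnostic helper I mean; `expensive_computation` and the `key`/`value` column names are just illustrative:

```python
from ibis import _

def expensive_computation(t):
    """Backend-agnostic: only builds ibis expressions, never touches a connection."""
    intermediate = t.group_by("key").agg(total=_.value.sum())
    # The whole point of this issue: cache the expensive intermediate result
    # without having to know or choose which backend `t` lives on.
    return intermediate.cache().filter(_.total > 0)
```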
What version of ibis are you running?
main
What backend(s) are you using, if any?
No response