
feat: improve API for choosing a backend on memtable.cache() #10942

Open

NickCrews opened this issue Mar 5, 2025 · 6 comments

Labels
feature Features or general enhancements

Comments

@NickCrews
Contributor

NickCrews commented Mar 5, 2025

Is your feature request related to a problem?

Consider the following script:

import ibis

conn = ibis.duckdb.connect("mydb.duckdb")
if "my_table" not in conn.tables:
    conn.create_table("my_table", schema={"c": "int64"}, overwrite=False)


def get_new_data():
    t = ibis.memtable([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
    t = t.select(c=t.a + t.b)
    print(t._find_backends())
    # ([], False)
    t = t.cache()
    print(t._find_backends())
    # ([<ibis.backends.duckdb.Backend object at 0x11df823d0>], False)
    t = t.mutate(c=t.c + 1)
    return t

def ingest(conn, new):
    print(conn)
    # <ibis.backends.duckdb.Backend object at 0x11fb4ef10>
    print(f"adding {new.count().execute()} rows to new_table")
    already_there = new.semi_join(conn.table("my_table"), "c")
    print(f"skipping {already_there.count().execute()} rows already in my_table")
    # IbisError: Multiple backends found for this expression
    return conn.insert("my_table", new)
    # Catalog Error: Table with name ibis_cached_ao56qedyyva3hjxvhrew7g3utu does not exist!

ingest(conn, get_new_data())

A fresh memtable has no backend. But as soon as you .cache() it, it ends up in the default backend. This is a problem when I am trying to make the memtable interact with a backend that is not the default.

I have actually been encountering this issue sporadically for literally two years, and only now am I finally realizing the root cause. It was so hard to track down because it's such spooky action at a distance: adding the .cache() on some distant line of code made the error only pop up much later in the script.

What is the motivation behind your request?

The workaround is to never .cache() any memtables. But some of the intermediate computations I am doing are expensive, so I really do want to be able to cache them in the middle of the computation chain.

Describe the solution you'd like

A few ideas, none of which I love:

1. Optional backend param to memtable

Add a backend=None parameter to ibis.memtable. Then, whenever a subsequent .cache() happens, the expression uses this backend. Unfortunately, this means that if you do mt = ibis.memtable(..., backend=conn), then mt is forever compatible only with conn; you can't use it with conn2. That is counterintuitive to my mental model of a memtable as an ibis table that is in-memory and thus works with any backend.
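A sketch of how option 1 might look; the backend= parameter here is hypothetical, not existing ibis API:

```python
import ibis

conn = ibis.duckdb.connect("mydb.duckdb")

# Hypothetical: pin the memtable to a backend at creation time.
mt = ibis.memtable([{"a": 1, "b": 2}], backend=conn)
t = mt.cache()  # would cache into conn rather than the default backend

# Downside: mt is now tied to conn forever and can't be used with a conn2.
```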

2. Optional backend param to .cache()

Really, the moment at which we need to decide on a backend for a memtable is only when we start running computations; we ideally shouldn't need to specify the backend at expression-creation time. So perhaps the API is Table.cache(backend=None). But this is awkward, because you could write backend1.table("t").cache(backend=backend2), which should be illegal. It would be better if our API made this impossible to express.
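For contrast, a sketch of the option 2 API; again, the backend= parameter is hypothetical:

```python
# Hypothetical: choose the backend at cache time instead of creation time.
new = ibis.memtable([{"a": 1, "b": 2}]).cache(backend=conn)

# The awkward case: a table already bound to backend1 being cached into
# backend2. Ideally the API would make this impossible to write.
backend1.table("t").cache(backend=backend2)  # should be illegal
```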

3. Add Backend.cache() method

This keeps the API of ibis.memtable and Table.cache from needing the backend param, which is nice.
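A sketch of the option 3 API; Backend.cache() is hypothetical:

```python
# Hypothetical: the backend is the receiver, so neither ibis.memtable nor
# Table.cache needs a backend parameter, and there is no way to name two
# conflicting backends in a single call.
mt = ibis.memtable([{"a": 1, "b": 2}])
cached = conn.cache(mt.select(c=mt.a + mt.b))  # result lives in conn
```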

4. On memtable.cache(), you get another backend-agnostic memtable

It would use the default backend (usually DuckDB) to compute the result, but then return a special Op that, when required, actually goes to that backend, calls e.g. .to_pyarrow() on the result, and hands you back a new ibis.memtable from that, equivalent to ibis.memtable(t.to_pyarrow(), schema=t.schema()). Ideally this would be lazy, only materializing the memtable on demand when crossing the boundary between two backends.
The most complex solution, but the best user API.

Out of all these, I really want to avoid the middle two options, because I want to keep my expensive computation function backend-agnostic: I don't want to worry about which backend the given table lives in, and options 2 and 3 both require the user to choose a backend at cache time.

What version of ibis are you running?

main

What backend(s) are you using, if any?

No response

@NickCrews NickCrews added the feature Features or general enhancements label Mar 5, 2025
@cpcloud
Member

cpcloud commented Mar 6, 2025

I think a 5th alternative, and my preference, and also probably the most difficult to implement, is to make .cache() a lazy operation, probably by making a Cache node.

I don't think 1-4 really align well with the API we want to present to users. With a fully lazy cache operation, the UX doesn't really change.

@NickCrews
Contributor Author

NickCrews commented Mar 6, 2025

Being lazy does sound like the best UX.

Consider the line print(f"adding {new.count().execute()} rows to new_table"). That would trigger an evaluation, but at that point we wouldn't have any hint yet which backend we want the result in. Only at the semi_join or insert above would we know which backend to use. And it's trickier still, because we might also want to use the cached result in other backends.

All that is to say: while we can make it a lazy op, I think we still need it to be able to export to pyarrow on demand. Possibly by default we just store the cached value at rest as a pyarrow table, but that could start running the user out of memory if they have a lot of caches. The slower, less memory-intensive, and harder-to-implement way would be to store the cached value at rest in the default backend, and only export to pyarrow on demand when switching between backends.
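The compute-once, reuse-anywhere behavior described above can be sketched in plain Python, independent of ibis; LazyCache and materialize are illustrative names, not proposed API:

```python
# Minimal sketch: a handle that defers the expensive computation until a
# consumer actually needs the materialized result, then reuses it.
class LazyCache:
    def __init__(self, compute):
        self._compute = compute  # thunk that would run the query
        self._result = None
        self._done = False

    def materialize(self):
        # Compute exactly once; later calls (possibly from other
        # backends) reuse the stored result.
        if not self._done:
            self._result = self._compute()
            self._done = True
        return self._result

runs = []
cache = LazyCache(lambda: runs.append("run") or [1, 2, 3])
assert not cache._done                   # nothing executed yet
assert cache.materialize() == [1, 2, 3]
assert cache.materialize() == [1, 2, 3]
assert runs == ["run"]                   # computed exactly once
```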

@gforsyth
Member

gforsyth commented Mar 6, 2025

by default we just store at rest the cached value as a pyarrow table

I think this will have negative repercussions for anyone not using a local backend.

@cpcloud
Member

cpcloud commented Mar 6, 2025

One complication this adds to the UX is that the meaning of the current cache context manager becomes unclear.

Maybe there's a clear semantics for it, but it doesn't immediately jump out at me.

@cpcloud
Member

cpcloud commented Mar 6, 2025

For example, if a computation under the context manager never runs .execute()/.to_pyarrow(), what would the context manager do?

with foo.cache() as expr:  # what happens in __exit__?
    ...

@NickCrews
Contributor Author

Oof, yeah, it does sound hard to pin down the semantics for that case.

If I had done an ibis.set_backend(my_conn) at the beginning, it would work. But telling people they need to do that doesn't seem like a good solution.

If we get too clever and start choosing the backend on the user's behalf, it seems quite possible we would choose the wrong one, which could have quite bad performance characteristics. Maybe we just want to give a more helpful error instead: keep track that a cached table was derived from a memtable, and point users in the error message to an FAQ page on the website.

Maybe we add some option to automatically cross between backends via pyarrow, but it should probably be opt-in.

Projects
Status: backlog

3 participants