Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ddl): use column names, not position, for insertion order #9264

Merged
merged 6 commits into from
May 31, 2024

Conversation

gforsyth
Copy link
Member

@gforsyth gforsyth commented May 28, 2024

If we have an ir.Table (or something that will be converted to a memtable before insertion), we now generate the column list for the sge.Insert using the column ordering of the to-be-inserted-table, not the parent table. This is technically still positional ordering, but the ordering matches up with the target table.

So if we have t

┏━━━━━━━┳━━━━━━━┓
┃ xy     ┃
┡━━━━━━━╇━━━━━━━┩
│ int64int64 │
├───────┼───────┤
│     60 │
│     51 │
└───────┴───────┘

and t2

┏━━━━━━━┳━━━━━━━┓
┃ yx     ┃
┡━━━━━━━╇━━━━━━━┩
│ int64int64 │
├───────┼───────┤
│     13 │
│     24 │
└───────┴───────┘

then we will generate something like INSERT INTO t (y, x) (SELECT * FROM t2) which will yield

>>> con.insert("t", t2)

>>> con.tables.t
┏━━━━━━━┳━━━━━━━┓
┃ xy     ┃
┡━━━━━━━╇━━━━━━━┩
│ int64int64 │
├───────┼───────┤
│     60 │
│     51 │
│     31 │
│     42 │
└───────┴───────┘

The other case is where a user passes in records directly, which we will convert into an intermediate memtable, something like:

>>> ibis.memtable([(1, 2), (3, 4)])
┏━━━━━━━┳━━━━━━━┓
┃ col0col1  ┃
┡━━━━━━━╇━━━━━━━┩
│ int64int64 │
├───────┼───────┤
│     12 │
│     34 │
└───────┴───────┘

We definitely cannot use those column names to handle insertion, so in this case, we revert back to using the parent tables columns, which results in using positional ordering for the insert, which seems like the only reasonable thing to do in this scenario.

That SQL will look something like INSERT INTO t (x,y) (SELECT * FROM some_memtable)

>>> con.insert("t", [(1, 2), (3, 4)])  # implicit positional insert

>>> con.tables.t
┏━━━━━━━┳━━━━━━━┓
┃ xy     ┃
┡━━━━━━━╇━━━━━━━┩
│ int64int64 │
├───────┼───────┤
│     60 │
│     51 │
│     31 │
│     42 │
│     12 │
│     34 │
└───────┴───────┘

Resolves #9249

  • fix(impala): use column names for insertion order, not position
  • fix(sqlite): use column names, not position, for insertion order
  • fix(SQLBackend): use column names, not position, for insertion order
  • fix(clickhouse): use column names, not position, for insertion order
  • test(all): add test case for insert-by-name

@gforsyth
Copy link
Member Author

Hmm, I'm going to have to rethink this -- moving this to draft for the moment

@gforsyth gforsyth marked this pull request as draft May 28, 2024 19:50
@gforsyth gforsyth marked this pull request as ready for review May 28, 2024 22:01
@gforsyth
Copy link
Member Author

Ok, this is ready for a look, but I'll be OOO for the next while, so feel free to fix it up or nuke it as necessary.

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good!

We should probably start to do some consolidation of backend DDL code, even if we don't get to the point of having something like a set of internal DDL op classes.

@cpcloud cpcloud added this to the 9.1 milestone May 31, 2024
@cpcloud
Copy link
Member

cpcloud commented May 31, 2024

Consolidate the implementation under a new private backend on SQLBackend (_build_insert_query)

@cpcloud cpcloud enabled auto-merge (squash) May 31, 2024 14:54
@cpcloud cpcloud merged commit 3506f40 into ibis-project:main May 31, 2024
74 checks passed
@gforsyth gforsyth deleted the insert-by-name branch June 10, 2024 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: .insert() uses column position, not names, to align
2 participants