perf: Avoid recursive union all by name call for duckdb diagonal concat #3399
Thanks @FBruzzesi

> [!NOTE]
> Take all of this with a grain of salt, as my extent of DuckDB use has been limited.

**Benchmark**

Double thanks for including a benchmark! I am left with some more questions though, which should be easy enough to get to the bottom of 🤞

**Configurations**

The most general conclusion I can make is that there looks to be more of a benefit as the size and number of datasets increase. At the lower end, the speedup is narrow and weird. Would you be able to extend the config to give a broader view?

Current:

- n_frames = 5, 10, 20
- n_rows = 10_000, 100_000
- n_cols = 20, 50

Proposed:

- n_frames = 2, 5, 10, 20, 50
- n_rows = 100, 1_000, 10_000, 100_000
- n_cols = 5, 10, 20, 50

I would hope when we go below the current lower bounds […]. This also extends the upper bounds for 2/3 of the parameters, where I'm expecting to continue seeing a perf boost 😎

**Presentation**

Could you replace these with […]? I'm only suggesting to replace, since the table will be pretty big after the other request 😂
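For illustration, the proposed grid could be wired into a parameter sweep along these lines. This is only a sketch: `sweep` and the `benchmark` callable are hypothetical stand-ins, not the harness actually used in the PR.

```python
import itertools
import timeit

# Proposed parameter grid from the review comment above
N_FRAMES = [2, 5, 10, 20, 50]
N_ROWS = [100, 1_000, 10_000, 100_000]
N_COLS = [5, 10, 20, 50]

def sweep(benchmark, repeats=3):
    # Run `benchmark(n_frames, n_rows, n_cols)` over the full grid,
    # averaging wall time over `repeats` runs per configuration.
    results = {}
    for cfg in itertools.product(N_FRAMES, N_ROWS, N_COLS):
        total = timeit.timeit(lambda: benchmark(*cfg), number=repeats)
        results[cfg] = total / repeats
    return results
```

With 5 × 4 × 4 values, this yields 80 configurations, which is why the results table would be "pretty big" as noted above.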
Thanks @dangotbanned

That makes two of us!

I have a 16 GB machine with a 500 GB drive; I can barely run the (20, 100_000, 50) config, and 50 dataframes will go OOM.
This is quite easy to achieve.
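A rough back-of-the-envelope check on the OOM report above. The constants are assumptions (64-bit numeric values, raw data volume only), not measurements from the benchmark:

```python
def approx_gib(n_frames, n_rows, n_cols, bytes_per_value=8):
    # Raw data volume of the concatenated result, assuming 8-byte values.
    # This ignores intermediate copies: a recursive pairwise concat can
    # materialize many partial results, multiplying the real footprint.
    return n_frames * n_rows * n_cols * bytes_per_value / 1024**3

print(f"{approx_gib(20, 100_000, 50):.2f} GiB")  # the barely-runnable config
print(f"{approx_gib(50, 100_000, 50):.2f} GiB")  # the config reported to go OOM
```

The raw result alone is under 2 GiB even at (50, 100_000, 50), so the OOM on a 16 GB machine is plausibly driven by the intermediate materializations rather than the final data volume.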
Ah yes, that'll teach me for plucking numbers out of thin air 😂
> [!CAUTION]
> Removed conclusions comment as inaccurate
I am going to close this, as it does not seem to bring a particular improvement. We can come back to it either once they support union by name in the Python API, or via […]
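For context, a rough pure-Python sketch of the two strategies discussed in this PR: a recursive/pairwise union-by-name (re-deriving and re-aligning a schema at every step) versus a single pass that computes the union schema once and aligns each frame exactly once. The names (`union_by_name_pairwise`, `diagonal_concat`) are hypothetical, and plain dicts of column lists stand in for DuckDB relations:

```python
from functools import reduce

def _align(frame, schema):
    # Pad a frame (dict of column -> list of values) to the target schema,
    # filling missing columns with None, mirroring UNION ALL BY NAME semantics.
    n = len(next(iter(frame.values()), []))
    return {col: frame.get(col, [None] * n) for col in schema}

def union_by_name_pairwise(left, right):
    # Pairwise approach: every step recomputes a schema and re-aligns both sides.
    schema = list(dict.fromkeys([*left, *right]))
    left, right = _align(left, schema), _align(right, schema)
    return {col: left[col] + right[col] for col in schema}

def diagonal_concat(frames):
    # Single-pass approach: compute the union schema once, align each frame once.
    schema = list(dict.fromkeys(col for f in frames for col in f))
    aligned = [_align(f, schema) for f in frames]
    return {col: [v for f in aligned for v in f[col]] for col in schema}
```

Both produce the same result, e.g. `reduce(union_by_name_pairwise, frames)` matches `diagonal_concat(frames)`; the difference is that the pairwise form repeats the alignment work (and re-materializes intermediates) once per frame.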
Description
I ran a benchmark locally with different configurations (times are averaged over 3 runs per configuration), with an overlap fraction of 50% of the columns:
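One way frames with a 50% column-overlap fraction could be generated for such a benchmark. This is a guess at the setup, not the PR's actual code; `make_columns` and the naming scheme are hypothetical:

```python
def make_columns(frame_idx, n_cols, overlap=0.5):
    # Hypothetical helper: the first `overlap` fraction of columns is shared
    # across all frames; the remainder is unique to each frame, so a diagonal
    # concat must fill those columns with NULLs for every other frame.
    n_shared = int(n_cols * overlap)
    shared = [f"shared_{i}" for i in range(n_shared)]
    unique = [f"f{frame_idx}_col{i}" for i in range(n_cols - n_shared)]
    return shared + unique
```

For example, `make_columns(0, 4)` gives `["shared_0", "shared_1", "f0_col0", "f0_col1"]`, and any two frames share exactly half of their columns.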
What type of PR is this? (check all applicable)
Related issues
- `concat(..., how="*_relaxed")` #3398 (comment)

Checklist