Keynote presentation for SiMoD workshop at SIGMOD 2024 #10481

alamb · 2024-05-13T11:01:31Z

I am giving an invited keynote talk at a workshop colocated with SIGMOD 2024 on Friday Jun 14, 2024 (after the main conference).

I need to prepare slides for this and figured people in the DataFusion community might be interested.

Title: DataFusion: The Case for Building Data Systems using Open Standards

Abstract: Andrew will discuss engineering tradeoffs made when building Apache DataFusion, an open source and extensible query engine used as the basis of many commercial and open source projects. These decisions (mostly) favored simplicity and worked better than initially expected. He will cover the rationale for which parts of DataFusion use pre-existing standards such as Arrow and Parquet, and which parts are built “from scratch” such as vectorized hashing and normalized sort keys. He will also discuss DataFusion’s design philosophy of extensible APIs paired with simple default implementations. Finally, he will offer lessons learned and enumerate some things that worked well and what could have been improved.

alamb · 2024-05-13T11:01:54Z

Here are some notes I have on what I want to talk about:

interfaces and then paradoxically allowed us to narrow the scope of potential optimizations (e.g. compute kernels) and have people focus on different areas.

Things we didn't implement:

Our own file formats (instead we focused on existing ones: Parquet, Avro, Arrow, JSON, CSV)
Our own memory format (instead we use Arrow, not just externally but internally)
Our own thread pool (instead we use a standard one, tokio)
Morsel-driven parallelism (instead we use pull / exchange)
A buffer pool (instead we use standard I/O)
The latest / greatest window aggregate fanciness (TODO: get paper link)

Providing simple built-in defaults, but with hooks for more specialized implementations
Keeps DF simple, allows

Catalog
memory / disk manager
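
A minimal sketch of the "simple default plus hook" pattern described above, using only the Rust standard library. Every name here (`MemoryBudget`, `Unbounded`, `FixedLimit`, `Engine`, `with_memory_budget`, `buffer_batch`) is hypothetical and chosen for illustration; none of it is DataFusion's actual API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical extension point: anything that can approve or deny a memory reservation.
trait MemoryBudget: Send + Sync {
    fn try_reserve(&self, bytes: usize) -> Result<(), String>;
}

/// The simple built-in default: never refuses a reservation.
struct Unbounded;

impl MemoryBudget for Unbounded {
    fn try_reserve(&self, _bytes: usize) -> Result<(), String> {
        Ok(())
    }
}

/// A more specialized implementation a user could plug in: a hard cap on total bytes.
struct FixedLimit {
    limit: usize,
    used: AtomicUsize,
}

impl FixedLimit {
    fn new(limit: usize) -> Self {
        FixedLimit { limit, used: AtomicUsize::new(0) }
    }
}

impl MemoryBudget for FixedLimit {
    fn try_reserve(&self, bytes: usize) -> Result<(), String> {
        // Atomically add `bytes` only if the new total stays within the limit.
        self.used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                used.checked_add(bytes).filter(|total| *total <= self.limit)
            })
            .map(|_| ())
            .map_err(|used| {
                format!(
                    "cannot reserve {} bytes: {} of {} already used",
                    bytes, used, self.limit
                )
            })
    }
}

/// Hypothetical engine that works out of the box and exposes a hook to swap the budget.
struct Engine {
    memory: Arc<dyn MemoryBudget>,
}

impl Engine {
    /// Uses the simple default, so callers who do not care need no configuration.
    fn new() -> Self {
        Engine { memory: Arc::new(Unbounded) }
    }

    /// The hook: replace the default with a user-provided implementation.
    fn with_memory_budget(mut self, memory: Arc<dyn MemoryBudget>) -> Self {
        self.memory = memory;
        self
    }

    /// Stand-in for an operator asking for memory before buffering a batch.
    fn buffer_batch(&self, bytes: usize) -> Result<(), String> {
        self.memory.try_reserve(bytes)
    }
}
```

With this shape, `Engine::new()` is enough for the common case, while `Engine::new().with_memory_budget(Arc::new(FixedLimit::new(100)))` opts into the stricter behavior without changing the engine itself.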

Things we did: places we spent time and complexity

normalized keys / row format
optimizing parquet reader
optimizing hashing
plan representation (logical plans, exprs, etc.)
function library
ListingTable (maybe this should have been more
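
To illustrate the "normalized keys / row format" item above, here is a minimal sketch of the general idea: encode each multi-column row into one byte string whose plain byte-wise comparison matches the column-by-column ordering. The encoding below (sign-bit flip for integers, zero-byte escaping plus terminator for strings) is an assumption chosen for brevity, not the actual Arrow / DataFusion row format, and it ignores nulls and descending order. All function names are hypothetical.

```rust
/// Encode an i32 so that unsigned byte-wise comparison matches signed numeric order.
fn encode_i32(value: i32, out: &mut Vec<u8>) {
    // Flipping the sign bit maps i32::MIN..=i32::MAX onto 0..=u32::MAX in order;
    // big-endian puts the most significant byte first for byte-wise comparison.
    let flipped = (value as u32) ^ 0x8000_0000;
    out.extend_from_slice(&flipped.to_be_bytes());
}

/// Encode a string so that a shorter prefix sorts first and later columns cannot
/// interfere: 0x00 bytes are escaped as 0x00 0xFF and the value ends with 0x00 0x00.
fn encode_str(value: &str, out: &mut Vec<u8>) {
    for &b in value.as_bytes() {
        if b == 0 {
            out.extend_from_slice(&[0x00, 0xFF]);
        } else {
            out.push(b);
        }
    }
    out.extend_from_slice(&[0x00, 0x00]);
}

/// Build one comparable key for a two-column row (id, name).
fn normalized_key(id: i32, name: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(4 + name.len() + 2);
    encode_i32(id, &mut key);
    encode_str(name, &mut key);
    key
}

/// Sort rows by (id, name) using only byte-wise comparisons of the encoded keys.
fn sort_rows(rows: &mut [(i32, String)]) {
    // Each key is computed once per row, then compared as raw bytes.
    rows.sort_by_cached_key(|(id, name)| normalized_key(*id, name));
}
```

The appeal of this approach is that the comparator no longer needs to know column types or dispatch per column: after encoding, sorting is a comparison of byte slices.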

Things I would do differently next time:
Keep listing table out of the core
UDFs from the start

alamb · 2024-06-10T21:53:47Z

Here is the presentation. I will post it more broadly once I have worked on it a bit more.

https://docs.google.com/presentation/d/1K3EdknzkqU2LhWi_eNKXdcvNk0OEvk9AqTLqhZkPxuI/edit#slide=id.p

alamb · 2024-06-14T17:50:52Z

It's done! I'll try to record this talk at some point too and post it on http://andrew.nerdnetworks.org/

alamb added the documentation label May 13, 2024

alamb self-assigned this May 13, 2024

alamb mentioned this issue Jun 3, 2024

DataFusion weekly project plan (Andrew Lamb) - June 3, 2024 #10779

Closed

8 tasks

alamb mentioned this issue Jun 10, 2024

Minor: Improve ListingTable documentation #10854

Merged

alamb mentioned this issue Jun 11, 2024

DataFusion weekly project plan (Andrew Lamb) - June 10, 2024 #10869

Closed

7 tasks

alamb closed this as completed Jun 14, 2024

alamb mentioned this issue Jun 17, 2024

DataFusion weekly project plan (Andrew Lamb) - June 17, 2024 #10955

Closed

5 tasks
