pushdowns not being applied correctly during optimization #2616

Closed
universalmind303 opened this issue Aug 5, 2024 · 0 comments · Fixed by #2635
A simple query to find the first 10 transactions that have "pizza" in the description:

df = daft.read_csv('~/Downloads/transactions.csv')
df = (df
  .where(df['description'].str.lower().str.like('%pizza%'))
  .select(daft.col('description'))
  .limit(10)
)

The unoptimized plan is as one would expect:

== Unoptimized Logical Plan ==

* Limit: 10
|
* Project: col(description)
|
* Filter: like(lower(col(description)), lit("%pizza%"))
|
* GlobScanOperator
|   Glob paths = [~/Downloads/transactions.csv]
|   File schema = transaction_date#Date, posted_date#Date, card_no#Int64,
|     description#Utf8, category#Utf8, debit#Float64, credit#Float64
|   Partitioning keys = []
|   Output schema = transaction_date#Date, posted_date#Date, card_no#Int64,
|     description#Utf8, category#Utf8, debit#Float64, credit#Float64

A few strange things I noticed:

  • The optimized plan has the filter and limit pushed down, but also shows them as their own nodes.
  • How does a limit get pushed down past a filter? Usually, if a filter is present, it needs to run first, acting as a pushdown barrier for the limit.
  • Why is the projection apparently not being pushed down at all? Neither the filter nor the limit is a pushdown barrier for the projection.
== Optimized Logical Plan ==

* Project: col(description)
|
* Limit: 10
|
* GlobScanOperator
|   Glob paths = [~/Downloads/transactions.csv]
|   File schema = transaction_date#Date, posted_date#Date, card_no#Int64,
|     description#Utf8, category#Utf8, debit#Float64, credit#Float64
|   Partitioning keys = []
|   Filter pushdown = like(lower(col(description)), lit("%pizza%"))
|   Limit pushdown = 10
|   Output schema = transaction_date#Date, posted_date#Date, card_no#Int64,
|     description#Utf8, category#Utf8, debit#Float64, credit#Float64
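To make the limit-vs-filter point concrete, here is a plain-Python sketch (not Daft code; the row data is made up) of why a Filter normally acts as a pushdown barrier for a Limit: taking the limit before the filter runs can return fewer, and different, rows.

```python
# Toy rows standing in for the CSV; only "description" matters here.
rows = [{"description": d} for d in
        ["pizza place", "grocery", "pizza delivery", "coffee", "pizza bar"]]

def filter_pizza(data):
    # The Filter node: keep rows whose description contains "pizza".
    return [r for r in data if "pizza" in r["description"].lower()]

def limit(data, n):
    # The Limit node: keep the first n rows.
    return data[:n]

correct = limit(filter_pizza(rows), 2)    # filter first, then limit
reordered = filter_pizza(limit(rows, 2))  # limit pushed below the filter

print(len(correct))    # 2
print(len(reordered))  # 1 -- the limit discarded rows before the filter ran
```

So if the scan really applies `Limit pushdown = 10` before `Filter pushdown = ...`, the query could silently return fewer than 10 matching rows.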

The physical plan shows no information about whether the filter/limit were actually pushed down, so it's hard to corroborate what is happening at the physical level.

== Physical Plan ==

* Project: col(description)
|   Clustering spec = { Num partitions = 1 }
|
* Limit: 10
|   Eager = false
|   Num partitions = 1
|
* TabularScan:
|   Num Scan Tasks = 1
|   Estimated Scan Bytes = 19366
|   Clustering spec = { Num partitions = 1 }
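For the projection question, pushing the Project into the scan should let it materialize only the one column the query needs instead of all seven. A plain-Python sketch of the intended effect (`scan` here is a hypothetical stand-in for the scan operator, not Daft's actual implementation):

```python
# File schema from the plans above: 7 columns.
FILE_SCHEMA = ["transaction_date", "posted_date", "card_no",
               "description", "category", "debit", "credit"]

def scan(columns):
    """Hypothetical scan: materialize only the requested columns per row."""
    return [{c: None for c in columns} for _ in range(3)]

# Without projection pushdown: scan all 7 columns, then project down to 1.
wide_rows = scan(FILE_SCHEMA)
unpushed = [{"description": r["description"]} for r in wide_rows]

# With projection pushdown: the Project is folded into the scan, so only
# 1 of the 7 columns is ever materialized.
pushed = scan(["description"])

print(len(wide_rows[0]))  # 7 columns materialized without pushdown
print(len(pushed[0]))     # 1 column materialized with pushdown
```

The final results are identical either way; the pushdown only changes how much the scan has to read, which is exactly why its absence is surprising here.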