feat!: Add support for sparse transform expressions (#1199)
## What changes are proposed in this pull request?
Log replay needs to define per-file transforms for each row of metadata that
survives data skipping. Unfortunately, `Expression::Struct` is "dense"
(mentions _every_ output field) and this produces excessive overhead
when injecting the (usually very few) partition columns for tables with
wide schemas (hundreds or thousands of columns). Column mapping tables
are even worse, because they don't _change_ the columns at all -- they
just need to apply the output schema to the input data.
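To make the overhead concrete, here is a minimal sketch of what a "dense" struct expression looks like. The `FieldExpr` enum and `dense_struct_expr` function are hypothetical stand-ins for illustration, not the kernel's actual `Expression::Struct` API; the point is only that the expression has one entry per output field, however few fields actually change.

```rust
use std::collections::HashMap;

/// Hypothetical per-field expression: pass an input column through,
/// or inject a literal (e.g. a partition value).
#[derive(Debug, PartialEq)]
enum FieldExpr {
    Column(String),
    Literal(String),
}

/// Builds a dense expression: one entry for EVERY field of the output schema.
/// The work (and the size of the result) is O(schema_width) for each file,
/// even when only a handful of partition columns differ from the input.
fn dense_struct_expr(
    logical_schema: &[String],
    partition_values: &HashMap<String, String>,
) -> Vec<FieldExpr> {
    logical_schema
        .iter()
        .map(|name| match partition_values.get(name) {
            Some(value) => FieldExpr::Literal(value.clone()),
            None => FieldExpr::Column(name.clone()),
        })
        .collect()
}
```

With a 1000-column schema and a single partition column, this produces 1000 entries per file, 999 of which are plain pass-through column references.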
Solution: Define a new `Expression::Transform` that is a "sparse"
representation of the _changes_ to be made to a given top-level schema
or nested struct. Input columns can be dropped or replaced, and new
output columns can be injected after an input column of choice (or
prepended to the output schema). The engine's expression evaluator does
the actual work to transfer unchanged input columns across while
building the output `EngineData`.
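As a rough illustration of the idea (not the kernel's actual types or evaluator), the sketch below models a sparse transform over a single row of `(name, value)` pairs. `SparseTransform`, `FieldEdit`, `Action`, and all method names are assumptions made for this example; a real engine would operate on columnar `EngineData` and expressions rather than strings.

```rust
use std::collections::HashMap;

/// One row's column as (name, value). Real engines work on columnar batches.
type Column = (String, String);

/// What happens to an input column that the transform mentions.
enum Action {
    Keep,
    Drop,
    Replace(String),
}

/// Edits attached to one input column (hypothetical layout).
struct FieldEdit {
    action: Action,
    insert_after: Vec<Column>,
}

/// Hypothetical sparse transform: stores only the changes, never the full schema.
struct SparseTransform {
    prepended: Vec<Column>,
    edits: HashMap<String, FieldEdit>,
}

impl SparseTransform {
    fn new() -> Self {
        SparseTransform { prepended: Vec::new(), edits: HashMap::new() }
    }

    /// Returns the edit record for `input`, creating a no-op record if needed.
    fn edit(&mut self, input: &str) -> &mut FieldEdit {
        self.edits.entry(input.to_string()).or_insert_with(|| FieldEdit {
            action: Action::Keep,
            insert_after: Vec::new(),
        })
    }

    fn prepend(mut self, name: &str, value: &str) -> Self {
        self.prepended.push((name.to_string(), value.to_string()));
        self
    }

    fn drop_column(mut self, input: &str) -> Self {
        self.edit(input).action = Action::Drop;
        self
    }

    fn replace(mut self, input: &str, value: &str) -> Self {
        self.edit(input).action = Action::Replace(value.to_string());
        self
    }

    fn insert_after(mut self, input: &str, name: &str, value: &str) -> Self {
        self.edit(input).insert_after.push((name.to_string(), value.to_string()));
        self
    }

    /// An empty transform changes nothing.
    fn is_identity(&self) -> bool {
        self.prepended.is_empty() && self.edits.is_empty()
    }

    /// The evaluator walks the input once: unmentioned columns are copied
    /// across unchanged, mentioned columns are kept, dropped, or replaced,
    /// and any injected columns follow their anchor column.
    fn apply(&self, input: &[Column]) -> Vec<Column> {
        if self.is_identity() {
            return input.to_vec(); // fast path: nothing to change
        }
        let mut out = self.prepended.clone();
        for (name, value) in input {
            match self.edits.get(name) {
                None => out.push((name.clone(), value.clone())),
                Some(edit) => {
                    match &edit.action {
                        Action::Keep => out.push((name.clone(), value.clone())),
                        Action::Drop => {}
                        Action::Replace(v) => out.push((name.clone(), v.clone())),
                    }
                    out.extend(edit.insert_after.iter().cloned());
                }
            }
        }
        out
    }
}
```

For example, `SparseTransform::new().insert_after("id", "date", "2024-01-01").drop_column("tmp")` stores two edits regardless of how wide the input row is. Building the transform costs time proportional to the number of changes, and the evaluator handles the pass-through of everything else.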
Update log replay to use the new transform capability, so that the cost
is `O(partition_columns)` instead of `O(schema_width)`. For
non-partitioned tables with column mapping mode enabled, this translates
to an empty (identity) transform, which the default engine expression
evaluator has been updated to optimize as a special case (just
`apply_schema` directly to the input and return).
Result: Scan times drop by about 44% in the `metadata_bench`
benchmark:
```
Benchmarking scan_metadata/scan_metadata: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 5.1s, or reduce sample count to 10.
scan_metadata/scan_metadata
time: [250.51 ms 253.72 ms 257.45 ms]
change: [-45.173% -44.306% -43.415%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
1 (5.00%) high mild
```
### This PR affects the following public APIs
Added a new `Expression::Transform` enum variant.
## How was this change tested?
New and existing unit tests, existing benchmarks.