StatisticsV2: statistics framework initial redesign for Datafusion#57
Fly-Style wants to merge 122 commits into synnada-ai:apache_main
…frastructure for stats top-down propagation and final bottom-up calculation
…tion phase; add compute_range function
…s, todos for the future
…and inequations distribution combinations
…ibutions with known ranges
berkaysynnada
left a comment
I really like how the code is organized, and your clean coding style, even though it’s still a draft.
While reviewing ExprStatisticGraph, I wondered whether introducing another distribution, a singleton, to represent constants and known values might improve the design. It seems like this could simplify the representation of literals significantly. However, I’m not sure whether it would complicate the computation process. Let’s discuss this too.
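To make the singleton idea concrete, here is a minimal sketch, assuming a simplified distribution enum. The `Dist` type, its variants, and the `add` helper are hypothetical names for illustration only and are not the PR's actual definitions. It shows how a literal could carry an exact value and fold cleanly in arithmetic:

```rust
/// Hypothetical, simplified distribution type; not the PR's actual definition.
#[derive(Debug, Clone, PartialEq)]
pub enum Dist {
    /// A value known exactly, e.g. a literal such as `5.0`.
    Singleton(f64),
    /// Uniform over the closed range [low, high].
    Uniform { low: f64, high: f64 },
    /// Nothing is known about the value.
    Unknown,
}

impl Dist {
    /// Mean of the distribution, where it is cheap to derive.
    pub fn mean(&self) -> Option<f64> {
        match self {
            Dist::Singleton(v) => Some(*v),
            Dist::Uniform { low, high } => Some((low + high) / 2.0),
            Dist::Unknown => None,
        }
    }

    /// Closed range of possible values, if known.
    pub fn range(&self) -> Option<(f64, f64)> {
        match self {
            Dist::Singleton(v) => Some((*v, *v)),
            Dist::Uniform { low, high } => Some((*low, *high)),
            Dist::Unknown => None,
        }
    }
}

/// Addition under this sketch: two constants fold into a constant, and a
/// constant shifts a uniform distribution. Everything else (including
/// uniform + uniform, which is not uniform) falls back to `Unknown`.
pub fn add(lhs: &Dist, rhs: &Dist) -> Dist {
    match (lhs, rhs) {
        (Dist::Singleton(a), Dist::Singleton(b)) => Dist::Singleton(a + b),
        (Dist::Singleton(c), Dist::Uniform { low, high })
        | (Dist::Uniform { low, high }, Dist::Singleton(c)) => Dist::Uniform {
            low: low + c,
            high: high + c,
        },
        _ => Dist::Unknown,
    }
}
```

With this shape, a literal needs no special casing in the graph: it is just a distribution whose range collapses to a single point. The cost is one extra arm in every operator's match, which is the complication worth discussing.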
I’ll also think more about the propagation of distributions. However, it seems unlikely that the current approach for range propagations can be improved further.
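For readers unfamiliar with range propagation, the following is a rough sketch of the idea for a single `a + b` node, using plain interval arithmetic. The `Range` struct and `propagate_add` function are hypothetical and written only for this illustration; DataFusion has its own interval type, and the PR's code may differ:

```rust
/// Hypothetical closed interval [lo, hi], used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Range {
    pub lo: f64,
    pub hi: f64,
}

impl Range {
    /// Bottom-up: range of `x + y` for x in `self`, y in `other`.
    pub fn add(self, other: Range) -> Range {
        Range { lo: self.lo + other.lo, hi: self.hi + other.hi }
    }

    /// Range of `x - y` for x in `self`, y in `other`.
    pub fn sub(self, other: Range) -> Range {
        Range { lo: self.lo - other.hi, hi: self.hi - other.lo }
    }

    /// Intersection of two ranges, or `None` if they do not overlap.
    pub fn intersect(self, other: Range) -> Option<Range> {
        let lo = self.lo.max(other.lo);
        let hi = self.hi.min(other.hi);
        if lo <= hi { Some(Range { lo, hi }) } else { None }
    }
}

/// Top-down step for `parent = a + b`: tighten the child ranges so they are
/// consistent with the parent's range. Returns `None` when no assignment of
/// `a` and `b` can satisfy the parent range.
pub fn propagate_add(parent: Range, a: Range, b: Range) -> Option<(Range, Range)> {
    // a = parent - b, so a must lie in (parent - b) as well as its old range.
    let new_a = a.intersect(parent.sub(b))?;
    // Likewise b = parent - a, using the already tightened range of a.
    let new_b = b.intersect(parent.sub(new_a))?;
    Some((new_a, new_b))
}
```

For example, with `a` and `b` both in [0, 10] and the parent constrained to [0, 5], the step narrows both children to [0, 5]; a parent range of [100, 200] is reported as infeasible.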
berkaysynnada
left a comment
I've sent a commit that has mostly minor code improvements. If you disagree with any of the changes, we can discuss them.
One further improvement I see is implementing the statistics methods for CastExpr, since it is used very widely. Another major part is ScalarFunctionExpr. If we can also implement them for one scalar function (we can pick the most trivial one), that would be highly beneficial.
Once we resolve these last items, @ozankabak will take a final look, and we can upstream this PR.
Bumps [testcontainers](https://github.com/testcontainers/testcontainers-rs) from 0.23.2 to 0.23.3. - [Release notes](https://github.com/testcontainers/testcontainers-rs/releases) - [Changelog](https://github.com/testcontainers/testcontainers-rs/blob/main/CHANGELOG.md) - [Commits](testcontainers/testcontainers-rs@0.23.2...0.23.3) --- updated-dependencies: - dependency-name: testcontainers dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [serde](https://github.com/serde-rs/serde) from 1.0.217 to 1.0.218. - [Release notes](https://github.com/serde-rs/serde/releases) - [Commits](serde-rs/serde@v1.0.217...v1.0.218) --- updated-dependencies: - dependency-name: serde dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <jonahgao@msn.com>
* build moves only tests+benches pending * unstable * some tests fixed * Mock MemorySourceConfig and DataSource * some cleanup * test pass but Mock is not efficient * temporary stable * one struct, test pass, cleaning pending * cleaning * more cleaning * clippy * 🧹🧹*cleaning*🧹🧹 * adding re-export * fix:cargo fmt * fix: doctest * fix: leftout doctest * fix: circular dependency * clean, rename, document, improve
* FileSource specific repartitioning * fix doc typo * remove * Avro doesn't support repartitioning
* Bump MSRV to 1.82, toolchain to 1.85 * Fix some clippy warnings * Fix more clippy warnings
* Add unit tests to FFI_ExecutionPlan * Add unit tests for FFI table source * Add round trip tests for volatility * Add unit tests for FFI insert op * Simplify string generation in unit test Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix drop of borrowed value --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…prPlanner, add `plan_aggregate` and `plan_window` to planner (apache#14689) * count planner * window * update slt * remove rule * rm rule * doc * fix name * fix name * fix test * tpch test * fix avro * rename * switch to count(*) * use count(*) * rename * doc * rename window function * fmt * rm print * upd logic * count null
* fix: normalize column names in table constraints * newline * Move slt * restore ddl.slt
apache#14815) * fix: we are missing the unlimited case for bounded streaming when using datafusion-cli * Address comments
- I wasn't able to quickly find where the MSRV was defined when filing apache#14808, so I would like to make it easier to find next time
* simplify fn signature * .
* Enable 'extended tests' on forks Allow contributors to run extended tests workflow if they wish to, just like they can run rust tests workflow on their forks, before opening a PR to DataFusion. GitHub allows enabling/disabling workflows in the web UI without needing to change workflow yaml file. * Remove unused crate dependencies Found by `cargo udeps`. Unfortunately there were false positives too. * one on workspace level
apache#14699 is merged
Rationale for this change
https://synnada.notion.site/Redesigning-and-Enhancing-the-Statistics-Framework-in-Datafusion-16bf46d2dab180448272dbbd1d1f7cea
What changes are included in this PR?
This patch presents a Statistics v.2 framework with the following main points:
Tables of statistic execution and propagation rules for PhysicalExpr-s:
Definitions
UF = Uniform, UN = Unknown, EXP = Exponential, GSS = Gaussian, BRN = Bernoulli
Binary arithmetical operators, evaluation.
Comparison operators, evaluation.
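As a rough illustration of what a binary-arithmetic evaluation rule looks like, here is a minimal sketch covering the distribution kinds named in the definitions. The `StatDist` enum, its field names, and the `evaluate_add` / `evaluate_sub` functions are hypothetical and are not taken from the PR. The only rule encoded is the standard closed form for independent Gaussians (means add or subtract, variances add); every other combination falls back to Unknown here, which may be more conservative than the PR's actual tables:

```rust
/// Hypothetical distribution kinds mirroring the abbreviations
/// UF, EXP, GSS, BRN and UN; not the PR's actual type.
#[derive(Debug, Clone, PartialEq)]
pub enum StatDist {
    Uniform { low: f64, high: f64 },
    Exponential { rate: f64 },
    Gaussian { mean: f64, variance: f64 },
    Bernoulli { p: f64 },
    Unknown,
}

/// Evaluate `lhs + rhs`, assuming the operands are independent.
pub fn evaluate_add(lhs: &StatDist, rhs: &StatDist) -> StatDist {
    match (lhs, rhs) {
        // Sum of independent Gaussians is Gaussian: means and variances add.
        (
            StatDist::Gaussian { mean: m1, variance: v1 },
            StatDist::Gaussian { mean: m2, variance: v2 },
        ) => StatDist::Gaussian { mean: m1 + m2, variance: v1 + v2 },
        // Assumption of this sketch: no closed form is tracked otherwise.
        _ => StatDist::Unknown,
    }
}

/// Evaluate `lhs - rhs`, assuming the operands are independent.
pub fn evaluate_sub(lhs: &StatDist, rhs: &StatDist) -> StatDist {
    match (lhs, rhs) {
        // Difference of independent Gaussians: means subtract, variances add.
        (
            StatDist::Gaussian { mean: m1, variance: v1 },
            StatDist::Gaussian { mean: m2, variance: v2 },
        ) => StatDist::Gaussian { mean: m1 - m2, variance: v1 + v2 },
        _ => StatDist::Unknown,
    }
}
```

Falling back to Unknown keeps the result sound when no closed form is available; a fuller version could still attach a range or summary statistics to the Unknown case.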
Are these changes tested?
These changes are extensively tested; many unit tests accompany this feature.