Skip to content

StatisticsV2: statistics framework initial redesign for Datafusion#57

Closed
Fly-Style wants to merge 122 commits intosynnada-ai:apache_mainfrom
Fly-Style:feat/statistics_v2
Closed

StatisticsV2: statistics framework initial redesign for Datafusion#57
Fly-Style wants to merge 122 commits intosynnada-ai:apache_mainfrom
Fly-Style:feat/statistics_v2

Conversation

@Fly-Style
Copy link

@Fly-Style Fly-Style commented Jan 8, 2025

Rationale for this change

https://synnada.notion.site/Redesigning-and-Enhancing-the-Statistics-Framework-in-Datafusion-16bf46d2dab180448272dbbd1d1f7cea

What changes are included in this PR?

This patch presents a Statistics v.2 framework with the following main points:

  • introduction to enum-based struct to support multiple distribution types, which initially include:
    • Uniform distribution (range)
    • Gaussian distribution (also known as normal)
    • Exponential distribution
    • Bernoulli distribution (holds probability, is used as resulting distribution of comparison operators),
    • Unknown distribution, abstracts any non-represented distribution.
  • tree-based stats calculation is divided into bottom-up evaluation and top-down propagation;

Tables of statistic execution and propagation rules for PhysicalExpr-s:

Definitions

Details UF = Uniform UN = Unknown EXP = Exponential GSS = Gaussian BRN = Bernoulli

Binary arithmetical operators, evaluation.

Details
Input 1 Input 2 (+) (-) (*) (/)
UF [a,b] UF [c,d] UN UN UN UN
UF [a,b] GSS N(μ,σ²) UN UN UN UN
UF [a,b] EXP Exp(λ) UN UN UN UN
UF [a,b] BRN UN UN UN UN
UF [a,b] UN UN UN UN UN
GSS N(μ₁,σ₁²) GSS N(μ₂,σ₂²) GSS GSS GSS UN
GSS N(μ,σ²) EXP Exp(λ) UN UN UN UN
GSS N(μ,σ²) BRN UN UN UN UN
GSS N(μ,σ²) UN UN UN UN UN
EXP Exp(λ₁) EXP Exp(λ₂) UN (gamma dist) UN UN UN
EXP Exp(λ) BRN UN UN UN UN
EXP Exp(λ) UN UN UN UN UN
BRN Exp(λ) BRN UN UN UN UN
BRN Exp(λ) UN UN UN UN UN
UN UN UN UN UN UN

Comparison operators, evaluation.

Details
Input 1 Input 2 Comparison (=, ≠) Inequality (>, >=) Inequality (<, <=)
UF [a,b] UF [c,d] BRN BRN BRN
GSS N(μ₁,σ₁²) GSS N(μ₂,σ₂²) UN UN(Ф-func) UN(Ф-func)
EXP Exp(λ₁) EXP Exp(λ₂) UN UN UN
BRN BRN BRN BRN BRN
UF [a,b] GSS N(μ,σ²) UN UN UN
UF [a,b] EXP Exp(λ) UN UN UN
UF [a,b] BRN UN? UN? UN?
UF [a,b] UN BRN? BRN? BRN?
GSS N(μ,σ²) EXP Exp(λ) UN UN UN
GSS N(μ,σ²) BRN BRN?/UN BRN?/UN BRN? /UN
GSS N(μ,σ²) UN UN UN UN
EXP Exp(λ) UN UN, estimate UN, estimate UN, estimate
EXP Exp(λ) BRN UN, estimate UN, estimate UN, estimate
UN BRN BRN/UN BRN/UN BRN/UN
UN UN BRN/UN BRN/UN BRN/UN

Are these changes tested?

These changes were massively tested, you can find a lot of unit tests coupled to this feature.

@Fly-Style Fly-Style changed the title StatisticsV2: statistics framework redesign for Datafusion StatisticsV2: statistics framework initial redesign for Datafusion Jan 16, 2025
Copy link
Collaborator

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how the code is organized, and your clean coding style, even though it’s still draft.

While reviewing ExprStatisticGraph, I wondered if introducing another distribution, singleton, to represent constants and known values, might improve us. It seems like this could simplify the representation of literals significantly. However, I’m not sure if it would complicate the computation process. Let’s discuss this too.

I’ll also think more about the propagation of distributions. However, it seems unlikely that the current approach for range propagations can be improved further.

@Fly-Style Fly-Style marked this pull request as ready for review January 26, 2025 18:58
Copy link
Collaborator

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've sent a commit that has mostly minor code improvements. If you see any arguable change, we can discuss it.

What I see as a further improvement is implementing CastExpr methods as it is used very widely. Another major part is for ScalarFunctionExpr's. If we can also practice one of them (we can select the most trivial one), that would be highly beneficial.

@berkaysynnada
Copy link
Collaborator

Once we resolve these last items, @ozankabak will take a final look, and we can upstream this PR.

dependabot bot and others added 15 commits February 21, 2025 16:53
Bumps [testcontainers](https://github.com/testcontainers/testcontainers-rs) from 0.23.2 to 0.23.3.
- [Release notes](https://github.com/testcontainers/testcontainers-rs/releases)
- [Changelog](https://github.com/testcontainers/testcontainers-rs/blob/main/CHANGELOG.md)
- [Commits](testcontainers/testcontainers-rs@0.23.2...0.23.3)

---
updated-dependencies:
- dependency-name: testcontainers
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [serde](https://github.com/serde-rs/serde) from 1.0.217 to 1.0.218.
- [Release notes](https://github.com/serde-rs/serde/releases)
- [Commits](serde-rs/serde@v1.0.217...v1.0.218)

---
updated-dependencies:
- dependency-name: serde
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <jonahgao@msn.com>
* build moves only tests+benches pending

* unstable

* some tests fixed

* Mock MemorySourceConfig and DataSource

* some cleanup

* test pass but Mock is not efficient

* temporary stable

* one struct, test pass, cleaning pending

* cleaning

* more cleaning

* clippy

* 🧹🧹*cleaning*🧹🧹

* adding re-export

* fix:cargo fmt

* fix: doctest

* fix: leftout doctest

* fix: circular dependency

* clean, rename, document, improve
* FileSource specific repartitioning

* fix doc typo

* remove

* Avro doesn't support repartitioning
* Bump MSRV to 1.82, toolchain to 1.85

* Fix some clippy warnings

* Fix more clippy warnings
* Add unit tests to FFI_ExecutionPlan

* Add unit tests for FFI table source

* Add round trip tests for volatility

* Add unit tests for FFI insert op

* Simplify string generation in unit test

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Fix drop of borrowed value

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Add unit tests to FFI_ExecutionPlan

* Add unit tests for FFI table source

* Add round trip tests for volatility

* Add unit tests for FFI insert op

* Simplify string generation in unit test

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Fix drop of borrowed value

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…prPlanner, add `plan_aggregate` and `plan_window` to planner (apache#14689)

* count planner

* window

* update slt

* remove rule

* rm rule

* doc

* fix name

* fix name

* fix test

* tpch test

* fix avro

* rename

* switch to count(*)

* use count(*)

* rename

* doc

* rename window funciotn

* fmt

* rm print

* upd logic

* count null
* fix: normalize column names in table constraints

* newline

* Move slt

* restore ddl.slt
apache#14815)

* fix: we are missing the unlimited case for bounded streaming when using datafusion-cli

* Address comments
- I wasn't able to quickly find where the MSRV was defined when filing  apache#14808 so I would like to make it easier to find nex time
* Enable 'extended tests' on forks

Allow contributors to run extended tests workflow if they wish to, just
like they can run rust tests workflow on their forks, before opening a
PR to DataFusion. GitHub allows enabling/disabling workflows in the web
UI without needing to change workflow yaml file.

* Remove unused crate dependencies

Found by `cargo udeps`. Unfortunately there were false positives too.

* one on workspace level
@berkaysynnada
Copy link
Collaborator

apache#14699 is merged

@Fly-Style Fly-Style deleted the feat/statistics_v2 branch March 3, 2025 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.