feat: make probabilistic optimizations optional and tunable in the config #1912

Merged
merged 1 commit into from
Aug 25, 2023

@abcpro1 abcpro1 commented Aug 24, 2023

Resolves #1875.

Probabilistic optimizations trade accuracy for lower memory consumption. In the set_processor, a Bloom filter is used, while the aggregation_processor and join_processor use hash tables that store only the hash of each key instead of the full key.
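To make the trade-off concrete, here is a minimal sketch of the hash-only idea, not the code in this PR. The type names `ExactKeySet` and `HashedKeySet` and the helper `hash_key` are hypothetical and exist only for illustration; the Bloom filter used by the set processor is not shown. The exact variant keeps every full key, while the hashed variant keeps a fixed 8 bytes per key and can therefore report a false positive when two different keys collide.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Accurate variant (hypothetical name): stores every full key.
#[derive(Default)]
pub struct ExactKeySet {
    keys: HashSet<Vec<u8>>,
}

impl ExactKeySet {
    /// Returns true if the key was not already present.
    pub fn insert(&mut self, key: &[u8]) -> bool {
        self.keys.insert(key.to_vec())
    }

    pub fn contains(&self, key: &[u8]) -> bool {
        self.keys.contains(key)
    }
}

/// Probabilistic variant (hypothetical name): stores only a 64-bit hash per key.
/// Memory per entry no longer depends on key length, but two distinct keys
/// with the same hash are indistinguishable.
#[derive(Default)]
pub struct HashedKeySet {
    hashes: HashSet<u64>,
}

/// Illustrative helper: hashes a key with the standard library's default hasher.
fn hash_key(key: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

impl HashedKeySet {
    /// Returns true if no key with this hash was already present.
    pub fn insert(&mut self, key: &[u8]) -> bool {
        self.hashes.insert(hash_key(key))
    }

    /// May return true for a key that was never inserted (hash collision).
    pub fn contains(&self, key: &[u8]) -> bool {
        self.hashes.contains(&hash_key(key))
    }
}
```

Both types answer the same questions for keys that were inserted; they differ only in what `contains` can say about keys that were not, which is where the accuracy loss comes from.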

This PR disables these optimizations by default and adds user-configurable flags to enable each of them separately.

The following example shows how to turn on probabilistic optimizations for each processor in the Dozer configuration.

flags:
  enable_probabilistic_optimizations:
    in_sets: true  # enable probabilistic optimizations in set operations (UNION, EXCEPT, INTERSECT); Default: false
    in_joins: true  # enable probabilistic optimizations in JOIN operations; Default: false
    in_aggregations: true  # enable probabilistic optimizations in aggregations (SUM, COUNT, MIN, etc.); Default: false
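As a rough sketch of how such flags could be consumed on the Rust side: the field names below mirror the YAML keys above, but the struct names, the `KeyStorage` enum, and the `join_key_storage` function are assumptions made for illustration and are not taken from the Dozer codebase. Deriving `Default` gives every flag the value `false`, which matches the documented default of having the optimizations off.

```rust
/// Hypothetical mirror of the `enable_probabilistic_optimizations` section.
/// `bool::default()` is `false`, so all optimizations start disabled.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct EnableProbabilisticOptimizations {
    pub in_sets: bool,
    pub in_joins: bool,
    pub in_aggregations: bool,
}

/// Hypothetical mirror of the top-level `flags` section.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Flags {
    pub enable_probabilistic_optimizations: EnableProbabilisticOptimizations,
}

/// Illustrative choice a processor could make based on its flag.
#[derive(Debug, PartialEq, Eq)]
pub enum KeyStorage {
    /// Accurate: keep the full keys.
    FullKeys,
    /// Probabilistic: keep only the hash of each key.
    HashedKeys,
}

/// Picks the key storage for joins from the `in_joins` flag only,
/// so each processor is toggled independently of the others.
pub fn join_key_storage(flags: &Flags) -> KeyStorage {
    if flags.enable_probabilistic_optimizations.in_joins {
        KeyStorage::HashedKeys
    } else {
        KeyStorage::FullKeys
    }
}
```

With `Flags::default()`, `join_key_storage` returns `KeyStorage::FullKeys`; setting only `in_joins` to `true` switches joins to `KeyStorage::HashedKeys` while leaving sets and aggregations on the accurate path.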

@abcpro1 abcpro1 force-pushed the optional-probabilistic-optimizations branch from c26da6e to 737d77e Compare August 24, 2023 17:15
@chubei chubei self-requested a review August 25, 2023 01:08

@chubei chubei left a comment


Looking great! Two comments I hope we can discuss.

dozer-cli/src/simple/executor.rs (outdated, resolved)
…ML config

Probabilistic optimization sacrifices accuracy in order to reduce memory consumption. In certain parts of the pipeline, a Bloom Filter is used ([set_processor](https://github.com/getdozer/dozer/blob/2e3ba96c3f4bdf9a691747191ab15617564d8ca2/dozer-sql/src/pipeline/product/set/set_processor.rs#L20)), while in other parts, hash tables that store only the hash of the keys instead of the full keys are used ([aggregation_processor](https://github.com/getdozer/dozer/blob/2e3ba96c3f4bdf9a691747191ab15617564d8ca2/dozer-sql/src/pipeline/aggregation/processor.rs#L59) and [join_processor](https://github.com/getdozer/dozer/blob/2e3ba96c3f4bdf9a691747191ab15617564d8ca2/dozer-sql/src/pipeline/product/join/operator.rs#L57-L58)).

This commit makes these optimizations disabled by default and offers user-configurable flags to enable each of these optimizations separately.

This is an example of how to turn on probabilistic optimizations for each processor in the Dozer configuration.

```
flags:
  enable_probabilistic_optimizations:
    in_sets: true # enable probabilistic optimizations in set operations (UNION, EXCEPT, INTERSECT); Default: false
    in_joins: true # enable probabilistic optimizations in JOIN operations; Default: false
    in_aggregations: true # enable probabilistic optimizations in aggregations (SUM, COUNT, MIN, etc.); Default: false
```
@abcpro1 abcpro1 force-pushed the optional-probabilistic-optimizations branch from 737d77e to 74871e0 Compare August 25, 2023 06:10
@abcpro1 abcpro1 requested a review from chubei August 25, 2023 06:14

@chubei chubei left a comment


@chubei chubei added this pull request to the merge queue Aug 25, 2023
Merged via the queue into getdozer:main with commit f5b6c7f Aug 25, 2023
18 checks passed
@abcpro1 abcpro1 deleted the optional-probabilistic-optimizations branch August 27, 2023 22:21
Development

Successfully merging this pull request may close these issues.

Allow switching between accurate vs probabilistic model in pipeline
2 participants