feat: make probabilistic optimizations optional and tunable in the config #1912

abcpro1 · 2023-08-24T16:17:24Z

Resolves #1875.

Probabilistic optimizations sacrifice accuracy to reduce memory consumption. The set processor (set_processor) uses a Bloom filter, while the aggregation and join processors (aggregation_processor and join_processor) use hash tables that store only the hash of each key instead of the full key.
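
To illustrate the hash-only trade-off, here is a minimal sketch. It is not the code in this PR: `TableKey`, `make_key`, and `count_by_key` are hypothetical names, and the standard library's `DefaultHasher` stands in for whatever hash function the pipeline actually uses. Storing a fixed-size `u64` instead of the full key saves memory, but two distinct keys that collide would be merged into one entry.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical hash table key: either the full key or only its 64-bit hash.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum TableKey {
    /// Exact mode: keeps the whole key, so distinct keys never merge.
    Accurate(String),
    /// Probabilistic mode: keeps 8 bytes per key; colliding keys merge.
    Hashed(u64),
}

/// Builds a table key, hashing it only when the optimization is enabled.
pub fn make_key(key: &str, probabilistic: bool) -> TableKey {
    if probabilistic {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        TableKey::Hashed(hasher.finish())
    } else {
        TableKey::Accurate(key.to_string())
    }
}

/// Counts occurrences per key, a stand-in for a COUNT aggregation.
pub fn count_by_key(keys: &[&str], probabilistic: bool) -> HashMap<TableKey, u64> {
    let mut counts = HashMap::new();
    for key in keys {
        *counts.entry(make_key(key, probabilistic)).or_insert(0) += 1;
    }
    counts
}
```

In exact mode the counts are always correct. In probabilistic mode a count can only be too high, and only if two different keys happen to share a hash.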

This PR disables these optimizations by default and adds user-configurable flags to enable each of them separately.

This is an example of how to turn on probabilistic optimizations for each processor in the Dozer configuration:

flags:
  enable_probabilistic_optimizations:
    in_sets: true  # enable probabilistic optimizations in set operations (UNION, EXCEPT, INTERSECT); Default: false
    in_joins: true  # enable probabilistic optimizations in JOIN operations; Default: false
    in_aggregations: true  # enable probabilistic optimizations in aggregations (SUM, COUNT, MIN, etc.); Default: false
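
As a rough sketch of how these flags could be represented on the Rust side (an assumption for illustration, not the actual Dozer types): `ProbabilisticOptimizationFlags`, `ProcessorKind`, and `enabled_for` are hypothetical names. Deriving `Default` gives every `bool` the value `false`, which matches the "Default: false" behavior described above.

```rust
/// Hypothetical mirror of the `flags.enable_probabilistic_optimizations` section.
/// `#[derive(Default)]` makes every field default to `false`.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct ProbabilisticOptimizationFlags {
    pub in_sets: bool,
    pub in_joins: bool,
    pub in_aggregations: bool,
}

/// Hypothetical processor categories that can opt in separately.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessorKind {
    Set,
    Join,
    Aggregation,
}

impl ProbabilisticOptimizationFlags {
    /// Returns whether the given processor kind should use the
    /// probabilistic (memory-saving, less accurate) code path.
    pub fn enabled_for(&self, kind: ProcessorKind) -> bool {
        match kind {
            ProcessorKind::Set => self.in_sets,
            ProcessorKind::Join => self.in_joins,
            ProcessorKind::Aggregation => self.in_aggregations,
        }
    }
}
```

A configuration that omits the section would then behave like `ProbabilisticOptimizationFlags::default()`, with every processor on the exact code path.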

chubei

Looking great! Two comments I hope we can discuss.

dozer-cli/src/simple/executor.rs

dozer-sql/src/pipeline/utils/record_hashtable_key.rs

…ML config

Probabilistic optimization sacrifices accuracy in order to reduce memory consumption. In certain parts of the pipeline, a Bloom Filter is used ([set_processor](https://github.com/getdozer/dozer/blob/2e3ba96c3f4bdf9a691747191ab15617564d8ca2/dozer-sql/src/pipeline/product/set/set_processor.rs#L20)), while in other parts, hash tables that store only the hash of the keys instead of the full keys are used ([aggregation_processor](https://github.com/getdozer/dozer/blob/2e3ba96c3f4bdf9a691747191ab15617564d8ca2/dozer-sql/src/pipeline/aggregation/processor.rs#L59) and [join_processor](https://github.com/getdozer/dozer/blob/2e3ba96c3f4bdf9a691747191ab15617564d8ca2/dozer-sql/src/pipeline/product/join/operator.rs#L57-L58)).

This commit makes these optimizations disabled by default and offers user-configurable flags to enable each of these optimizations separately.

This is an example of how to turn on probabilistic optimizations for each processor in the Dozer configuration.

```
flags:
  enable_probabilistic_optimizations:
    in_sets: true  # enable probabilistic optimizations in set operations (UNION, EXCEPT, INTERSECT); Default: false
    in_joins: true  # enable probabilistic optimizations in JOIN operations; Default: false
    in_aggregations: true  # enable probabilistic optimizations in aggregations (SUM, COUNT, MIN, etc.); Default: false
```

chubei

#1912 (comment)

github-actions bot added the doc-update-needed label Aug 24, 2023

abcpro1 force-pushed the optional-probabilistic-optimizations branch from c26da6e to 737d77e Compare August 24, 2023 17:15

chubei self-requested a review August 25, 2023 01:08

chubei reviewed Aug 25, 2023

dozer-cli/src/simple/executor.rs (outdated, resolved)

dozer-sql/src/pipeline/utils/record_hashtable_key.rs (resolved)

abcpro1 force-pushed the optional-probabilistic-optimizations branch from 737d77e to 74871e0 Compare August 25, 2023 06:10

abcpro1 requested a review from chubei August 25, 2023 06:14

chubei reviewed Aug 25, 2023

chubei approved these changes Aug 25, 2023

chubei added this pull request to the merge queue Aug 25, 2023

Merged via the queue into getdozer:main with commit f5b6c7f Aug 25, 2023
18 checks passed

abcpro1 deleted the optional-probabilistic-optimizations branch August 27, 2023 22:21

chubei added the doc-update-completed label Sep 25, 2023
