Skip to content

Latest commit

 

History

History
16 lines (13 loc) · 1.55 KB

README.md

File metadata and controls

16 lines (13 loc) · 1.55 KB

Distinction

Distinction is a simple rust package which exports a single function which solves the Distinct Elements problem in optimal* space and time complexity from the work of Chakraborty, Vinodchandran, and Meel. The F0-Estimator algorithm efficiently estimates the number of distinct elements in a data stream using probabilistic techniques. It initializes a probability p and an empty set 𝒳, then iterates over the stream, probabilistically adding elements to 𝒳 with probability p. If the size of 𝒳 reaches a predefined threshold, elements are discarded with a 50% probability, and p is halved, this ensures that 𝒳 remains small, and we use the calculated value thresh as our upper bound on the size of the set. This process continues until the end of the stream, at which point the estimated number of distinct elements is calculated as the size of 𝒳 divided by p. This approach ensures space efficiency by keeping 𝒳 small, leveraging probabilistic sampling to handle large data streams effectively.

*My implementation is not the most optimal implementation of the algorithm due to performing the operations according to the paper semi-naively, so you could probably make something faster if you wanted to.

Example

use distinction::{Gen, find_n_distinct};
let mut gen = Gen::new(None);
let stream = vec![1, 10, 20, 10, 10, 30, 20, 10, 20, 20, 1, 1, 1];
let eps = 0.1;
let delta = 0.005;
let n_distinct = find_n_distinct(&stream, eps, delta, Some(gen));
assert_eq!(n_distinct, 4);