Explode #56

Merged
merged 3 commits into from
May 8, 2017
@frankmcsherry frankmcsherry commented May 8, 2017

This PR introduces the explode operator, whose role in life is to move parts of the data that can be aggregated into the diff component, which differential dataflow will aggregate in-place. For example, if we had a collection of (name, salary) pairs and wanted the total salaries by name, the only way to do this within differential dataflow was

employees.group(|_name, salaries, output| {
    // Each entry pairs a salary with its count; weight and total them by hand.
    let sum = salaries.iter().map(|(sal,cnt)| sal * cnt).sum();
    output.push((sum, 1));
})

This is horrible for several reasons. Other than being verbose, differential dataflow is obliged to maintain the collection of salaries as distinct elements, because we could be computing the median or something horrible like that. We would prefer a way to explain to differential dataflow that it can accumulate the salary components.

The explode operator does this: it maps each input datum to a sequence of (data, diff) pairs, and produces the collection that accumulates all of the diffs for each distinct data value produced. The example above would become

employees.explode(|(name, salary)| Some((name, salary)))
         .count();
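
To make the accumulation concrete, here is a minimal sketch in plain Rust that models the semantics described above without using differential dataflow at all. The name `explode_model` and the `HashMap` representation are hypothetical, chosen only for illustration, and the sketch assumes that each produced diff is multiplied by the diff of the input record it came from.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical model of the explode semantics (not the operator itself).
/// `input` stands in for a collection as a list of (datum, diff) updates.
/// `logic` maps each datum to zero or more (data, diff) pairs.
fn explode_model<D1, D2, I, F>(input: &[(D1, isize)], logic: F) -> HashMap<D2, isize>
where
    D1: Clone,
    D2: Eq + Hash,
    I: IntoIterator<Item = (D2, isize)>,
    F: Fn(D1) -> I,
{
    let mut accumulated: HashMap<D2, isize> = HashMap::new();
    for (datum, diff) in input {
        for (data, weight) in logic(datum.clone()) {
            // Assumption: the produced diff is scaled by the input's diff,
            // then summed with every other diff for the same data.
            *accumulated.entry(data).or_insert(0) += weight * diff;
        }
    }
    // Data whose diffs cancel to zero are absent from the result.
    accumulated.retain(|_, total| *total != 0);
    accumulated
}
```

For the updates `(("alice", 100), 1)`, `(("alice", 50), 1)`, and `(("bob", 70), 2)` with the logic `|(name, salary)| Some((name, salary))`, this model accumulates `alice` to 150 and `bob` to 140, which is the per-name total that the `count()` in the example above would report.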

In addition to being more efficient, this is meant to be more idiomatic, in that our program does not need to understand the underlying differences, etc. It also means that we can introduce other operators (filter, map, join, etc.) before doing the final count accumulation.

It may be that we need a more idiomatic name, but I'm not entirely sure what to use (accumulate, because the result is an accumulation?). Any thoughts here would be welcome.

@frankmcsherry frankmcsherry merged commit 0e26ecb into master May 8, 2017
@frankmcsherry frankmcsherry deleted the explode branch May 8, 2017 09:55