Skip to content

Allow forcing streaming exchange for Mark Distinct#14216

Merged
wenleix merged 2 commits intoprestodb:masterfrom
aweisberg:lbm_streaming_mark_distinct
Mar 11, 2020
Merged

Allow forcing streaming exchange for Mark Distinct#14216
wenleix merged 2 commits intoprestodb:masterfrom
aweisberg:lbm_streaming_mark_distinct

Conversation

@aweisberg
Copy link
Contributor

@aweisberg aweisberg commented Mar 5, 2020

Mark distinct is enabled by default and introduces a stage per distinct column in a query in order to redistribute the tracking of distinct values across the entire cluster. With materialized exchange the cost of these additional exchanges is higher. Enabling streaming exchange if there is sufficient memory available to perform the distincting will reduce the cost of these additional stages while still reducing the memory used by other by non-Mark Distinct stages.

== RELEASE NOTES ==

General Changes
* Allow forcing streaming exchange for Mark Distinct using 'query.use-streaming-exchange-for-mark-distinct'

@aweisberg aweisberg force-pushed the lbm_streaming_mark_distinct branch from 21d0ef0 to 7425200 Compare March 5, 2020 21:13
@wenleix wenleix self-requested a review March 5, 2020 22:18
@wenleix
Copy link
Contributor

wenleix commented Mar 5, 2020

Hmm.. ideally we want to add the third option into exchange_materialization_strategy:

  • ALL
  • NONE
  • ALL_BUT_DISTINCT

But I can totally see this would make operation / rollout totally a mess. So it makes sense to have this new stream_mark_distinct_with_materialized_exchange session property (name TBD...) :). Just want to make sure we discussed this :)

@aweisberg
Copy link
Contributor Author

I don't think this is an exchange materialization strategy because we may have several different permutations of whether we want to stream or materialize. Enumerating the permutations would be difficult under one config option.

A strategy would be something more like "CBO_BASED_MATERIALIZATION" that makes smarter decisions.

@aweisberg aweisberg force-pushed the lbm_streaming_mark_distinct branch from 7425200 to 76c3ea2 Compare March 6, 2020 22:15
@wenleix
Copy link
Contributor

wenleix commented Mar 9, 2020

I don't think this is an exchange materialization strategy because we may have several different permutations of whether we want to stream or materialize. Enumerating the permutations would be difficult under one config option.

@aweisberg : By "permutation" I assume you mean there are several different orthogonal dimensions about whether do stream or materialize ? (e.g. based on operator ? :) )

In my opinion it's still an exchange materialization strategy, however, we decompose the strategy space into several orthogonal options, such as decision on Mark Distinct, etc.

Copy link
Contributor

@wenleix wenleix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Use static imports in TestAddExchangePlans". LGTM.

@rongrong
Copy link
Contributor

rongrong commented Mar 9, 2020

Not related to this change, but a related topic. We should change the default for use_mark_distinct to false.

Copy link
Contributor

@wenleix wenleix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Add stream mark distinct with LBM session property". One question about the condition.

For the commit message subject, "LBM" is an Facebook internal term, what about :

Allow stream exchange for mark distinct when materialized exchange is set

When MarkDistinctOperator is to execute queries with many COUNT(DISTINCT)
it requires separate exchange for every distinct COUNT(DISTINCT), which is 
prohibitively expensive for materialized exchange. 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: What about useStreamExchangeForMarkDistinct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right technically that is what this does. I'll change to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, query.use-stream-exchange-for-mark-distinct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: useStreamExchangeForMarkDitinct as variable name ? :)

@aweisberg aweisberg force-pushed the lbm_streaming_mark_distinct branch 2 times, most recently from 35c8f41 to 58bf6d0 Compare March 10, 2020 15:28
@aweisberg aweisberg requested a review from arhimondr March 10, 2020 20:17
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: static import

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It should be fine to inline it after the static import is applied

@aweisberg
Copy link
Contributor Author

@rongrong Regarding changing the default of use_mark_distinct. Where peak_task_memory is an issue use_mark_distinct can help. For T1 this is more interesting because we have more distributed memory than task memory.

@aweisberg
Copy link
Contributor Author

@wenleix regarding the commit message. I was trying to keep the first line under 50 characters. That's 73.

@aweisberg
Copy link
Contributor Author

@wenleix Also the description seems too specific WRT to COUNT(DISTINCT)? You can distinct the inputs to any aggregation column.

@aweisberg aweisberg force-pushed the lbm_streaming_mark_distinct branch 2 times, most recently from 412ad85 to c7071fe Compare March 11, 2020 19:10
@aweisberg aweisberg force-pushed the lbm_streaming_mark_distinct branch from c7071fe to 48aa362 Compare March 11, 2020 19:13
@aweisberg
Copy link
Contributor Author

@wenleix Finished landing changes. I went with a slightly different commit message and useStreamingExchangeForMarkDistinct instead of useStreamExchangeForMarkDistinct.

@aweisberg aweisberg requested a review from wenleix March 11, 2020 19:15
@wenleix wenleix changed the title LBM streaming mark distinct Allow forcing streaming exchange for Mark Distinct Mar 11, 2020
Copy link
Contributor

@wenleix wenleix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Will merge it once Travis is green.

(A side note -- do we know if this helps? 😃 )

When MarkDistinctOperator executes queries with distincted aggregation columns
it requires an exchange per column, which is which is prohibitively
expensive for materialized exchange.

Add "query.use-stream-exchange-for-mark-distinct" to force streaming
for mark distinct even if materialized exchange is enabled.
@aweisberg aweisberg force-pushed the lbm_streaming_mark_distinct branch from 48aa362 to 1e124c4 Compare March 11, 2020 20:25
@wenleix wenleix merged commit f9bf064 into prestodb:master Mar 11, 2020
@wenleix
Copy link
Contributor

wenleix commented Mar 11, 2020

Merged #14216, thanks for the contribution!

@caithagoras caithagoras mentioned this pull request Mar 29, 2020
9 tasks
@caithagoras caithagoras mentioned this pull request May 4, 2020
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants