
Conversation

highker commented Dec 24, 2019

HTTP is too unreliable to use for internal communication. Add Thrift support.

== RELEASE NOTES ==

General Changes
* Allow Presto nodes to shuffle data with the Thrift protocol. Use the config property `internal-communication.task-communication-protocol` to choose between HTTP and Thrift.
* Allow Presto nodes to announce state with the Thrift protocol. Use the config property `internal-communication.server-info-communication-protocol` to choose between HTTP and Thrift.
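
For illustration, a minimal config sketch using the two properties above; the exact accepted values (`THRIFT`/`HTTP` below) are an assumption based on this description, not verified against the configuration reference:

```properties
# Hypothetical values -- verify the accepted names against the actual configuration docs
internal-communication.task-communication-protocol=THRIFT
internal-communication.server-info-communication-protocol=THRIFT
```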

@highker force-pushed the thrift2 branch 2 times, most recently from 62b3ce0 to ec44775 on December 24, 2019 at 22:48
highker (Author) commented Dec 24, 2019

Early benchmark result. Improved performance and reliability. No query failure in 5 hours with prod workload.

[Screenshot: Screen Shot 2019-12-24 at 3 08 46 PM]

[Screenshot: Screen Shot 2019-12-24 at 3 07 55 PM]

wenleix (Contributor) commented Dec 24, 2019

Early benchmark result. Improved performance and reliability. No query failure in 5 hours with prod workload.

@highker Nice win for latency! -- So I assume this improves wall time for light queries with little data to exchange?

Not sure how to interpret the first image though. Why do we still see "Worker Jetty Threads" even with Thrift RPC?

highker (Author) commented Dec 24, 2019

@wenleix

So I assume this improves wall time for light queries with little data to exchange?

Ya, I think so.

Not sure how to interpret the first image though. Why do we still see "Worker Jetty Threads" even with Thrift RPC?

It just shows that we used very few Jetty threads, given that the Jetty thread pool is always full in prod. I didn't touch the task update/status part, so that tunnel still goes through HTTP. That should be easy to change as well on top of the framework.

wenleix (Contributor) commented Dec 24, 2019

It just shows that we used very few Jetty threads, given that the Jetty thread pool is always full in prod. I didn't touch the task update/status part, so that tunnel still goes through HTTP. That should be easy to change as well on top of the framework.

Yeah. Especially if we only want to migrate the RPC part (independent of migrating the encoding part, as there are recursive fields that Drift doesn't support well). Here is a POC of migrating HttpRemoteTask#sendUpdate (you probably have a similar POC already~): wenleix@8a562ff. Similar to what you changed, we need a more abstract RemoteTask that can switch between an HttpXClient and a ThriftXClient.
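
For readers skimming the thread, a rough sketch of the kind of abstraction being discussed; every type and method name below is hypothetical and only illustrates the shape, it is not the actual Presto or Drift code:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: a task-update client abstraction that can switch between HTTP and Thrift.
interface TaskUpdateClient
{
    CompletableFuture<Void> sendUpdate(byte[] serializedTaskUpdate);
}

class HttpTaskUpdateClient implements TaskUpdateClient
{
    @Override
    public CompletableFuture<Void> sendUpdate(byte[] serializedTaskUpdate)
    {
        // A real implementation would issue an async HTTP POST to the task URI.
        return CompletableFuture.completedFuture(null);
    }
}

class ThriftTaskUpdateClient implements TaskUpdateClient
{
    @Override
    public CompletableFuture<Void> sendUpdate(byte[] serializedTaskUpdate)
    {
        // A real implementation would call a Drift-generated async Thrift client.
        return CompletableFuture.completedFuture(null);
    }
}

class TaskUpdateClientFactory
{
    // Protocol selection mirroring the config switch introduced in this PR.
    static TaskUpdateClient create(String protocol)
    {
        return "THRIFT".equalsIgnoreCase(protocol)
                ? new ThriftTaskUpdateClient()
                : new HttpTaskUpdateClient();
    }
}
```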

arhimondr (Member) commented

Nice!

One note:

Drift uses native memory to buffer the entire request / response. This is needed to run blocking Thrift encoding / decoding. We should be tracking native memory utilization very closely when switching to Thrift.
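
A minimal sketch of one way to watch that utilization from the JVM side; it reads the generic direct buffer pool MXBean and is not a Presto- or Drift-specific metric (pooled native memory allocated outside NIO direct buffers may not show up here):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryProbe
{
    public static void main(String[] args)
    {
        // Report the JVM's "direct" buffer pool, where NIO direct buffer usage is accounted.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("direct buffers: count=%d, used=%d bytes, capacity=%d bytes%n",
                        pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }
}
```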

mayankgarg1990 commented

I didn't completely understand the graphs uploaded above. Can you add more context around those graphs?

highker (Author) commented Dec 26, 2019

@mayankgarg1990, the figures show that the cluster runs healthily and fast with a fanout of 600. The cluster is no longer blocked by Jetty threads.

@arhimondr, that is a very good point. We will monitor that for sure!

@highker force-pushed the thrift2 branch 4 times, most recently from 3e59787 to acb46e5 on December 27, 2019 at 00:07
@highker changed the title from "Support exchange with thrift" to "Support internal communication with thrift" on Dec 27, 2019
@highker highker requested a review from wenleix December 27, 2019 19:13
wenleix (Contributor) commented Jan 8, 2020

I will start taking a look at the PR. @tdcmeehan, I am wondering if you are also interested in taking a look at this? 😃

wenleix (Contributor) left a review comment

The first 3 commits LGTM. One question: do we want to use "*" for the artifact ID? It looks like we only use it when it's too cumbersome to list all artifacts. How many artifacts does Netty have? :)
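
Assuming the "*" in question refers to Maven's wildcard dependency exclusion (supported since Maven 3.2.1), here is a sketch of what it would look like; the enclosing dependency coordinates are placeholders:

```xml
<dependency>
    <groupId>some.group</groupId>
    <artifactId>some-artifact</artifactId>
    <exclusions>
        <!-- Excludes every artifact under io.netty instead of listing each one -->
        <exclusion>
            <groupId>io.netty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```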

wenleix (Contributor) left a review comment

"Initial support with Thrift RPC" LGTM. but definitely need someone else to review :P .

IIRC the announcer-based broadcast mechanism works well for worker nodes, but somehow doesn't work for the coordinator (think about "TableFinishOperator", which has COORDINATOR_ONLY partitioning). Maybe we want to add a comment in TestingPrestoServer.java where the Thrift server port announcement is done? :)

wenleix (Contributor) left a review comment

"Add thrift support for exchange server". LGTM. minor comments.

@highker force-pushed the thrift2 branch 2 times, most recently from 8a86533 to bdaf87e on January 16, 2020 at 22:07
James Sun and others added 18 commits on January 16, 2020 at 16:19. Among the commit messages:
* SyncMemoryBackend::handleCommand does not override a parent method. Remove the implementation to avoid the Netty dependency.
* Introduce RpcShuffleClient, which allows using a different RPC for shuffle.
* Refactor HttpPageBufferClient into PageBufferClient and HttpRpcShuffleClient, which implements RpcShuffleClient. PageBufferClient now only handles page buffering and scheduling logic; the actual RPC detail is handled in HttpRpcShuffleClient.
* The return result of acknowledgeResults does not need handling. Use a future to make the call non-blocking instead of feeding it to a thread pool.
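
A rough sketch of the split those commits describe; the method names and signatures below are illustrative guesses, not the real Presto interfaces:

```java
import java.net.URI;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative only: PageBufferClient would own buffering/scheduling and delegate the
// wire-level calls to an implementation of this interface (HTTP today, Thrift with this PR).
interface RpcShuffleClient
{
    // Fetch the next batch of serialized pages starting at the given token.
    CompletableFuture<List<byte[]>> getResults(URI location, long token);

    // Fire-and-forget acknowledgement: callers keep the future only to log failures,
    // instead of blocking on it or handing it to a thread pool.
    CompletableFuture<?> acknowledgeResultsAsync(URI location, long nextToken);

    // Tear down the remote output buffer once the client is done.
    CompletableFuture<?> abortResults(URI location);
}
```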
@highker highker merged commit 20c7af7 into prestodb:master Jan 17, 2020
caithagoras mentioned this pull request on Feb 20, 2020
guhanjie (Member) commented Apr 27, 2022

Early benchmark result. Improved performance and reliability. No query failure in 5 hours with prod workload.

[Screenshots: Screen Shot 2019-12-24 at 3 08 46 PM, Screen Shot 2019-12-24 at 3 07 55 PM]

@highker Could you explain the second pic?
I see "Percent" beside the Y-axis; what does it mean?
Is it latency percentiles? But it seems that sum(Y) is not 100 :p. And what about HTTP with the same workload?
And what are the direct differences (I mean the pros and cons) between HTTP and Thrift?

tdcmeehan (Contributor) commented

@guhanjie my guess is the Y axis is percent of the workload, and we're seeing a subsection of the X axis which might explain why it doesn't sum to 100.

FWIW, we later discovered that while this change did improve performance, it also increased native memory usage, which meant clusters needed to be restarted sooner than before. The Thrift communication library under the hood kept allocating ever-increasing native memory to buffer the shuffle output, because shipping the shuffle data over Thrift required copying it one more time than before.
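
As a made-up illustration of where the extra copy comes from (this is not the actual Presto/Drift code): with HTTP the already-serialized page bytes can be handed to the response as-is, while wrapping them in a Thrift message means copying them into the message's binary field before the whole message is encoded into the client's native buffer.

```java
import java.nio.ByteBuffer;

// Conceptual illustration only -- not actual Presto/Drift code.
class ShuffleCopyExample
{
    // HTTP path: the serialized page buffer goes to the response layer directly,
    // with no extra application-level copy.
    static ByteBuffer sendOverHttp(ByteBuffer serializedPages)
    {
        return serializedPages;
    }

    // Thrift path: the bytes are first copied into the binary field of a Thrift struct,
    // and the whole struct is then encoded again into native memory -- one more copy
    // of the page data than before.
    static byte[] wrapInThriftMessage(ByteBuffer serializedPages)
    {
        byte[] copy = new byte[serializedPages.remaining()];
        serializedPages.duplicate().get(copy);
        return copy;
    }
}
```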

After a certain point, we decided upon the following:

  1. Keep shuffle output as binary, but support async shuffles. This was an under-the-hood change to how we do shuffle over HTTP, making it more dynamic and streaming, and it avoids the need to copy the data one additional time.
  2. For coordinator-to-worker communication (all the other Task endpoints), we can use Thrift over HTTP. This is still a work in progress, but progress is slowly being made, and eventually all such communication can happen over Thrift.
