Skip to content

Verify data integrity in exchanges#3438

Merged
findepi merged 2 commits intotrinodb:masterfrom
findepi:checksum-v3
Apr 30, 2020
Merged

Verify data integrity in exchanges#3438
findepi merged 2 commits intotrinodb:masterfrom
findepi:checksum-v3

Conversation

@findepi
Copy link
Copy Markdown
Member

@findepi findepi commented Apr 15, 2020

In our testing, a cloud's network proved to be not reliable. We observed
data corruption when transmitting data over TCP.

Verify data integrity to prevent incorrect query results.

Optionally support retries when data corruption is detected.

@cla-bot cla-bot bot added the cla-signed label Apr 15, 2020
@findepi findepi force-pushed the checksum-v3 branch 2 times, most recently from c7459b7 to c8bb331 Compare April 15, 2020 10:09
@tooptoop4
Copy link
Copy Markdown
Contributor

got a rough idea on performance penalty?

@martint
Copy link
Copy Markdown
Member

martint commented Apr 15, 2020

got a rough idea on performance penalty?

It should be minimal. XXHash64 is a very efficient hash. On my Mac, it can process data at a rate of 14 GB per CPU second.

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Apr 15, 2020

CI green except for test-other-modules (started 5h 43m 20s ago)

AC

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Apr 16, 2020

Rebased.

@findepi findepi assigned electrum and martint and unassigned electrum and martint Apr 16, 2020
Copy link
Copy Markdown
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this surface to the user? It'd be ideal if it could be mapped to a proper error code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practise, this is reachable only when checksum verification is OFF. This is then treated as if IO error occurred and retried.

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Apr 21, 2020

Updated the retry code.

@findepi findepi force-pushed the checksum-v3 branch 2 times, most recently from 1960d08 to 65d5fae Compare April 22, 2020 09:17
@findepi
Copy link
Copy Markdown
Member Author

findepi commented Apr 22, 2020

Example failure (as visible in the final query state, eg in UI, or console) when running in default mode (ABORT):

io.prestosql.spi.PrestoException: Checksum verification failure on 172.22.0.2 when reading from http://172.22.0.5:8081/v1/task/20200422_115923_00009_y5gz9.2.0/results/0/1: Data corruption, read checksum: 0x000003f7, calculated checksum: 0x20b0610e25ba9db5
	at io.prestosql.operator.HttpPageBufferClient$1.onFailure(HttpPageBufferClient.java:437)
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1052)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.prestosql.operator.HttpPageBufferClient.ChecksumVerificationException: Data corruption, read checksum: 0x000003f7, calculated checksum: 0x20b0610e25ba9db5
	at io.prestosql.operator.HttpPageBufferClient$PageResponseHandler.verifyChecksum(HttpPageBufferClient.java:672)
	at io.prestosql.operator.HttpPageBufferClient$PageResponseHandler.handle(HttpPageBufferClient.java:650)
	at io.prestosql.operator.HttpPageBufferClient$PageResponseHandler.handle(HttpPageBufferClient.java:582)
	at io.airlift.http.client.jetty.JettyResponseFuture.processResponse(JettyResponseFuture.java:99)
	at io.airlift.http.client.jetty.JettyResponseFuture.completed(JettyResponseFuture.java:76)
	at io.airlift.http.client.jetty.BufferingResponseListener.onComplete(BufferingResponseListener.java:90)
	at org.eclipse.jetty.client.ResponseNotifier.notifyComplete(ResponseNotifier.java:218)
	at org.eclipse.jetty.client.ResponseNotifier.notifyComplete(ResponseNotifier.java:210)
	at org.eclipse.jetty.client.HttpReceiver.terminateResponse(HttpReceiver.java:543)
	at org.eclipse.jetty.client.HttpReceiver.terminateResponse(HttpReceiver.java:523)
	at org.eclipse.jetty.client.HttpReceiver.responseSuccess(HttpReceiver.java:486)
	at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.messageComplete(HttpReceiverOverHTTP.java:326)
	at org.eclipse.jetty.http.HttpParser.handleContentMessage(HttpParser.java:580)
	at org.eclipse.jetty.http.HttpParser.parseContent(HttpParser.java:1697)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1526)
	at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.parse(HttpReceiverOverHTTP.java:200)
	at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.process(HttpReceiverOverHTTP.java:141)
	at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.receive(HttpReceiverOverHTTP.java:75)
	at org.eclipse.jetty.client.http.HttpChannelOverHTTP.receive(HttpChannelOverHTTP.java:133)
	at org.eclipse.jetty.client.http.HttpConnectionOverHTTP.onFillable(HttpConnectionOverHTTP.java:156)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
	... 1 more

There is a one problem though. Before the query fails, the exception is logged multiple times.
This is because io.prestosql.operator.ExchangeClient#throwIfFailed rethrows the exception.

I don't see any easy fix for this and it is also pre-existing thing (failure mode is new, but the way
failures are handled isn't). I decided to ignore this for now. If the problem is as rare in practise as
it should be, this won't be a problem.

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Apr 22, 2020

In RETRY mode, the following gets logged (on the affected machine)

2020-04-22T18:26:08.280+0545	WARN	page-buffer-client-callback-7	io.prestosql.operator.HttpPageBufferClient	Checksum verification failure on 172.23.0.5 when reading from http://172.23.0.5:8081/v1/task/20200422_124105_00003_8496b.2.1/results/0/3, may be retried: Data corruption, read checksum: 0x000003f5, calculated checksum: 0x20b0610e25ba9db5

@findepi findepi mentioned this pull request Apr 22, 2020
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: aren't we typically ordering methods that usage is first and declaration follows?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateChecksum is placed right under writeSerializedPage, because they need to stay in sync.
i added a code comment

@findepi findepi force-pushed the checksum-v3 branch 2 times, most recently from a6d4aeb to bcc7267 Compare April 22, 2020 18:08
In our testing, a cloud's network proved to be not reliable. We observed
data corruption when transmitting data over TCP between Presto nodes
(internal communication unsecured, no compression).

Verify data integrity to prevent incorrect query results.

Optionally retry when data corruption is detected.
@dain
Copy link
Copy Markdown
Member

dain commented Apr 29, 2020

I skimmed this. Seem good to me.

@dain dain removed their request for review April 29, 2020 00:46
@findepi findepi merged commit 4d80836 into trinodb:master Apr 30, 2020
@findepi findepi deleted the checksum-v3 branch April 30, 2020 23:34
@findepi findepi mentioned this pull request Apr 30, 2020
8 tasks
@electrum electrum added this to the 333 milestone May 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants