Verify data integrity in exchanges #3438
presto-main/src/main/java/io/prestosql/sql/analyzer/FeaturesConfig.java
presto-main/src/main/java/io/prestosql/operator/HttpPageBufferClient.java
c7459b7 to c8bb331
Got a rough idea of the performance penalty?
presto-main/src/main/java/io/prestosql/operator/HttpPageBufferClient.java
presto-main/src/main/java/io/prestosql/server/PagesResponseWriter.java
It should be minimal. XXHash64 is a very efficient hash. On my Mac, it can process data at a rate of 14 GB per CPU second.
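To illustrate the kind of check being discussed, here is a minimal sketch of folding page bytes into one checksum and spotting a flipped bit. It is not the code from this PR: `PageChecksumSketch` and `checksum` are hypothetical names, and `java.util.zip.CRC32` stands in for XXHash64 only because it ships with the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Presto. CRC32 is a stand-in for
// XXHash64 so that the sketch needs nothing beyond the JDK.
final class PageChecksumSketch
{
    private PageChecksumSketch() {}

    // Folds every page, in order, into a single checksum value.
    static long checksum(List<byte[]> pages)
    {
        CRC32 crc = new CRC32();
        for (byte[] page : pages) {
            crc.update(page, 0, page.length);
        }
        return crc.getValue();
    }

    public static void main(String[] args)
    {
        byte[] page = "some serialized page".getBytes(StandardCharsets.UTF_8);
        long sent = checksum(List.of(page));

        // Simulate a single bit flipped in transit.
        byte[] corrupted = page.clone();
        corrupted[3] ^= 0x01;
        long received = checksum(List.of(corrupted));

        System.out.println(sent == received ? "match" : "mismatch, data corrupted");
    }
}
```

The sender would attach the checksum to the response and the receiver would recompute it over the bytes it actually got. The hashing cost is one linear pass over data that is already in memory.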
CI green except for AC
Rebased.
presto-main/src/main/java/io/prestosql/server/PagesResponseWriter.java
presto-main/src/test/java/io/prestosql/operator/TestingExchangeHttpClientHandler.java
presto-main/src/main/java/io/prestosql/execution/buffer/PagesSerdeUtil.java
presto-main/src/main/java/io/prestosql/operator/HttpPageBufferClient.java
presto-main/src/main/java/io/prestosql/sql/analyzer/FeaturesConfig.java
presto-main/src/test/java/io/prestosql/operator/MockExchangeRequestProcessor.java
How does this surface to the user? It'd be ideal if it could be mapped to a proper error code.
In practice, this is reachable only when checksum verification is OFF. It is then treated as if an I/O error occurred, and the request is retried.
presto-main/src/main/java/io/prestosql/operator/HttpPageBufferClient.java
presto-main/src/main/java/io/prestosql/operator/HttpPageBufferClient.java
Updated the retry code.
1960d08 to 65d5fae
Example failure (as visible in the final query state, e.g. in the UI or console) when running in the default mode (ABORT):
There is one problem, though. Before the query fails, the exception is logged multiple times. I don't see any easy fix for this, and it is also a pre-existing issue (the failure mode is new, but the way
In RETRY mode, the following gets logged (on the affected machine):
nit: don't we typically order methods so that the usage comes first and the declaration follows?
updateChecksum is placed right under writeSerializedPage because they need to stay in sync.
I added a code comment.
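A rough sketch of why the two methods are kept next to each other: the checksum has to cover exactly the fields the writer emits, in the same order. Only the method names `writeSerializedPage` and `updateChecksum` come from the comment above. `SerdeSyncSketch`, `SimplePage`, the field layout, and the use of CRC32 in place of XXHash64 are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical, simplified serde; not the code from PagesSerdeUtil.
final class SerdeSyncSketch
{
    private SerdeSyncSketch() {}

    // Assumed page shape: a row count plus an opaque payload.
    static final class SimplePage
    {
        final int positionCount;
        final byte[] payload;

        SimplePage(int positionCount, byte[] payload)
        {
            this.positionCount = positionCount;
            this.payload = payload;
        }
    }

    static void writeSerializedPage(DataOutputStream out, SimplePage page)
            throws IOException
    {
        out.writeInt(page.positionCount);
        out.writeInt(page.payload.length);
        out.write(page.payload);
    }

    // Must cover the same fields, in the same order, as writeSerializedPage.
    static void updateChecksum(CRC32 checksum, SimplePage page)
    {
        checksum.update(intBytes(page.positionCount));
        checksum.update(intBytes(page.payload.length));
        checksum.update(page.payload);
    }

    private static byte[] intBytes(int value)
    {
        // Big-endian, matching DataOutputStream.writeInt.
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    // Convenience wrapper: the bytes writeSerializedPage puts on the wire.
    static byte[] serialize(SimplePage page)
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeSerializedPage(out, page);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args)
    {
        SimplePage page = new SimplePage(2, new byte[] {10, 20, 30});

        CRC32 overWireBytes = new CRC32();
        overWireBytes.update(serialize(page));

        CRC32 overFields = new CRC32();
        updateChecksum(overFields, page);

        // Both checksums cover identical bytes as long as the two methods agree.
        System.out.println(overWireBytes.getValue() == overFields.getValue());
    }
}
```

If a field is added to the writer but not to the checksum (or the order changes), the two values diverge, which is the drift the code comment is meant to guard against.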
a6d4aeb to bcc7267
In our testing, a cloud's network proved unreliable. We observed data corruption when transmitting data over TCP between Presto nodes (internal communication unsecured, no compression). Verify data integrity to prevent incorrect query results. Optionally retry when data corruption is detected.
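The behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the `Mode` enum (with `NONE` standing for verification turned off), the `Response` type, `fetch`, and the `maxAttempts` parameter are hypothetical, and CRC32 again stands in for XXHash64.

```java
import java.util.function.Supplier;
import java.util.zip.CRC32;

// Hypothetical receiver-side logic; names and structure are assumptions.
final class ExchangeVerificationSketch
{
    private ExchangeVerificationSketch() {}

    enum Mode { NONE, ABORT, RETRY }

    // Assumed response shape: the body plus the checksum the sender declared.
    static final class Response
    {
        final byte[] body;
        final long declaredChecksum;

        Response(byte[] body, long declaredChecksum)
        {
            this.body = body;
            this.declaredChecksum = declaredChecksum;
        }
    }

    static long checksum(byte[] body)
    {
        CRC32 crc = new CRC32();
        crc.update(body);
        return crc.getValue();
    }

    static byte[] fetch(Supplier<Response> request, Mode mode, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++) {
            Response response = request.get();
            // NONE skips verification entirely; otherwise the body must match.
            if (mode == Mode.NONE || checksum(response.body) == response.declaredChecksum) {
                return response.body;
            }
            if (mode == Mode.ABORT || attempt >= maxAttempts) {
                throw new IllegalStateException(
                        "Checksum verification failure after " + attempt + " attempt(s)");
            }
            // RETRY: loop and request the same data again.
        }
    }

    public static void main(String[] args)
    {
        byte[] good = {1, 2, 3};
        long declared = checksum(good);
        int[] calls = {0};

        // First response is corrupted in transit, later ones are clean.
        Supplier<Response> flaky = () -> calls[0]++ == 0
                ? new Response(new byte[] {1, 2, 7}, declared)
                : new Response(good, declared);

        byte[] result = fetch(flaky, Mode.RETRY, 3);
        System.out.println("requests made: " + calls[0] + ", bytes received: " + result.length);
    }
}
```

With `Mode.ABORT` the same corrupted first response would fail immediately instead of being requested again, and with `Mode.NONE` the corrupted bytes would be accepted unchecked.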
I skimmed this. Seems good to me.
In our testing, a cloud's network proved unreliable. We observed
data corruption when transmitting data over TCP.
Verify data integrity to prevent incorrect query results.
Optionally support retries when data corruption is detected.