Move exchange encryption for FTE to the engine#14941
arhimondr merged 6 commits into trinodb:master from
Conversation
cf58703 to 861bf7d
I do not think we should enforce that. It should be configurable, but enabled by default. I can easily think of a deployment scenario where you do not care about encryption at all if storage is not really "external".
Agree, let me add an FTE specific property. I would prefer not to make this property generic though, due to non-obvious interactions between enabling HTTPS in the cluster and the exchange encryption.
nit: the comment is not 100% valid right now (it is not mandatory any more)
this probably should also be configurable (with defaulting to plain http), to enable different deployment scenarios.
I thought about it. But then it becomes unclear what would happen when the endpoint is explicitly specified as https but SSL is disabled with a property. Thus I decided to keep it simple and default to HTTP when the endpoint is not explicitly specified. Whenever HTTPS is desired, the end user can point the system at an HTTPS endpoint.
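For illustration, opting into encrypted communication by specifying an HTTPS endpoint explicitly might look like the following. The property name is an assumption modeled on ExchangeS3Config and may differ from the actual release:

```properties
# Hypothetical snippet: the exact property name in ExchangeS3Config may
# differ; shown only to illustrate pointing the exchange at an HTTPS
# endpoint explicitly instead of relying on the plain-HTTP default.
exchange.s3.endpoint=https://s3.us-east-1.amazonaws.com
```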
...change-filesystem/src/main/java/io/trino/plugin/exchange/filesystem/s3/ExchangeS3Config.java
I am not familiar with this test. Just curious if we are not losing some coverage with this change.
The goal of this test is to ensure that the file size limit is properly applied at the connector level. Having a fixed number of tasks makes it more predictable while still validating the intended behaviour.
core/trino-main/src/main/java/io/trino/execution/buffer/PagesSerde.java
check that slice has byte array
(are we sure that this is always the case actually?)
Yeah, it should always be the case as of today. I'm going to add a check.
Are we ensuring somehow that source.getSlice() is at least blockSize long?
Oh I think I misread. This is not a temporary space, but something explicitly written when encrypting.
Still it does not seem to me we are verifying that source.getSlice() looks valid (is long enough).
I wanted to avoid additional checks for efficiency reasons. The idea is that if something is off it will fail down the stack (which is less readable, but probably a reasonable price to pay for eliminating extra checks; then again, maybe I'm over-optimizing).
Do you need to return bytesPreserved? Isn't it just sink.available()?
The idea is to preserve the bytes that haven't been read (for example if there are fewer than 8 bytes available and a single long is being read). This method returns the number of bytes preserved to offset the write index. It felt like it might be easier to follow this way, but yeah, it can also be implemented as:
int bytesPreserved = sink.available();
sink.rollOver();
I don't have a strong opinion here.
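A minimal sketch of the roll-over idea discussed here, assuming a buffer with separate read and write indices. Names are illustrative, not Trino's actual implementation:

```java
// Hypothetical sketch of the "roll over" discussed above: when fewer bytes
// remain than a reader needs (e.g. fewer than 8 while reading a long), the
// unread tail is copied to the front so the buffer can be refilled.
public class RollingBuffer
{
    private final byte[] buffer;
    private int readIndex;
    private int writeIndex;

    public RollingBuffer(int capacity)
    {
        buffer = new byte[capacity];
    }

    public int available()
    {
        return writeIndex - readIndex;
    }

    public void write(byte value)
    {
        buffer[writeIndex++] = value;
    }

    public byte read()
    {
        return buffer[readIndex++];
    }

    // Moves unread bytes to the front; returns how many were preserved so
    // the caller knows where fresh input may be appended
    public int rollOver()
    {
        int bytesPreserved = available();
        System.arraycopy(buffer, readIndex, buffer, 0, bytesPreserved);
        readIndex = 0;
        writeIndex = bytesPreserved;
        return bytesPreserved;
    }

    public static void main(String[] args)
    {
        RollingBuffer buffer = new RollingBuffer(8);
        for (int i = 0; i < 5; i++) {
            buffer.write((byte) i);
        }
        buffer.read(); // consume one byte
        System.out.println(buffer.rollOver()); // 4 unread bytes preserved
    }
}
```

As the comment above notes, the return value is indeed equivalent to calling sink.available() just before the roll-over; returning it is purely a readability choice.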
losipiuk left a comment
looks good. Some comments but nothing major. It feels like test coverage is good.
861bf7d to e928149
Updated
I like the shorter version of the property but it is kind of the first time we would use acronyms in config.
On the other hand, the fault-tolerant-execution- prefix is super annoying.
@martint do you think it is ok to use fte- here (and rename other properties this way?). I thought about alternative but naming is hard.
I'd prefer not to use acronyms. It's not a big deal if the name is long (alternatively, we could try to come up with a shorter, more descriptive name). However, it shouldn't be a prefix such that properties are all named fault-tolerant-execution-xxxx. Instead, we should turn it into a hierarchical property: fault-tolerant-execution.xxx (note the .)
One more question. Can the page size grow now compared to what it used to be (due to encryption)? If so, we may run into the problem coming from the default limit on HTTP response size imposed by the airlift HTTP server. I think it is 16MB by default.
Benchmarks
Details https://gist.github.com/arhimondr/b200e1d76752a2dc0f1063ab13863dcf
It can grow by 256 bits (plaintext IV) + 248 bits (max padding) + 256 bits (encrypted IV for verification) + 32 bits (encrypted length), for a total of 99 bytes per block (64kb by default), which should in theory result in a ~0.15% increase in page size. So in theory there's a tiny chance of pushing a query over the limit.
We don't enable encryption for pipelined execution. However, in fault tolerant execution there's also a limit on max page size. But in fault tolerant execution we also enable compression, which makes the limits even more "obscure".
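The overhead estimate above can be checked with a quick calculation. The bit counts and the 64kb default block size are taken from the comment; the class name is illustrative:

```java
// Per-block encryption overhead, using the numbers quoted in the
// discussion above: 256 + 248 + 256 + 32 bits = 99 bytes per 64KB block.
public class EncryptionOverhead
{
    static final int PLAINTEXT_IV_BITS = 256;
    static final int MAX_PADDING_BITS = 248;
    static final int ENCRYPTED_IV_BITS = 256;
    static final int ENCRYPTED_LENGTH_BITS = 32;
    static final int BLOCK_SIZE_BYTES = 64 * 1024; // 64kb default block

    static int perBlockOverheadBytes()
    {
        return (PLAINTEXT_IV_BITS + MAX_PADDING_BITS + ENCRYPTED_IV_BITS + ENCRYPTED_LENGTH_BITS) / 8;
    }

    static double overheadPercent()
    {
        return 100.0 * perBlockOverheadBytes() / BLOCK_SIZE_BYTES;
    }

    public static void main(String[] args)
    {
        System.out.printf("%d bytes per block, ~%.2f%%%n", perBlockOverheadBytes(), overheadPercent());
    }
}
```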
Could swapping the source and sink buffers achieve the same goal without the need to copy the contents?
In theory it should be possible, but only for the case when both compression and encryption are enabled. When encryption is disabled the sink buffer is the write buffer of a serialized page, and usually much larger than a temporary buffer used as an intermediate. Given how expensive compression and encryption are (10x decrease in throughput for serialize and 5x decrease in throughput for deserialize), I'm not sure how much is to be gained here by avoiding a copy in this specific case.
Small note for more potential improvements to serialization performance:
You could add a public method writeInts(int[] values, int offset, int length) for block encoders to use and implement it using writeBytes(Slices.wrappedIntArray(…))
(as well as other primitive array types)
Currently the write[primitive] methods are only used for writing length prefixes before writing the actual data. Data stored in int[]/long[]/... arrays is written through write(Slice, ...), which is performed by an unsafe memory copy.
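The bulk-write suggestion can be sketched with java.nio standing in for airlift's Slice (names are illustrative, not the Trino API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the suggested bulk writeInts: instead of one write per int,
// view the target region as an IntBuffer and copy the int[] in a single
// bulk operation. ByteBuffer stands in for airlift's Slice here.
public class BulkIntWriter
{
    static void writeInts(ByteBuffer sink, int[] values, int offset, int length)
    {
        // Bulk-copy values[offset .. offset+length) starting at sink's position
        ByteBuffer view = sink.slice().order(sink.order());
        view.asIntBuffer().put(values, offset, length);
        sink.position(sink.position() + length * Integer.BYTES);
    }

    public static void main(String[] args)
    {
        ByteBuffer sink = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        writeInts(sink, new int[] {1, 2, 3, 4}, 1, 3); // writes 2, 3, 4
        sink.flip();
        System.out.println(sink.getInt()); // 2
    }
}
```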
What if we don’t have any pending contents in the output buffer? Seems like encryption might force out a new block with the IV and padding.
Good catch. Let me add a check.
Actually it looks like the source buffer should never be empty at this point unless the serializer writes nothing (which is not possible today, as the number of channels is always written unconditionally even for pages with 0 rows and 0 columns).
Assuming that the serializer always writes something, the buffer should always be non-empty at this point, as the buffer is only flushed when it is not possible to write more. For example, if the available space is 1 byte and 1 byte is about to be written, it will be added to the buffer and a flush won't be triggered immediately; instead the buffer will be flushed on close.
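The flush semantics described here can be sketched as follows (illustrative, not Trino's code; it assumes individual writes never exceed the buffer capacity):

```java
import java.io.ByteArrayOutputStream;

// Sketch of the flush behavior discussed above: a write triggers a flush
// only when the new bytes do not fit, so the buffer is never left empty
// after a write; the remainder goes out on close().
public class FlushOnFullSink
{
    private final ByteArrayOutputStream flushed = new ByteArrayOutputStream();
    private final byte[] buffer;
    private int position;

    public FlushOnFullSink(int capacity)
    {
        buffer = new byte[capacity];
    }

    public void write(byte[] bytes)
    {
        // Flush first only if the incoming bytes do not fit
        if (bytes.length > buffer.length - position) {
            flush();
        }
        System.arraycopy(bytes, 0, buffer, position, bytes.length);
        position += bytes.length;
    }

    public void close()
    {
        flush();
    }

    public int buffered()
    {
        return position;
    }

    public int flushedBytes()
    {
        return flushed.size();
    }

    private void flush()
    {
        flushed.write(buffer, 0, position);
        position = 0;
    }

    public static void main(String[] args)
    {
        FlushOnFullSink sink = new FlushOnFullSink(4);
        sink.write(new byte[3]);
        sink.write(new byte[3]); // does not fit, so the first 3 bytes flush
        System.out.println(sink.buffered() + " buffered, " + sink.flushedBytes() + " flushed");
        sink.close();
    }
}
```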
Looks like this is already added to maxEncryptedSize
The IV block is actually written twice. Once in plain text explicitly (needed to initialize the Cipher for decryption) and once encrypted implicitly by the Cipher to perform decryption validation. The size of the encrypted IV written implicitly by the Cipher is accounted for in maxEncryptedSize, while the size of the plaintext IV is not (as in theory it can be sent over an alternative communication channel).
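A self-contained sketch of this double-IV scheme, assuming AES/CBC with a 16-byte IV (class and method names are illustrative, not Trino's code):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Arrays;

// Illustrative sketch of the scheme described above: the IV is written once
// in plaintext (to initialize the decrypting Cipher) and once inside the
// ciphertext (so decryption can validate it).
public class IvValidatingCodec
{
    private static final int IV_BYTES = 16; // AES/CBC block size, for illustration

    static byte[] encrypt(SecretKey key, byte[] payload)
    {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            // Prepend the IV to the plaintext so an encrypted copy of it
            // travels inside the ciphertext as well
            byte[] ciphertext = cipher.doFinal(concat(iv, payload));
            // Plaintext IV first, then the ciphertext
            return concat(iv, ciphertext);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static byte[] decrypt(SecretKey key, byte[] data)
    {
        try {
            byte[] iv = Arrays.copyOfRange(data, 0, IV_BYTES);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] decrypted = cipher.doFinal(data, IV_BYTES, data.length - IV_BYTES);
            // Validation: the decrypted stream must start with the plaintext IV
            if (!Arrays.equals(iv, Arrays.copyOfRange(decrypted, 0, IV_BYTES))) {
                throw new IllegalStateException("IV validation failed");
            }
            return Arrays.copyOfRange(decrypted, IV_BYTES, decrypted.length);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static boolean roundTrips(byte[] payload)
    {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            SecretKey key = generator.generateKey();
            return Arrays.equals(payload, decrypt(key, encrypt(key, payload)));
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private static byte[] concat(byte[] first, byte[] second)
    {
        return ByteBuffer.allocate(first.length + second.length).put(first).put(second).array();
    }

    public static void main(String[] args)
    {
        System.out.println(roundTrips("serialized page".getBytes())); // true
    }
}
```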
Gotcha, didn’t notice that in the logic while reviewing earlier. The extra IV for validation wouldn’t be necessary if we switched to AES/GCM and would also eagerly detect ciphertext corruption without a separate output checksumming scheme, but performance is I think just slightly worse than CBC last I checked.
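The AES/GCM alternative mentioned above can be sketched as follows; the authentication tag makes ciphertext corruption fail eagerly at decrypt time, with no separate IV-validation or checksumming scheme (illustrative only):

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Demonstrates GCM's eager corruption detection: decrypting tampered
// ciphertext throws AEADBadTagException instead of returning garbage.
public class GcmTamperCheck
{
    static boolean tamperingDetected()
    {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            SecretKey key = generator.generateKey();
            byte[] iv = new byte[12]; // 96-bit nonce, the recommended GCM size
            new SecureRandom().nextBytes(iv);

            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal("serialized page".getBytes());

            ciphertext[0] ^= 1; // corrupt a single bit

            cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
            cipher.doFinal(ciphertext);
            return false; // corruption went unnoticed
        }
        catch (AEADBadTagException e) {
            return true; // GCM detected the corruption
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(tamperingDetected()); // true
    }
}
```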
9acbf55 to 4b957bf
Micro benchmarks: Before: After:
Encryption is now done by the engine
The data for exchange is encrypted by the engine. If encrypted communication is preferred, it can be enabled by specifying an HTTPS-based endpoint explicitly.
Refactor PagesSerde to avoid:
- Unnecessary memory copy
- Unnecessary Cipher initialization
- Unnecessary allocations when jumbo pages (>4MB) are serialized

This is done by implementing block based encryption and compression. Instead of trying to encrypt/compress an entire page, a page in serialized form is split into multiple fixed size blocks (64kb by default).

Benchmark results

Before:
```
Benchmark                        (compressed)  (encrypted)  (randomSeed)   Mode  Cnt     Score    Error  Units
BenchmarkPagesSerde.deserialize          true         true          1000  thrpt   10  1194.138 ±  7.442  ops/s
BenchmarkPagesSerde.deserialize          true        false          1000  thrpt   10  2281.985 ±  8.634  ops/s
BenchmarkPagesSerde.deserialize         false         true          1000  thrpt   10  1416.509 ±  2.928  ops/s
BenchmarkPagesSerde.deserialize         false        false          1000  thrpt   10  5891.232 ± 14.932  ops/s
BenchmarkPagesSerde.serialize            true         true          1000  thrpt   10   312.647 ±  0.800  ops/s
BenchmarkPagesSerde.serialize            true        false          1000  thrpt   10   601.053 ±  1.522  ops/s
BenchmarkPagesSerde.serialize           false         true          1000  thrpt   10   451.260 ±  0.996  ops/s
BenchmarkPagesSerde.serialize           false        false          1000  thrpt   10  3446.756 ±  6.949  ops/s
```

After:
```
Benchmark                        (compressed)  (encrypted)  (randomSeed)   Mode  Cnt     Score    Error  Units
BenchmarkPagesSerde.deserialize          true         true          1000  thrpt   10  1732.440 ±  4.044  ops/s
BenchmarkPagesSerde.deserialize          true        false          1000  thrpt   10  3059.334 ±  5.412  ops/s
BenchmarkPagesSerde.deserialize         false         true          1000  thrpt   10  2093.146 ±  3.505  ops/s
BenchmarkPagesSerde.deserialize         false        false          1000  thrpt   10  5906.148 ± 13.923  ops/s
BenchmarkPagesSerde.serialize            true         true          1000  thrpt   10   329.150 ±  1.269  ops/s
BenchmarkPagesSerde.serialize            true        false          1000  thrpt   10   617.507 ±  1.808  ops/s
BenchmarkPagesSerde.serialize           false         true          1000  thrpt   10   493.182 ±  0.687  ops/s
BenchmarkPagesSerde.serialize           false        false          1000  thrpt   10  3816.939 ±  7.984  ops/s
```
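The block-based scheme from the commit message can be sketched as follows. Deflater stands in for the codecs Trino actually uses, and the class is illustrative, not the real PagesSerde:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch of block-based processing: rather than compressing a whole
// serialized page at once, process it in fixed-size 64KB blocks.
public class BlockCodec
{
    static final int BLOCK_SIZE = 64 * 1024;

    static int blockCount(int serializedSize)
    {
        return (serializedSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    static byte[] compress(byte[] serializedPage)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] scratch = new byte[BLOCK_SIZE * 2];
        // Compress each fixed-size block independently
        for (int offset = 0; offset < serializedPage.length; offset += BLOCK_SIZE) {
            int length = Math.min(BLOCK_SIZE, serializedPage.length - offset);
            Deflater deflater = new Deflater();
            deflater.setInput(serializedPage, offset, length);
            deflater.finish();
            while (!deflater.finished()) {
                int written = deflater.deflate(scratch);
                out.write(scratch, 0, written);
            }
            deflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args)
    {
        System.out.println(blockCount(5 * 1024 * 1024) + " blocks for a 5MB page"); // 80 blocks
    }
}
```

Working per block bounds the scratch space needed for compression and encryption, which is how the refactoring avoids the extra allocations for jumbo (>4MB) pages.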
4b957bf to b4352b2
Description
Encryption is more efficient when done as part of the page serialization process, as it avoids unnecessary allocations and memory copies.
Non-technical explanation
N/A
Release notes
( ) This is not user-visible or docs only and no release notes are required.
(X) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: