Move exchange encryption for FTE to the engine #14941

Merged
arhimondr merged 6 commits into trinodb:master from arhimondr:exchange-encryption
Nov 14, 2022

Conversation

@arhimondr
Contributor

Description

Encryption is more efficient when done as part of the page serialization process, as it avoids unnecessary allocations and memory copies.

Non-technical explanation

N/A

Release notes

( ) This is not user-visible or docs only and no release notes are required.
(X) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Nov 7, 2022
@arhimondr arhimondr force-pushed the exchange-encryption branch 2 times, most recently from cf58703 to 861bf7d Compare November 7, 2022 22:54
Member

I do not think we should enforce that. It should be configurable, but enabled by default. I can easily think of a deployment scenario when you do not care about encryption at all if storage is not really "external"

Contributor Author

Agree, let me add an FTE-specific property. I would prefer not to make this property generic, though, due to non-obvious interactions between enabling HTTPS in the cluster and the exchange encryption.

Member

Nit: the comment is not 100% accurate any more (it is no longer mandatory).

Comment on lines 475 to 476
Member

this probably should also be configurable (with defaulting to plain http), to enable different deployment scenarios.

Contributor Author

I thought about it, but then it becomes unclear what should happen when the endpoint is explicitly specified as https but SSL is disabled with a property. Thus I decided to keep it simple and default to HTTP when the endpoint is not explicitly specified. Whenever HTTPS is desired, the end user can point the system to an HTTPS endpoint.

Member

I guess this is fine.

Member

I am not familiar with this test. Just curious whether we are losing some coverage with this change.

Contributor Author

The goal of this test is to ensure that the file size limit is properly applied at the connector level. Having a fixed number of tasks makes it more predictable while still validating the intended behaviour.

Member

Check that the slice has a byte array.

Member

(are we sure that this is always the case, actually?)

Contributor Author

Yeah, it should always be the case as of today. I'm going to add a check.

Member

Are we ensuring somehow that source.getSlice() is at least blockSize long?

Member

Oh, I think I misread. This is not a temporary space, but something explicitly written when encrypting.
Still, it does not seem to me we are verifying that source.getSlice() looks valid (is long enough).

Contributor Author

I wanted to avoid additional checks for efficiency reasons. The idea is that if something is off it will fail further down the stack (which is less readable, but probably a reasonable price to pay for eliminating extra checks; then again, maybe I'm over-optimizing).

Member

Do you need to return bytesPreserved? Isn't it just sink.available()?

Contributor Author

The idea is to preserve the bytes that haven't been read (for example, if fewer than 8 bytes are available and a single long is being read). This method returns the number of bytes preserved to offset the write index. It felt like it might be easier to follow this way, but yeah, it can also be implemented as:

int bytesPreserved = sink.available();
sink.rollOver();

I don't have a strong opinion here.
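
For illustration, a toy version of the rollOver idea being discussed (hypothetical names, not the actual PagesSerde code): unread bytes are copied to the front of the buffer so a partially available value survives a refill.

```java
public class RollOverSketch {
    final byte[] buffer = new byte[16];
    int readIndex;
    int writeIndex;

    int available() {
        return writeIndex - readIndex;
    }

    // Moves the unread tail to the front of the buffer and returns the number
    // of bytes preserved, so the caller can offset the next write index.
    int rollOver() {
        int bytesPreserved = available();
        System.arraycopy(buffer, readIndex, buffer, 0, bytesPreserved);
        readIndex = 0;
        writeIndex = bytesPreserved;
        return bytesPreserved;
    }

    public static void main(String[] args) {
        RollOverSketch sink = new RollOverSketch();
        sink.writeIndex = 16; // buffer fully filled
        sink.readIndex = 13;  // 3 unread bytes remain, too few to read a long
        System.out.println(sink.rollOver()); // 3 bytes preserved at the front
    }
}
```

As discussed above, the same value is recoverable as sink.available() immediately after the roll-over, so returning it is purely a readability choice.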

Member

Neither do I.

Member

boundary checks?

Contributor Author

Same here

Member

@losipiuk losipiuk left a comment

looks good. Some comments but nothing major. It feels like test coverage is good.

@arhimondr arhimondr force-pushed the exchange-encryption branch from 861bf7d to e928149 Compare November 8, 2022 18:37
@arhimondr
Contributor Author

Updated

Member

I like the shorter version of the property, but it would kind of be the first time we use acronyms in config. On the other hand, the fault-tolerant-execution- prefix is super annoying.

@martint do you think it is ok to use fte- here (and rename other properties this way)? I thought about alternatives, but naming is hard.

Member

I'd prefer not to use acronyms. It's not a big deal if the name is long (alternatively, we could try to come up with a shorter, more descriptive name). However, it shouldn't be a prefix such that properties are all named fault-tolerant-execution-xxxx. Instead, we should turn it into a hierarchical property: fault-tolerant-execution.xxx (note the .)

@losipiuk
Member

losipiuk commented Nov 9, 2022

One more question. Can the page size grow now compared to what it used to be (due to encryption)? If so, we may run into the problem coming from the default limit on HTTP response size imposed by the airlift HTTP server. I think it is 16MB by default.

@arhimondr
Contributor Author

Benchmarks

+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+
| base_cpu_time_millis | base_wall_time_millis | test_cpu_time_millis | test_wall_time_millis | cpu_diff | wall_diff |
+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+
|          18184316267 |              77292978 |          18135469806 |              76886882 |  0.99731 |   0.99475 |
+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+

+-------------------------------+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+
| suite                         | base_cpu_time_millis | base_wall_time_millis | test_cpu_time_millis | test_wall_time_millis | cpu_diff | wall_diff |
+-------------------------------+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+
| tpcds_sf10000_partitioned     |           2151027639 |              10666623 |           2068382944 |               9833001 |  0.96158 |   0.92185 |
| tpcds_sf10000_partitioned_etl |          12575752138 |              45907876 |          12553147695 |              45802726 |  0.99820 |   0.99771 |
| tpcds_sf100_partitioned       |             25285611 |               1298684 |             24391767 |               1155394 |  0.96465 |   0.88967 |
| tpcds_sf100_partitioned_etl   |            112666635 |               5309658 |            117064847 |               5243526 |  1.03904 |   0.98754 |
| tpch_sf10000_bucketed         |            833789592 |               3341612 |            811495718 |               3408108 |  0.97326 |   1.01990 |
| tpch_sf10000_bucketed_etl     |           2459310174 |               9295363 |           2533861412 |               9857522 |  1.03031 |   1.06048 |
| tpch_sf100_bucketed           |              6506475 |                168296 |              5445624 |                152070 |  0.83695 |   0.90359 |
| tpch_sf100_bucketed_etl       |             19978003 |               1304866 |             21679799 |               1434535 |  1.08518 |   1.09937 |
+-------------------------------+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+


Details

https://gist.github.com/arhimondr/b200e1d76752a2dc0f1063ab13863dcf

@arhimondr
Contributor Author

arhimondr commented Nov 9, 2022

Can the page size grow now compared to what it used to be (due to encryption)

It can grow by 256 bits (plaintext IV) + 248 bits (max padding) + 256 bits (encrypted IV for verification) + 32 bits (encrypted length), for a total of 99 bytes per block (64kB by default), which should in theory result in a ~0.15% increase in page size. So in theory there's a tiny chance of pushing a query over the limit.
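
The arithmetic above can be double-checked with a short calculation (the class and method names here are illustrative; the sizes are taken from the comment above):

```java
public class EncryptionOverhead {
    // Per-block overhead, in bytes, from the four fields listed above.
    static int overheadBytes() {
        int plaintextIv = 256 / 8;    // 32 bytes: IV written in plain text
        int maxPadding = 248 / 8;     // 31 bytes: worst-case padding
        int encryptedIv = 256 / 8;    // 32 bytes: IV re-encrypted for verification
        int encryptedLength = 32 / 8; //  4 bytes: encrypted length field
        return plaintextIv + maxPadding + encryptedIv + encryptedLength;
    }

    public static void main(String[] args) {
        int blockSize = 64 * 1024; // default encryption block size
        System.out.println(overheadBytes()); // 99
        System.out.printf("%.2f%%%n", 100.0 * overheadBytes() / blockSize); // 0.15%
    }
}
```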

If so we may run into the problem coming from the default limit on HTTP response size imposed by airlift http server

We don't enable encryption for pipelined execution. However, in fault-tolerant execution there's also a limit on max page size. But in fault-tolerant execution we also enable compression, which makes the limits even more "obscure".

@findepi findepi removed their request for review November 9, 2022 14:28
Member

Could swapping the source and sink buffers achieve the same goal without the need to copy the contents?

Contributor Author

In theory it should be possible, but only when both compression and encryption are enabled. When encryption is disabled, the sink buffer is the write buffer of a serialized page, usually much larger than a temporary buffer used as an intermediate. Given how expensive compression and encryption are (10x decrease in throughput for serialize and 5x decrease in throughput for deserialize), I'm not sure how much is to be gained here by avoiding a copy in this specific case.

Member

Small note for more potential improvements to serialization performance:

You could add a public method writeInts(int[] values, int offset, int length) for block encoders to use and implement it using writeBytes(Slices.wrappedIntArray(…))

(as well as other primitive array types)

Contributor Author

Currently the write[primitive] methods are only used for writing length prefixes before writing the actual data. Data stored in int[]/long[]/... arrays is written through write(Slice, ...), which is performed by an unsafe memory copy.
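
To illustrate the difference being discussed, here is a sketch using plain ByteBuffer rather than the actual Slice/SliceOutput API: writing an int[] value by value versus as one contiguous copy, which is roughly what write(Slice, ...) achieves with an unsafe memory copy.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class BulkWriteSketch {
    // Writes the values one int at a time, with per-value bounds checks.
    static byte[] writePerValue(int[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int value : values) {
            buffer.putInt(value);
        }
        return buffer.array();
    }

    // Writes the values as one contiguous copy, analogous to wrapping the
    // array in a Slice and handing it to write(Slice, ...).
    static byte[] writeBulk(int[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        buffer.asIntBuffer().put(values);
        return buffer.array();
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4};
        // Both strategies produce identical bytes; the bulk form avoids
        // per-value method calls and bounds checks.
        System.out.println(Arrays.equals(writePerValue(values), writeBulk(values))); // true
    }
}
```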

Member

What if we don’t have any pending contents in the output buffer? Seems like encryption might force out a new block with the IV and padding.

Contributor Author

Good catch. Let me add a check.

Contributor Author

Actually, it looks like the source buffer should never be empty at this point unless the serializer writes nothing (which is not possible today, as the number of channels is always written unconditionally, even for pages with 0 rows and 0 columns).

Assuming the serializer always writes something, the buffer should always be non-empty at this point, as the buffer is only flushed when it is not possible to write more. For example, if the available space is 1 byte and 1 byte is about to be written, it will be added to the buffer and a flush won't be triggered immediately; instead the buffer will be flushed on close.

Member

Looks like this is already added to maxEncryptedSize

Contributor Author

The IV block is actually written twice: once in plain text explicitly (needed to initialize the Cipher for decryption) and once encrypted implicitly by the Cipher to perform decryption validation. The size of the encrypted IV written implicitly by the Cipher is accounted for in maxEncryptedSize, while the size of the plaintext IV is not (as in theory it could be sent over an alternative communication channel).

Member

Gotcha, I didn't notice that in the logic while reviewing earlier. The extra IV for validation wouldn't be necessary if we switched to AES/GCM, which would also eagerly detect ciphertext corruption without a separate output checksumming scheme, but performance is, I think, just slightly worse than CBC last I checked.
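
As a small illustration of the AES/GCM point: with GCM the authentication tag makes corrupted ciphertext fail eagerly at decryption time, so no separate validation IV or checksum is needed. This sketch uses only the standard javax.crypto API and is not Trino's actual encryption code:

```java
import java.security.SecureRandom;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class GcmSketch {
    // Encrypts a block, corrupts one ciphertext byte, and returns true if
    // decryption detects the corruption via the GCM authentication tag.
    static boolean detectsCorruption() throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        byte[] iv = new byte[12]; // 96-bit IV, the recommended size for GCM
        new SecureRandom().nextBytes(iv);

        Cipher encrypt = Cipher.getInstance("AES/GCM/NoPadding");
        encrypt.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = encrypt.doFinal("serialized page block".getBytes());

        ciphertext[0] ^= 1; // simulate ciphertext corruption in transit

        Cipher decrypt = Cipher.getInstance("AES/GCM/NoPadding");
        decrypt.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        try {
            decrypt.doFinal(ciphertext);
            return false; // corruption went undetected
        }
        catch (AEADBadTagException e) {
            return true; // tag verification failed eagerly, as expected
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detectsCorruption() ? "corruption detected" : "not detected");
    }
}
```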

@arhimondr arhimondr force-pushed the exchange-encryption branch 2 times, most recently from 9acbf55 to 4b957bf Compare November 10, 2022 21:00
@arhimondr
Contributor Author

Micro benchmarks:

Before:

Benchmark                        (compressed)  (encrypted)  (randomSeed)   Mode  Cnt     Score    Error  Units
BenchmarkPagesSerde.deserialize          true         true          1000  thrpt   10  1194.138 ±  7.442  ops/s
BenchmarkPagesSerde.deserialize          true        false          1000  thrpt   10  2281.985 ±  8.634  ops/s
BenchmarkPagesSerde.deserialize         false         true          1000  thrpt   10  1416.509 ±  2.928  ops/s
BenchmarkPagesSerde.deserialize         false        false          1000  thrpt   10  5891.232 ± 14.932  ops/s
BenchmarkPagesSerde.serialize            true         true          1000  thrpt   10   312.647 ±  0.800  ops/s
BenchmarkPagesSerde.serialize            true        false          1000  thrpt   10   601.053 ±  1.522  ops/s
BenchmarkPagesSerde.serialize           false         true          1000  thrpt   10   451.260 ±  0.996  ops/s
BenchmarkPagesSerde.serialize           false        false          1000  thrpt   10  3446.756 ±  6.949  ops/s

After:

Benchmark                        (compressed)  (encrypted)  (randomSeed)   Mode  Cnt     Score    Error  Units
BenchmarkPagesSerde.deserialize          true         true          1000  thrpt   10  1732.440 ±  4.044  ops/s
BenchmarkPagesSerde.deserialize          true        false          1000  thrpt   10  3059.334 ±  5.412  ops/s
BenchmarkPagesSerde.deserialize         false         true          1000  thrpt   10  2093.146 ±  3.505  ops/s
BenchmarkPagesSerde.deserialize         false        false          1000  thrpt   10  5906.148 ± 13.923  ops/s
BenchmarkPagesSerde.serialize            true         true          1000  thrpt   10   329.150 ±  1.269  ops/s
BenchmarkPagesSerde.serialize            true        false          1000  thrpt   10   617.507 ±  1.808  ops/s
BenchmarkPagesSerde.serialize           false         true          1000  thrpt   10   493.182 ±  0.687  ops/s
BenchmarkPagesSerde.serialize           false        false          1000  thrpt   10  3816.939 ±  7.984  ops/s

Encryption is now done by the engine

The data for exchange is encrypted by the engine. If encrypted
communication is preferred, it can be enabled by specifying an
HTTPS-based endpoint explicitly.

Refactor PagesSerde to avoid:

- Unnecessary memory copies
- Unnecessary Cipher initialization
- Unnecessary allocations when jumbo pages (>4MB) are serialized

This is done by implementing block-based encryption and compression.
Instead of trying to encrypt/compress an entire page, a page in
serialized form is split into multiple fixed-size blocks (64kB by
default).
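
A minimal sketch of the block-splitting idea described above (names and framing are hypothetical, not the actual PagesSerde implementation); each fixed-size block would then be compressed and/or encrypted independently instead of processing the whole page at once:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockSplitSketch {
    static final int BLOCK_SIZE = 64 * 1024; // default block size (64kB)

    // Splits a serialized page into fixed-size blocks; the last block may be
    // shorter. Working block by block bounds the size of temporary buffers
    // even for jumbo pages.
    static List<byte[]> splitIntoBlocks(byte[] serializedPage) {
        List<byte[]> blocks = new ArrayList<>();
        for (int offset = 0; offset < serializedPage.length; offset += BLOCK_SIZE) {
            int end = Math.min(offset + BLOCK_SIZE, serializedPage.length);
            blocks.add(Arrays.copyOfRange(serializedPage, offset, end));
        }
        return blocks;
    }

    public static void main(String[] args) {
        byte[] jumboPage = new byte[5 * 1024 * 1024]; // a "jumbo" page (>4MB)
        System.out.println(splitIntoBlocks(jumboPage).size()); // 80 blocks of 64kB
    }
}
```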

Benchmark results

Before:

```
Benchmark                        (compressed)  (encrypted)  (randomSeed)   Mode  Cnt     Score    Error  Units
BenchmarkPagesSerde.deserialize          true         true          1000  thrpt   10  1194.138 ±  7.442  ops/s
BenchmarkPagesSerde.deserialize          true        false          1000  thrpt   10  2281.985 ±  8.634  ops/s
BenchmarkPagesSerde.deserialize         false         true          1000  thrpt   10  1416.509 ±  2.928  ops/s
BenchmarkPagesSerde.deserialize         false        false          1000  thrpt   10  5891.232 ± 14.932  ops/s
BenchmarkPagesSerde.serialize            true         true          1000  thrpt   10   312.647 ±  0.800  ops/s
BenchmarkPagesSerde.serialize            true        false          1000  thrpt   10   601.053 ±  1.522  ops/s
BenchmarkPagesSerde.serialize           false         true          1000  thrpt   10   451.260 ±  0.996  ops/s
BenchmarkPagesSerde.serialize           false        false          1000  thrpt   10  3446.756 ±  6.949  ops/s
```

After:

```
Benchmark                        (compressed)  (encrypted)  (randomSeed)   Mode  Cnt     Score    Error  Units
BenchmarkPagesSerde.deserialize          true         true          1000  thrpt   10  1732.440 ±  4.044  ops/s
BenchmarkPagesSerde.deserialize          true        false          1000  thrpt   10  3059.334 ±  5.412  ops/s
BenchmarkPagesSerde.deserialize         false         true          1000  thrpt   10  2093.146 ±  3.505  ops/s
BenchmarkPagesSerde.deserialize         false        false          1000  thrpt   10  5906.148 ± 13.923  ops/s
BenchmarkPagesSerde.serialize            true         true          1000  thrpt   10   329.150 ±  1.269  ops/s
BenchmarkPagesSerde.serialize            true        false          1000  thrpt   10   617.507 ±  1.808  ops/s
BenchmarkPagesSerde.serialize           false         true          1000  thrpt   10   493.182 ±  0.687  ops/s
BenchmarkPagesSerde.serialize           false        false          1000  thrpt   10  3816.939 ±  7.984  ops/s
```
@arhimondr arhimondr merged commit 8262e35 into trinodb:master Nov 14, 2022
@arhimondr arhimondr deleted the exchange-encryption branch November 14, 2022 18:56
@github-actions github-actions bot added this to the 403 milestone Nov 14, 2022

4 participants