
Add compression for uploaded documentation #780

Merged (15 commits into rust-lang:master) on Jun 11, 2020
Conversation

@jyn514 (Member) commented May 28, 2020

This needs some tests, but otherwise I think it should be fine as is; it was super simple after #643 :)

This uses zstd level 5, but it's very easy to change the compression algorithm now and reasonably easy to change it after it's merged (we'd just have to change compression from a boolean into an enum and continue supporting zstd).

This compresses transparently, so calling backend.get() automatically decompresses files as needed. Files that were not compressed before uploading are not decompressed when downloading.
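For illustration, a minimal sketch of that transparent scheme, assuming the zstd crate; the names (`CompressionAlgorithm`, `compress`, `decompress`) are illustrative, not the PR's actual API:

```rust
use std::io;

/// Illustrative algorithm tag; the PR discusses turning the compression
/// boolean into an enum like this so other algorithms can be added later.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CompressionAlgorithm {
    Zstd,
}

/// Compress a blob before uploading (zstd level 5, as in the initial PR).
fn compress(content: &[u8]) -> io::Result<Vec<u8>> {
    zstd::encode_all(content, 5)
}

/// Transparently decompress on get(): blobs stored without compression
/// are passed through unchanged.
fn decompress(content: Vec<u8>, algorithm: Option<CompressionAlgorithm>) -> io::Result<Vec<u8>> {
    match algorithm {
        Some(CompressionAlgorithm::Zstd) => zstd::decode_all(content.as_slice()),
        None => Ok(content),
    }
}
```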

Closes #379

cc @namibj
r? @pietroalbini

@jyn514 (Member, Author) commented May 28, 2020

Oh, I just realized this has the wrong content types on S3: it uses the original content types instead of something like zstd. I'll add the original content type as a field in the metadata.

@jyn514 (Member, Author) commented May 28, 2020

Actually, that seems like a lot of work for no benefit; I don't know that we'll ever actually use the zstd encoding for anything.

@jyn514 (Member, Author) commented May 28, 2020

Some benchmarks:

```
compress regex html     time:   [676.22 us 725.12 us 779.81 us]
Found 16 outliers among 100 measurements (16.00%)
  16 (16.00%) high severe

decompress regex html   time:   [55.181 us 57.623 us 60.467 us]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```
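The output above is Criterion output; for context, a sketch of the kind of benchmark that could produce it, assuming the compression is zstd level 5 over a saved rustdoc page for the regex crate (the fixture path is a placeholder, not from the PR):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Benchmark compressing and decompressing a single rustdoc HTML page.
fn compression_benches(c: &mut Criterion) {
    // Hypothetical fixture: a saved rustdoc page from the regex crate.
    let html = std::fs::read("benches/regex.index.html").expect("sample page");
    let compressed = zstd::encode_all(html.as_slice(), 5).unwrap();

    c.bench_function("compress regex html", |b| {
        b.iter(|| zstd::encode_all(black_box(html.as_slice()), 5).unwrap())
    });
    c.bench_function("decompress regex html", |b| {
        b.iter(|| zstd::decode_all(black_box(compressed.as_slice())).unwrap())
    });
}

criterion_group!(benches, compression_benches);
criterion_main!(benches);
```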

(Resolved review threads on src/db/migrate.rs and src/storage/s3.rs)
@pietroalbini (Member):
The implementation looks good to me!

I agree with @Nemo157 that we should store {algorithm} instead of just compressed, and that we should use S3's native Content-Encoding to store it instead of custom metadata.

Also, based on #379 (comment), the two options I was considering were zstd 9 or brotli 5, leaning towards brotli as it uses 10% less disk space in that benchmark. You're using zstd 5 in the PR.
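To make the Content-Encoding idea concrete, a sketch assuming an async rusoto_s3 client; the bucket name and function are placeholders, not the PR's actual code:

```rust
use rusoto_s3::{PutObjectRequest, S3, S3Client};

// Attach the compression algorithm via S3's native Content-Encoding header
// instead of custom metadata.
async fn upload_compressed(
    client: &S3Client,
    key: String,
    body: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    client
        .put_object(PutObjectRequest {
            bucket: "docs-rs-example-bucket".to_string(), // placeholder bucket
            key,
            body: Some(body.into()),
            content_encoding: Some("zstd".to_string()), // records the algorithm
            ..Default::default()
        })
        .await?;
    Ok(())
}
```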

@Nemo157 (Member) commented May 28, 2020

The other option that @namibj was investigating was zstd with a custom dictionary optimized for rustdoc output. I can probably adapt the linked benchmarking script to test that as well, with a dictionary created from the docs I have.

@jyn514 (Member, Author) commented May 28, 2020

Oops, the benchmarks before got cut off. Here are some taken with a desktop instead of a laptop:

```
compress regex html     time:   [715.28 us 715.74 us 716.27 us]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

decompress regex html   time:   [37.410 us 37.438 us 37.470 us]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
```

@Nemo157 (Member) commented May 29, 2020

Another idea might be to store a set of all the encodings used for a particular release in the release table, so we can easily look up basic stats on compression usage without needing to go to S3. (A set, to allow different dictionaries based on file type to be recorded.)

@jyn514 (Member, Author) commented May 29, 2020

> store a set of all the encodings used for a particular release in the release table

This sounds like we'd only use it for metrics? I'm not opposed in principle but storing the same data in two places risks it getting out of sync.

@Kixiron (Member) commented May 30, 2020

Metrics don't need to be a source of complexity; we can just as easily have an IntCounter that we increment with the total bytes saved by each build (plus additional counters if we'd like).
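A sketch of that counter, assuming the prometheus crate; the metric name is hypothetical:

```rust
use prometheus::{IntCounter, Registry};

// Register a single running total of bytes saved by compression,
// incremented once per build.
fn register_bytes_saved(registry: &Registry) -> prometheus::Result<IntCounter> {
    let counter = IntCounter::new(
        "docsrs_compression_bytes_saved_total", // hypothetical metric name
        "Total bytes saved by compressing uploaded documentation",
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}

// After each build, something like:
// bytes_saved.inc_by(uncompressed_len - compressed_len);
```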

@Nemo157 (Member) commented May 31, 2020

It's not really for metrics like what goes in Grafana, more for future visibility into the usage distribution. Especially if we migrate dictionaries in the future, knowing how many releases use which dictionaries could help determine whether it's feasible to do something like migrate all instances of a certain dictionary to a new one so it can be removed from the application.

Given that releases are basically read-only currently, I don't think there's much chance of this data getting out of sync.

@Kixiron (Member) commented May 31, 2020

How would that interact with us deleting docs in the future (if we ever decided to do that)?

@jyn514 (Member, Author) commented May 31, 2020

> How would that interact with us deleting docs in the future (if we ever decided to do that)?

No effect; we'd just delete the rows from the database as well.

@jyn514 (Member, Author) commented May 31, 2020

> Another idea might be to store a set of all the encodings used for a particular release in the release table, so we can easily look up basic stats on compression usage without needing to go to S3. (A set, to allow different dictionaries based on file type to be recorded.)

@Nemo157 a possible issue with this is that it prevents using different compression algorithms for different files within a release. Does that sound worth it? Otherwise we'd have to store metadata for each file individually in the database, which sort of defeats the point of an S3 bucket.

@Nemo157 (Member) commented May 31, 2020

That's why I mention a set of encodings: we don't need to know exactly which files use which encoding, just that this set of encodings is enough to access all files in the release (and if we want to know in detail, we would have to query the metadata on all the files).

@jyn514 (Member, Author) commented Jun 5, 2020

I added the set of compression algorithms used as a many-to-many table in the database, with a unique constraint so we don't end up with duplicates by accident.

I'm open to suggestions for more tests, but otherwise I think I addressed all the comments.
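A minimal sketch of what that table could look like; the column types and the referenced releases table are assumptions, though the table name and unique constraint match the error quoted later in the thread:

```rust
// Could live in a migration in src/db/migrate.rs; types are assumptions.
const ADD_COMPRESSION_RELS: &str = "
    CREATE TABLE compression_rels (
        release INT NOT NULL REFERENCES releases(id),
        algorithm INT,
        UNIQUE (release, algorithm)
    );
";
```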

@jyn514 (Member, Author) commented Jun 5, 2020

As to using a custom zstd dictionary: it sounds like the Rust zstd crate needs some fairly major changes to make that performant, and I don't want to delay having any compression at all on that. If we want to use brotli instead for the storage savings, that sounds reasonable, but otherwise I think zstd 9 is fine.

@jyn514 changed the title from "[WIP] Add compression for uploaded documentation" to "Add compression for uploaded documentation" on Jun 5, 2020
(Resolved review thread on src/db/migrate.rs)
@Nemo157 (Member) commented Jun 5, 2020

I think it definitely makes sense to start with zstd now and add custom dictionary support later. Going with brotli now and moving to zstd custom dictionary later would require keeping both libraries around indefinitely.

Doing it in two steps also gives us the chance to see any performance impacts of each step.

One last thought I had was whether there should be a config var to disable compression, so if there are any issues in production that can just be toggled rather than having to do a rollback.

@jyn514 (Member, Author) commented Jun 5, 2020

> One last thought I had was whether there should be a config var to disable compression, so if there are any issues in production that can just be toggled rather than having to do a rollback.

It doesn't take very long to revert, about 5 minutes. In general I like the idea of having feature gates instead of "it's compiled in or it's not", but I'm a little wary of toggling features with environment variables; I'd rather come up with something more principled.

(Resolved review thread on src/storage/mod.rs)
Commits pushed:
- Don't require writing a database migration
- Make tests not compile if they aren't exhaustive
@jyn514 (Member, Author) commented Jun 11, 2020

Let me know if there are any more changes that need to be made; I think I addressed all the review comments.

@jyn514 (Member, Author) commented Jun 11, 2020

I just found that this will prevent us from updating builds if we rebuild a crate:

```
thread 'main' panicked at 'Building documentation failed: Error(Db(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState("23505"), message: "duplicate key value violates unique constraint \"compression_rels_release_algorithm_key\"", detail: Some("Key (release, algorithm)=(1, 0) already exists."), hint: None, position: None, where_: None, schema: Some("public"), table: Some("compression_rels"), column: None, datatype: None, constraint: Some("compression_rels_release_algorithm_key"), file: Some("nbtinsert.c"), line: Some(434), routine: Some("_bt_check_unique") }))', src/bin/cratesfyi.rs:321:21
```

I think I need to add an ON CONFLICT DO UPDATE clause.

Previously, postgres would give a duplicate key error when a crate was
built more than once:

```
thread 'main' panicked at 'Building documentation failed: Error(Db(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState("23505"), message: "duplicate key value violates unique constraint \"compression_rels_release_algorithm_key\"", detail: Some("Key (release, algorithm)=(1, 0) already exists."), hint: None, position: None, where_: None, schema: Some("public"), table: Some("compression_rels"), column: None, datatype: None, constraint: Some("compression_rels_release_algorithm_key"), file: Some("nbtinsert.c"), line: Some(434), routine: Some("_bt_check_unique") }))', src/bin/cratesfyi.rs:321:21
```

Now, duplicate keys are discarded.
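A sketch of the kind of insert that discards the duplicates, assuming the postgres crate; the exact statement in the PR may differ:

```rust
use postgres::Client;

// Record a (release, algorithm) pair, silently skipping rows that would
// violate the unique constraint instead of erroring.
fn record_compression(
    conn: &mut Client,
    release_id: i32,
    algorithm: i32,
) -> Result<(), postgres::Error> {
    conn.execute(
        "INSERT INTO compression_rels (release, algorithm)
         VALUES ($1, $2)
         ON CONFLICT DO NOTHING",
        &[&release_id, &algorithm],
    )?;
    Ok(())
}
```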
@jyn514 merged commit 8413226 into rust-lang:master on Jun 11, 2020
@jyn514 deleted the compression branch on June 11, 2020 at 21:18