Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compressed storage for rustdoc- and source-files #1342

Merged
merged 1 commit into from
Sep 12, 2021

Conversation

syphar
Copy link
Member

@syphar syphar commented Apr 4, 2021

I worked a little on the compressed storage topic, and this is a first design-draft.
It's working, so you can run builds, see the archives, and also click through rustdoc- or source-pages.

Before I'm finishing this up, I would love to hear your ideas / your feedback, so I know I'm going the right direction.

I squashed the commits since in reality I was jumping back and forth between topics all the time, so the commits wouldn't help you. So when you look at the code, the best way to understand what I thought is going in this order:

  1. new BZip2 compression support ( mostly in storage::compression + tests + bench)
  2. S3 & database storage backends: support for fetching the range of an object and decompress the range (mostly in storage::s3, storage::file, also in storage::Storage.get_range, some tests in storage::tests)
  3. an archive-index in storage::archive_index which supports storing file-names, their ranges and compression algorithms. Can be created based on zip, but is flexible if we want to use different formats later, or mix them.
  4. storage::Storage.get_index_for handles fetching the index for a certain archive-path, also creates a local cache for these index files. .exists queries to storage can be answered from that cache.
  5. storage::Storage.store_all_in_archive will take a folder and store it in a zip-file together with a new index (local and remote).
  6. storage::Storage.get_from_archive fetches a remote file from an archive. when the index is not cached, it is fetched first, then we only fetch and decompress the data.
  7. storage::Storage.exists_in_archive for exist-queries. When the index is local, without request to S3
  8. docbuilder::rustwide_builder starts creating two archives per release instead of uploading folders (one for the sources, one for the doc-build output). It also stores a bool into the release-table so we know the web handlers should look for archives
  9. web::rustdoc::rustdoc_html_server_handler can fetch files from the archive if the release data stated archive_storage.
  10. web::source::source_browser_handler also fetches files from the archive. Currently does a database fetch to know when to use archive-queries and when not. This could be changed to only checking for the local index cache, if we can guarantee that.

My general thoughts:

  • in the beginning I was actually thinking about using the zip central directory as the index, but that lead to some complexity (while keeping the file hashes). Here in this project the custom index, integrated with the existing compression logic made more sense.
  • as the sources and doc-output was separate before, I left it separate
  • I chose archives per release, because they won't have to be updated, only rebuilt
  • the index-format should be flexible enough to support future updates
  • I assumed we can trade local storage for a little performance by storing the indexes locally too
  • I'm not sure what the idea behind the compression-algorithm storage is, it looks like purely for statistics. I tried to insert similar data too for the archives.
  • I assumed the date_updated inside an archive is just the date_updated of the archive itself

IMHO everything around storage, compression and the index is in a good place, perhaps only needs some small tests around the errors that can happen.

Potential next steps / things I think have to be finished:

  • try CBOR and/or local compression of the index
  • research needed storage space for local indexes
  • depending on the result, add LRU cleanup with size limit in the cache (probably for all files, not only the index)
  • finishing up the rustdoc page, fixing the redirect (let's see how much I have to change for that, there is quite a mixup between remote (S3) paths and request paths in there).
  • add the used storage technique as a config var, so we can easily test the build with them.
  • tests for the archive-build
  • tests for source-handler and rustdoc changes

Questions/unsure:

  • I'm not sure about where to put the new archive_storage attribute in the mix between MetaData and CrateDetails.
  • should we use a totally separate path for the new archives? That would make cleanup easier
  • do we want to store/control in more detail where to use archives and where not?
  • should I clear the destination folders before uploading the archives? so a rebuild would replace the separate-file storage with the archives
  • should we use another storage format for the index? ( I saw CBOR being used in oubliette)
  • do we want to safeguard against problems by adding a data-hash to the index? (like CRC32 inside zip files)
  • do you see value in adding versioning to the index? (so when the format changes in a backwards incompatible way)
  • should we add some LRU cleanup for the index? Or can we assume that we do have enough storage for all of them? (could be added later too)
  • do you see other things we need to add for operating this? command line tools?
  • are there any metrics you would like to see?

Potential next steps after this PR here:

  • we could replace the releases.files column with just getting and using the index
  • optionally clean remote storage for a version when rebuilding a crate
  • small archives could be downloaded and used locally
  • start rebuilding the newest version for all creates ( probably after stop building crates on all targets #343 ?)
  • at some point we could do Downloadable docs #174 (if this seems stable enough)).
  • remove unused local index files to safe storage (potentially LRU cache)

Fixes #1004

@jyn514
Copy link
Member

jyn514 commented Apr 4, 2021

an archive-index in storage::archive_index which supports storing file-names, their ranges and compression algorithms. Can be created based on zip, but is flexible if we want to use different formats later, or mix them.

I think the priority should be access and atomic updates, updates don't have to be fast. Using a zipfile makes me nervous the updates aren't atomic - I haven't read the code, do you know off the top of your head what happens if a request comes in while you're updating the archive?

web::source::source_browser_handler also fetches files from the archive. Currently does a database fetch to know when to use archive-queries and when not. This could be changed to only checking for the local index cache, if we can guarantee that.

What do you mean by "guarantee that"? Why would checking the local cache be faster than a database access?

in the beginning I was actually thinking about using the zip central directory as the index, but that lead to some complexity (while keeping the file hashes). Here in this project the custom index, integrated with the existing compression logic made more sense.

What's the difference between the zip central directory and the index? Where are each stored?

as the sources and doc-output was separate before, I left it separate
I chose archives per release, because they won't have to be updated, only rebuilt
I assumed the date_updated inside an archive is just the date_updated of the archive itself

👍, I don't actually know what we use date_updated for but that seems fine.

I assumed we can trade local storage for a little performance by storing the indexes locally too

We've been running into storage issues lately, but we also have very little storage on the prod machine - @pietroalbini would it be feasible to upgrade the storage from 100 GB to 150/200 or so? That should be enough.

@syphar how much storage are we talking about? Anything less than a GB is probably fine and doesn't need changes to the prod machine.

should I clear the destination folders before uploading the archives? so a rebuild would replace the separate-file storage with the archives

Hmm, my only worry is that if the rebuild goes wrong somehow we'll lose the old docs. Since we build on nightly it may be hard to replicate the old builds, nightly is allowed to have breaking changes to unstable features. Maybe this could be optional? No need to implement it for now if it's a hassle.

do you see value in adding versioning to the index? (so when the format changes in a backwards incompatible way)

Are we using a custom format for the index?

do we want to safeguard against problems by adding a data-hash to the index? (like CRC32 inside zip files)

Hmm - what benefit do you see from doing that? We don't have checksums currently, right?

I have more questions but I want to skim the diff first :)

Copy link
Member

@jyn514 jyn514 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an archive-index in storage::archive_index which supports storing file-names, their ranges and compression algorithms. Can be created based on zip, but is flexible if we want to use different formats later, or mix them.

I think the priority should be access and atomic updates, updates don't have to be fast. Using a zipfile makes me nervous the updates aren't atomic - I haven't read the code, do you know off the top of your head what happens if a request comes in while you're updating the archive?

I misunderstood this the first time - the archive is per-release, so it will never be updated, only replaced.

how much storage are we talking about? Anything less than a GB is probably fine and doesn't need changes to the prod machine.

I measured 949101 (~1 MB) for an index of stm32f4 and 2325 (~2 KB) for regex. So this would be between 250 MB and 90 GB for all crates (and much closer to the low end because stm32f4 is uncommonly large). I do think we should have an LRU cache if it ends up going much over 250 MB, but that will take a long time so I'm fine with not landing it in the initial draft.

do you see value in adding versioning to the index? (so when the format changes in a backwards incompatible way)

Are we using a custom format for the index?

I didn't realize we were compressing the index itself. I think storing the compression format seems useful, but it looks like we already do that?

cratesfyi=# select compression from files where path = 'rustdoc/regex/rustdoc-regex-1.4.0.zip.index';
 compression 
-------------
           0
(1 row)

src/db/add_package.rs Outdated Show resolved Hide resolved
src/docbuilder/rustwide_builder.rs Outdated Show resolved Hide resolved
src/storage/archive_index.rs Outdated Show resolved Hide resolved
src/storage/archive_index.rs Outdated Show resolved Hide resolved
src/storage/archive_index.rs Outdated Show resolved Hide resolved
src/storage/mod.rs Outdated Show resolved Hide resolved
src/storage/mod.rs Outdated Show resolved Hide resolved
src/web/rustdoc.rs Outdated Show resolved Hide resolved
src/web/source.rs Outdated Show resolved Hide resolved
src/web/source.rs Outdated Show resolved Hide resolved
@jyn514
Copy link
Member

jyn514 commented Apr 4, 2021

These seem like real regressions:

[2021-04-04T13:12:32Z TRACE docs_rs::web::page::templates] Finished loading templates
[2021-04-04T13:12:32Z INFO  docs_rs::web] Running docs.rs web server on http://127.0.0.1:33855
[2021-04-04T13:12:32Z DEBUG docs_rs::web::rustdoc] got error serving tokio/time: path not found
thread 'web::rustdoc::test::test_no_trailing_rustdoc_slash' panicked at '/tokio/0.2.21/tokio/time: expected redirect to /tokio/0.2.21/tokio/time/index.html, got redirect to /tokio/0.2.21/tokio/', src/test/mod.rs:75:13

thread 'web::rustdoc::test::go_to_latest_version_keeps_platform' panicked at 'assertion failed: `(left == right)`
  left: `"/crate/dummy/0.2.0/target-redirect/x86_64-unknown-linux-gnu/dummy/index.html"`,
 right: `"/crate/dummy/0.2.0/target-redirect/x86_64-pc-windows-msvc/dummy/index.html"`', src/web/rustdoc.rs:867:13

@syphar
Copy link
Member Author

syphar commented Apr 5, 2021

web::source::source_browser_handler also fetches files from the archive. Currently does a database fetch to know when to use archive-queries and when not. This could be changed to only checking for the local index cache, if we can guarantee that.

What do you mean by "guarantee that"? Why would checking the local cache be faster than a database access?

you're right, in my mind I'm always thinking about network roundtrips for databases, which we don't have here.

in the beginning I was actually thinking about using the zip central directory as the index, but that lead to some complexity (while keeping the file hashes). Here in this project the custom index, integrated with the existing compression logic made more sense.

What's the difference between the zip central directory and the index? Where are each stored?

The ZIP directory is just the end of the file. It contains all the filenames, and also the positions to fetch. I had a working prototype that had a kind of virtual buffer which prefetched the end of the archive and then the range for the file. After that worked, it seemed more complex than just using our own index, and decompressing the stream ourselves.

should I clear the destination folders before uploading the archives? so a rebuild would replace the separate-file storage with the archives

Hmm, my only worry is that if the rebuild goes wrong somehow we'll lose the old docs. Since we build on nightly it may be hard to replicate the old builds, nightly is allowed to have breaking changes to unstable features. Maybe this could be optional? No need to implement it for now if it's a hassle.

I'll add it to the after this PR list

do we want to safeguard against problems by adding a data-hash to the index? (like CRC32 inside zip files)

Hmm - what benefit do you see from doing that? We don't have checksums currently, right?

It would only be a safeguard against potential problems when fetching the ranges. My feeling would be that we're fine without.

I assumed we can trade local storage for a little performance by storing the indexes locally too

We've been running into storage issues lately, but we also have very little storage on the prod machine - @pietroalbini > would it be feasible to upgrade the storage from 100 GB to 150/200 or so? That should be enough.

how much storage are we talking about? Anything less than a GB is probably fine and doesn't need changes to the prod machine.

I measured 949101 (~1 MB) for an index of stm32f4 and 2325 (~2 KB) for regex. So this would be between 250 MB and 90 GB for all crates (and much closer to the low end because stm32f4 is uncommonly large). I do think we should have an LRU cache if it ends up going much over 250 MB, but that will take a long time so I'm fine with not landing it in the initial draft.

I'll experiment a little with using a different format (like CBOR), and/or also compressing the local index (currently it's plain text). Depending on that we can decide if the cleanup is a topic for now or after.

do you see value in adding versioning to the index? (so when the format changes in a backwards incompatible way)

Are we using a custom format for the index?

I didn't realize we were compressing the index itself. I think storing the compression format seems useful, but it looks like we already do that?

cratesfyi=# select compression from files where path = 'rustdoc/regex/rustdoc-regex-1.4.0.zip.index';
 compression 
-------------
           0
(1 row)

The compression on S3 is definitely already there, also in the content-encoding header on the remote storage. I was more talking about the format (json fields etc). But IMHO we don't really need it.

@syphar syphar added the S-waiting-on-author Status: This PR is incomplete or needs to address review comments label Apr 13, 2021
@syphar syphar changed the title (RFC) Start implementing compressed storage for rustdoc- and source-files compressed storage for rustdoc- and source-files Apr 13, 2021
@syphar syphar self-assigned this Apr 14, 2021
@syphar syphar added A-backend Area: Webserver backend A-builds Area: Building the documentation for a crate labels Apr 14, 2021
@syphar
Copy link
Member Author

syphar commented May 17, 2021

I could free a little time to continue on this PR.

assumed we can trade local storage for a little performance by storing the indexes locally too

We've been running into storage issues lately, but we also have very little storage on the prod machine - @pietroalbini > would it be feasible to upgrade the storage from 100 GB to 150/200 or so? That should be enough.
how much storage are we talking about? Anything less than a GB is probably fine and doesn't need changes to the prod machine.

I measured 949101 (~1 MB) for an index of stm32f4 and 2325 (~2 KB) for regex. So this would be between 250 MB and 90 GB for all crates (and much closer to the low end because stm32f4 is uncommonly large). I do think we should have an LRU cache if it ends up going much over 250 MB, but that will take a long time so I'm fine with not landing it in the initial draft.

I'll experiment a little with using a different format (like CBOR), and/or also compressing the local index (currently it's plain text). Depending on that we can decide if the cleanup is a topic for now or after.

I changed the format to CBOR which saved some storage.
I tested with the itertools crate (1590 files) and ended up with a local index of 26005 bytes. Compressing CBOR actually increases the size.

Coming from the size of the S3 bucket (~400 million files) we would end up having a maximum of ~6.2 GiB.
This assumes similar lengths of file-names across crates, and in the source-packages. Also we would have to substract the files for the nightly specific static files which are not in archives right now.

So I agree we should have a cleanup (LRU or something else) at some point, but we could add this later.

@syphar
Copy link
Member Author

syphar commented May 25, 2021

I just saw one thing I changed didn't compile :)
Will fix later. There is plenty of other things to improve anyways before this can have another review

@teohhanhui
Copy link

BZip2 compression support

Why not zstd?

@syphar
Copy link
Member Author

syphar commented Jun 9, 2021

BZip2 compression support

Why not zstd?

I wanted the archive to be a standard archive format which also supports range request. And for that there is only ZIP.

While inside the ZIP archive format we can use different compression algorithms, I wanted to chose an algorithm that is more widely supported by zip decompression tools (see facebook/zstd#1378 for ZIP itself supporting zstd).

BZip2 is far better than deflate, and seemed to be a good starting point.

In general the idea was also to directly reuse the archives for downloadable docs (see #174 ), though I still need to validate my assumptions in that direction.

There was already some work by @Nemo157 regarding an archive-format with zstd ( see https://github.com/Nemo157/oubliette ), but that would be a completely custom format.

@teohhanhui
Copy link

teohhanhui commented Jun 15, 2021

I wanted the archive to be a standard archive format which also supports range request. And for that there is only ZIP.

Found https://github.com/martinellimarco/t2sz from facebook/zstd#395

The compressed archive can be uncompressed with any Zstandard tool, including zstd.

The original genesis of that seems to be here: mxmlnkn/ratarmount#40

@syphar
Copy link
Member Author

syphar commented Jun 15, 2021

I wanted the archive to be a standard archive format which also supports range request. And for that there is only ZIP.

Found https://github.com/martinellimarco/t2sz from facebook/zstd#395

Please correct me if I'm wrong, but to my knowledge .tar.zstd doesn't support range requests. in the end we want to fetch only the part of the archive that represents the file that we want to show.

The compressed archive can be uncompressed with any Zstandard tool, including zstd.

The original genesis of that seems to be here: mxmlnkn/ratarmount#40

I'll dig into your links later, thank you for these. A quick look didn't show me how this would support range requests?

I assume we would use ratarmount / indexed_zstd and let it make range-requests to S3 after every seek.
Since zstd is not compressing by file, but only the whole tar-archive I assume we cannot just fetch the range where we find our file, but we also have to fetch some per-archive directory/wordlist.

Our goal is to keep the current page speed, which means we are limited to a maximum of one S3 request per rustdoc file to fetch.

@jyn514
Copy link
Member

jyn514 commented Jun 15, 2021

We also have the option to switch compression formats in the future if it turns out that our storage costs are growing faster than expected. But I agree with @syphar, having the smallest possible archive size is not a goal at the moment.

@jyn514
Copy link
Member

jyn514 commented Jun 24, 2021

start rebuilding the newest version for all creates ( probably after #343 ?)

Alternatively, we could keep the defaults the same and only rebuild the default platform.

@bjornharrtell
Copy link

Unless I'm mistaken I think https://github.com/lz4/lz4 supports random access.

@syphar
Copy link
Member Author

syphar commented Jul 21, 2021

Unless I'm mistaken I think https://github.com/lz4/lz4 supports random access.
lz4 looks like a normal stream compression format, how would adding multiple files look like in it?
That would mean we have to store the blob and the index side-by-side.

Using a container-format like ZIP has the advantage of being self-contained.

@danieldjewell
Copy link

I wanted the archive to be a standard archive format which also supports range request. And for that there is only ZIP.

Found https://github.com/martinellimarco/t2sz from facebook/zstd#395

Please correct me if I'm wrong, but to my knowledge .tar.zstd doesn't support range requests. in the end we want to fetch only the part of the archive that represents the file that we want to show.

If it's compressed with t2sz then, in theory, you could use range requests to get the specific ZSTD block and then decompress that... although I'm not 100% sure.

I assume we would use ratarmount / indexed_zstd and let it make range-requests to S3 after every seek.

Not to throw a wrench in the works at this point, but I'm not sure that from an architectural point of view this is a good idea... (See my comment in #174 about the compression ratios of a solid ZSTD archive and of a t2sz converted/indexed archive.)

Indexed GZIP is used widely in the Bioinformatics world for storing genetic data -- see https://github.com/samtools/htslib/blob/develop/bgzip.c ... BGZF is an indexed GZIP variant that supports random access. (See https://github.com/samtools/hts-specs and http://samtools.github.io/hts-specs/SAMv1.pdf for info on how this is applied).

Another indexed binary format very widely used in the Meteorology world is GRIB2 (Gridded Binary 2) -- the main output format for forecast systems like NOAA/NWS's GFS (Global Forecast System) ... See https://nomads.ncep.noaa.gov/txt_descriptions/fast_downloading_grib_doc.shtml for an example.

The GRIB2 format uses a simple index that looks like this:

1:0:d=2021081006:PRMSL:mean sea level:anl: 
2:878656:d=2021081006:CLWMR:1 hybrid level:anl: 
3:979138:d=2021081006:ICMR:1 hybrid level:anl: 
4:1168467:d=2021081006:RWMR:1 hybrid level:anl: 
5:1438719:d=2021081006:SNMR:1 hybrid level:anl: 
6:1518142:d=2021081006:GRLE:1 hybrid level:anl: 
7:1568632:d=2021081006:REFD:1 hybrid level:anl: 
8:2442948:d=2021081006:REFD:2 hybrid level:anl:
9:3317324:d=2021081006:REFC:entire atmosphere:anl: 

The format is:

<record#>:<byteOffsetStart>:<date>:<DataFieldName>:<DataAtmosphericLevel>:<DataTimeOffset>:

(where DataFieldName is the type of data -- PRMSL = Pressure @ Mean Sea Level; Atmospheric Level = The pressure level of the prediction (200mb, 500mb, etc.); Offset = hours into the future (anl=analysis="at the time of the publication of this analysis"))

I would also look at THREDDS (https://github.com/Unidata/tds) and Siphon (https://github.com/Unidata/siphon) which allow for the subsetting/indexing of geospatial gridded forecast data (see https://unidata.github.io/python-training/gallery/500hpa_hght_winds/ -- xarray has some of the indexing and subsetting built right in -- it's transparent to the end user.)

I know these might seem unrelated but they're pretty complete (and complex) examples of ways that the issue of retrieving a small amount of data from a very large file have been successfully accomplished.

Since zstd is not compressing by file, but only the whole tar-archive I assume we cannot just fetch the range where we find our file, but we also have to fetch some per-archive directory/wordlist.

Yes, that would make sense to me - a directory/index would seemingly be required...

Our goal is to keep the current page speed, which means we are limited to a maximum of one S3 request per rustdoc file to fetch.

And this point is where I question the architecture... Because it seems like there's an opportunity to maybe reduce the number of S3 requests overall...

I see a couple of theoretical options:

  1. A crate's documentation is stored in an indexed/compressed tar format
    A. Retrieve the index file/data (perhaps stored in a database?)
    B. Determine what data is required
    C. Fetch, decompress, and serve up the file (or files -- if s3 supports an http range request with multiple ranges in a single request)

  2. Same as above or stored in a non-indexed format (server side processing)
    A. Retrieve the entire compressed archive
    B. Extract the data you need; serve it up
    C. (optional) cache extracted static decompressed files in memory if frequently accessed?

  3. Same as above but moving the decompression to the client side...
    A. Use JavaScript + an index file to request either all or part of the compressed documentation (all for caching... and also offline access)
    B. Extract the content client-side and display
    C. (use some logic to decide if sending all the docs is a good idea -- sending 20MB compressed might be excessive, unless someone is working on an stm32 project for example)

Option 3 offers some interesting options -- if the entire crate compressed documentation is sent to the client and if that client views more than one page, you're saving S3 requests, are you not? Likewise, if that compressed data is cached client side and someone frequently accesses it.... No requests required (other than maybe a "hey is my locally cached copy still up-to-date?" query).

(Sorry this got kinda long, but I wanted to be clear and complete 😁)

@danieldjewell
Copy link

After digging in a bit further, I was able to (pretty easily, given that I'm pretty much a Rust novice) add an archival step (outputting a tar.zst) to the build process in src/docbuilder/rustwide_builder.rs

I did not realize, however, that the output is not flat and complete at that point -- it relies on the web server component of docs.rs to actually render valid HTML. (But perhaps there's an easy way to make static docs... cargo doc outputs static HTML)

I guess the point is that storing the generated doctree (e.g. the files in rustdoc/{name}/{version}/) is pretty straightforward. Retrieving them within the context of docs.rs's web server is another story.

Random question: Does the DB/Webserver component of Docs.rs really offer a huge benefit in terms of end-user functionality that couldn't be handled with static files/client side AJAX? Most documentation generators (e.g. Sphinx, etc.) generate to static HTML and then that's what gets served up on github.io or whereever... (Not saying that's the only way, but... there is a major advantage: speed).

Maybe the direction for docs.rs should be to move the web serving to a web server (take your pick) and have docs.rs generate content and/or act as an API for the database stuff. I recognize that's a pretty big structural change but it would also offer some flexibility/scalability options. (I don't know that having clients access S3 directly is always a good idea [S3 is OK as a CDN I guess...] ... but the performance of apache/nginx when serving up static files is incredible compared to dynamic content... I also have to believe it's way faster than processing the requests through docs.rs's webserver.)

@syphar
Copy link
Member Author

syphar commented Aug 11, 2021

@jyn514 some notes on this merge here:

since the last review is some time ago, it's perhaps best not to look through the changes, but the whole PR again, but that's up to you

here some things that are changed (from memory, so probably incomplete)

  • I moved some rustdoc/source specific logic into the storage layer, and now the web-handlers look better IMHO
  • used storage technique for new builds is a setting
  • index files are smaller now (due to CBOR)
  • test-fakes can generate releases with archive-storage
  • the build is tested with plain file-storage and archive-storage separately
  • web::source and web::rustdoc tests are tested with both storage techniques for the tests that I guessed use storage. ( I used test case parametrization via test_case for this, I'm happy to implement better ideas :) )

There are some of the rustdoc tests failing, I'll dig into this.

@jyn514
Copy link
Member

jyn514 commented Aug 15, 2021

I did not realize, however, that the output is not flat and complete at that point -- it relies on the web server component of docs.rs to actually render valid HTML. (But perhaps there's an easy way to make static docs... cargo doc outputs static HTML)

No, there is not an easy way to make it static: the header is dynamic and changes whenever a new version is published.

Random question: Does the DB/Webserver component of Docs.rs really offer a huge benefit in terms of end-user functionality that couldn't be handled with static files/client side AJAX?

Well, I personally would be very disappointed if the 'go to latest version' link went away and you had to explicitly click through to /crate to see what the latest version is. People have enough trouble finding /crate already, we don't need to move more core functionality there.

If we do this through AJAX only, that means people can no longer disable javascript. See #845 (comment) for an idea I had about how to avoid that, but it's completely separate from compression.

Option 3 offers some interesting options -- if the entire crate compressed documentation is sent to the client and if that client views more than one page, you're saving S3 requests, are you not? Likewise, if that compressed data is cached client side and someone frequently accesses it.... No requests required (other than maybe a "hey is my locally cached copy still up-to-date?" query).

This is only feasible if the crate docs are small enough to send to the client. We have several crates with millions of pages and multiple gigabytes of docs. We've discussed caching the whole archive on the build server in the past, it should be feasible as long as the archive is small enough: #1004 (comment).

Please move discussion about rethinking the entire docs.rs website to the tracking issue (#1004) instead of the PR, it makes it harder to find what progress is being made.

@rust-lang rust-lang locked as off-topic and limited conversation to collaborators Aug 15, 2021
Copy link
Member

@jyn514 jyn514 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a bunch of implementation details I commented on but this looks like a good approach :) I tested it locally and it works great. I think we're getting close.

.github/workflows/ci.yml Outdated Show resolved Hide resolved
src/docbuilder/rustwide_builder.rs Outdated Show resolved Hide resolved
src/docbuilder/rustwide_builder.rs Outdated Show resolved Hide resolved
src/storage/archive_index.rs Outdated Show resolved Hide resolved
src/storage/archive_index.rs Outdated Show resolved Hide resolved
src/storage/mod.rs Show resolved Hide resolved
src/storage/mod.rs Show resolved Hide resolved
src/test/fakes.rs Outdated Show resolved Hide resolved
src/web/source.rs Show resolved Hide resolved
src/config.rs Outdated Show resolved Hide resolved
@jyn514 jyn514 mentioned this pull request Aug 16, 2021
src/storage/mod.rs Show resolved Hide resolved
src/web/source.rs Show resolved Hide resolved
src/web/source.rs Outdated Show resolved Hide resolved
src/config.rs Outdated Show resolved Hide resolved
src/docbuilder/rustwide_builder.rs Outdated Show resolved Hide resolved
@jyn514
Copy link
Member

jyn514 commented Sep 9, 2021

r=me with the nit fixed :) tests are failing for some reason though:

---- web::rustdoc::test::test_no_trailing_target_slash::_true stdout ----
 [2021-08-18T20:43:43Z ERROR docs_rs::web::page::templates] Failed to load rustc resource suffix: missing rustc version
    
    Stack backtrace:
       0: docs_rs::web::page::templates::load_rustc_resource_suffix
                 at ./src/web/page/templates.rs:84:9
       1: docs_rs::web::page::templates::load_templates
                 at ./src/web/page/templates.rs:138:23
       2: docs_rs::web::page::templates::TemplateData::new
                 at ./src/web/page/templates.rs:35:46
       3: docs_rs::web::Server::start
                 at ./src/web/mod.rs:434:38
        4: docs_rs::test::TestFrontend::new
                 at ./src/test/mod.rs:379:21

@jyn514
Copy link
Member

jyn514 commented Sep 9, 2021

Hmm, they work for me locally - let me rerun it and see if it helps.

@syphar
Copy link
Member Author

syphar commented Sep 9, 2021

r=me with the nit fixed :) tests are failing for some reason though:

---- web::rustdoc::test::test_no_trailing_target_slash::_true stdout ----
 [2021-08-18T20:43:43Z ERROR docs_rs::web::page::templates] Failed to load rustc resource suffix: missing rustc version
    
    Stack backtrace:
       0: docs_rs::web::page::templates::load_rustc_resource_suffix
                 at ./src/web/page/templates.rs:84:9
       1: docs_rs::web::page::templates::load_templates
                 at ./src/web/page/templates.rs:138:23
       2: docs_rs::web::page::templates::TemplateData::new
                 at ./src/web/page/templates.rs:35:46
       3: docs_rs::web::Server::start
                 at ./src/web/mod.rs:434:38
        4: docs_rs::test::TestFrontend::new
                 at ./src/test/mod.rs:379:21

I'm still working through the test failures, wasn't able to do as much as I wanted the last weeks.

@jyn514
Copy link
Member

jyn514 commented Sep 9, 2021

@syphar I did some debugging - the target-redirect page is still working correctly, the problem is that

let blob = match storage.fetch_rustdoc_file(&name, &version, &path, krate.archive_storage) {
is returning false for path = x86_64-apple-darwin/dummy/index.html, name = dummy, version = 0.2.0 (whether or not archive_storage is true).

Which is wrong because that file is still being uploaded:

[2021-09-09T13:27:12Z DEBUG docs_rs::test::fakes] adding directory Rustdoc from /tmp/docs.rs-fakePzis7K
[2021-09-09T13:27:12Z DEBUG docs_rs::test::fakes] added rustdoc files [["text/html","dummy/index.html"]]
[2021-09-09T13:27:12Z DEBUG docs_rs::test::fakes] writing file /tmp/docs.rs-fakeQ4D9hQ/x86_64-apple-darwin/dummy/index.html
[2021-09-09T13:27:12Z DEBUG docs_rs::test::fakes] adding directory Rustdoc from /tmp/docs.rs-fakeQ4D9hQ/x86_64-apple-darwin
[2021-09-09T13:27:12Z DEBUG docs_rs::test::fakes] added platform files for x86_64-apple-darwin
[2021-09-09T13:27:12Z DEBUG docs_rs::db::add_package] Adding package into database
[2021-09-09T13:27:12Z INFO  docs_rs::db::add_package] Updating crate data for dummy
[2021-09-09T13:27:12Z DEBUG docs_rs::db::add_package] Adding build into database
[2021-09-09T13:27:12Z DEBUG docs_rs::web::rustdoc] got error serving x86_64-apple-darwin/dummy/index.html: path not found

In particular, this query is returning no rows:

self.pool.get()?.query(

for path = rustdoc/dummy/0.2.0/x86_64-apple-darwin/dummy/index.html

src/test/fakes.rs Outdated Show resolved Hide resolved
@syphar
Copy link
Member Author

syphar commented Sep 9, 2021

In web::rustdoc we're down to only one failing test I need to dig into ( test_no_trailing_target_slash)

@syphar
Copy link
Member Author

syphar commented Sep 9, 2021

In web::rustdoc we're down to only one failing test I need to dig into ( test_no_trailing_target_slash)

Note to myself, I found it (not fixed yet).

Problem is that in FakeRelease.create we're I'm calling upload_files for rustdoc multiple times. In upload_files I then create a new archive for each call.
I probably can refactor it so we only create one archive which contains the files for all platforms.

So, only a test-fake issue, likely not prod.

@syphar syphar marked this pull request as ready for review September 12, 2021 10:38
@syphar syphar requested a review from jyn514 September 12, 2021 10:38
@syphar syphar added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments labels Sep 12, 2021
@syphar
Copy link
Member Author

syphar commented Sep 12, 2021

@jyn514 tests are green (🎉 ), this is ready for another review.

I would propose before merging this I'll squash it into a single commit.

I have some open points in the potential next steps, but as I understood I would tackle them when this is merged.

@jyn514 jyn514 merged commit dd650cb into rust-lang:master Sep 12, 2021
@syphar syphar deleted the compressed-storage branch September 12, 2021 16:35
@rust-lang rust-lang unlocked this conversation Sep 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend Area: Webserver backend A-builds Area: Building the documentation for a crate S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compress documentation per-crate, not per-file
5 participants