From 0b82d5a3d37615ff224a061dcdd1dac43e4dfc61 Mon Sep 17 00:00:00 2001
From: LightQuantum
Date: Tue, 22 Aug 2023 03:46:00 +0800
Subject: [PATCH] docs: update README

---
 README.md               | 82 +++++++++++++++++++++++++----------------
 rsync-fetcher/README.md | 23 ++++++------
 rsync-gateway/README.md | 27 ++++++++++----
 rsync-gc/README.md      | 14 +++----
 4 files changed, 85 insertions(+), 61 deletions(-)

diff --git a/README.md b/README.md
index 59d0a49..9083a2d 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@ versions older than 2.6.0 are supported.
 * **rsync-fetcher** - fetches the repository from the remote server, and uploads it to s3.
 * **rsync-gateway** - serves the mirrored repository from s3 in **http** protocol.
 * **rsync-gc** - periodically removes old versions of files from s3.
-* **rsync-fix-encoding** - see "Migrating from v0.2.11 to older versions" section.
+* **rsync-migration** - see [Migration](#migration) section for more details.
 
 ## Example
 
@@ -28,35 +28,34 @@ $ RUST_LOG=info RUST_BACKTRACE=1 AWS_ACCESS_KEY_ID= AWS_SECRET_ACCESS_KEY= \
   rsync-fetcher \
   --src rsync://upstream/path \
-  --s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix repo_name \
-  --redis redis://localhost --redis-namespace repo_name \
-  --repository repo_name
-  --gateway-base http://localhost:8081/repo_name
+  --s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix prefix \
+  --pg-url postgres://user@localhost/db \
+  --namespace repo_name
 ```
 
 2. Serve the repository over HTTP.
 
    ```bash
   $ cat > config.toml <<-EOF
   bind = ["localhost:8081"]
+  s3_url = "https://s3_api_endpoint"
+  s3_region = "region"
 
   [endpoints."out"]
-  redis = "redis://localhost"
-  redis_namespace = "test"
-  s3_website = "http://localhost:8080/test/test-prefix"
+  namespace = "repo_name"
+  s3_bucket = "bucket"
+  s3_prefix = "prefix"
   EOF
   $ RUST_LOG=info RUST_BACKTRACE=1 rsync-gateway
   ```
-
-3. GC old versions of files periodically.
+3. GC old versions of files manually.
 
    ```bash
   $ RUST_LOG=info RUST_BACKTRACE=1 AWS_ACCESS_KEY_ID= AWS_SECRET_ACCESS_KEY= \
     rsync-gc \
     --s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix repo_name \
-    --redis redis://localhost --redis-namespace repo_name \
-    --keep 2
+    --pg-url postgres://user@localhost/db
   ```
 
- > It's recommended to keep at least 2 versions of files in case a gateway is still using an old revision.
+ > It's recommended to keep at least 2 revisions in case a gateway is still using an old revision.
 
 ## Design
 
@@ -64,31 +63,50 @@ File data and their metadata are stored separately.
 
 ### Data
 
-Files are stored in S3 storage, named by their blake2b-160 hash (``).
-
-Listing html pages are stored in `/listing-//index.html`.
+Files are stored in S3 storage, named by their blake2b-160 hash (`//`).
 
 ### Metadata
 
-Metadata is stored in Redis for fast access.
+Metadata is stored in Postgres.
+
+An object is the smallest unit of metadata. There are three types of objects:
+- **File** - a regular file, with its hash, size and mtime
+- **Directory** - a directory, with its size and mtime
+- **Symlink** - a symlink, with its size, mtime and target
+
+Objects (files, directories and symlinks) are organized into revisions, which are immutable. Each revision has a unique
+id, while an object may appear in multiple revisions. Revisions are further organized into repositories (namespaces),
+like `debian`, `ubuntu`, etc. Repositories are mutable.
+
+A revision can be in one of the following states:
+
+- **Live** - a live revision is a revision in production, which is ready to be served. There can be multiple live
+  revisions, but only the latest one is served by the gateway.
+- **Partial** - a partial revision is a revision that is still being updated. It's not ready to be served yet.
+- **Stale** - a stale revision is a revision that is no longer in production, and is ready to be garbage collected.
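The revision lifecycle introduced above can be pictured as a small state machine. The sketch below is illustrative only: the type and method names are assumptions made for this example and are not taken from the actual codebase or schema.

```rust
/// Illustrative model of the revision states described above.
/// All names here are assumptions for this sketch, not the real schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RevisionStatus {
    Live,    // in production; only the latest live revision is served
    Partial, // still being written by a fetcher; not servable yet
    Stale,   // out of production; eligible for garbage collection
}

impl RevisionStatus {
    /// Only live revisions are candidates for serving.
    pub fn servable(self) -> bool {
        matches!(self, RevisionStatus::Live)
    }

    /// Assumed transitions: a fetcher commits Partial -> Live,
    /// and gc marks Live -> Stale.
    pub fn can_transition_to(self, next: RevisionStatus) -> bool {
        use RevisionStatus::*;
        matches!((self, next), (Partial, Live) | (Live, Stale))
    }
}

fn main() {
    // A partial revision is committed, then later taken out of production.
    assert!(RevisionStatus::Partial.can_transition_to(RevisionStatus::Live));
    assert!(RevisionStatus::Live.can_transition_to(RevisionStatus::Stale));
    assert!(!RevisionStatus::Stale.servable());
}
```

Because revisions are immutable and objects are shared across revisions, only repository heads (which revision is the latest live one) ever need to change in place.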
+
+## Migration
+
+### Migrating from v0.3.x to v0.4.x
+
+v0.4.x switched from Redis to Postgres for storing metadata, greatly improving the performance of many operations and
+reducing storage usage.
+
+Use `rsync-migration redis-to-pg` to migrate old metadata to the new database. Note that you can only migrate from
+v0.3.x to v0.4.x, and you can't migrate from v0.2.x to v0.4.x directly.
 
-Note that there are more than one file index in Redis.
+The old Redis database is not modified.
 
-- `:index:` - an index of the repository synced at ``.
-- `:partial` - a partial index that is still being updated and not committed yet.
-- `:partial-stale` - a temporary index that is used to store outdated files when updating the partial index.
- This might happen if you interrupt a synchronization, restart it, and some files downloaded in the first run are
- already outdated. It's ready to be garbage collected.
-- `:stale:` - an index that is taken out of production, and is ready to be garbage collected.
+### Migrating from v0.2.x to v0.3.x
 
-> Not all files in partial index should be removed. For example, if a file exists both in a stale index and a "live"
-> index, it should not be removed.
+v0.3.x uses a new encoding for file metadata, which is incompatible with v0.2.x. Trying to use v0.3.x on old data will
+fail.
 
-## Migrating from v0.2.11 to older versions
+Use `rsync-migration upgrade-encoding` to upgrade the encoding.
 
-There's a bug affecting all versions before v0.3.0 and after v0.2.11, which causes the file metadata to be read in a
-wrong format and silently corrupting the index. `rsync-fix-encoding` can be used to fix this issue.
+This is a destructive operation, so make sure you have a backup of the database before running it. It does nothing
+without the `--do` flag.
 
-After v0.3.0, all commands are using the new encoding.
-You can still use this tool to migrate old data to the new
-encoding. Trying to use the new commands on old data will now fail.
\ No newline at end of file
+The new encoding was actually introduced in v0.2.12 by accident. `rsync-gateway` between v0.2.12 and v0.3.0 can't parse
+old metadata correctly and returns garbage data. No data is lost though, so if you used any version between v0.2.12 and
+v0.3.0, you can still use `rsync-migration` to migrate to the new encoding.
\ No newline at end of file
diff --git a/rsync-fetcher/README.md b/rsync-fetcher/README.md
index d32d451..61ad6c3 100644
--- a/rsync-fetcher/README.md
+++ b/rsync-fetcher/README.md
@@ -1,7 +1,7 @@
 # rsync-fetcher
 
 This is a rsync receiver implementation. Simply put, it's much like rsync, but saves the files to s3 and metadata to
-redis instead of to a filesystem.
+the database instead of to a filesystem.
 
 ## Features
 
@@ -17,19 +17,20 @@
 
 ## Implementation Details
 
-1. Connect to Redis and S3, check if there's already another instance (fetcher, gc) running.
+1. Connect to Postgres and S3, check if there's already another instance (fetcher, gc) running.
 2. Fetch file list from rsync server.
-3. Calculate the delta between the remote file list and the local index, which is
- the union of current production index and last partial index (if any).
-4. Start generator and receiver task.
-5. After both tasks completed, generate file listing and upload to S3.
-6. Commit the partial index to production.
+3. Calculate the delta between the remote file list and local files, which is the union of files in all live and partial
+revisions.
+4. Create a new partial revision.
+5. Start generator and receiver tasks.
+6. After both tasks have completed, update some metadata (parent links) to speed up directory listing.
+7. Commit the partial revision to production.
 
 Generator task:
 
 1. Generates a list of files to be fetched, and sends them to the rsync server.
-2.
-If any file exists in the local index, it downloads the file, calculate the rolling checksum, and additionally sends
- the checksum to rsync server.
+2. If any file exists in an existing live or partial revision, it downloads the file, calculates the rolling checksum,
+and additionally sends the checksum to the rsync server.
 
 Receiver task:
 
@@ -40,6 +41,4 @@
 Uploader task:
 
 1. Take files downloaded by receiver task, and upload them to S3.
-2. After uploading a file, updates the partial index. If the file already exists in the partial index, check if the
- checksum matches. If not, put the old metadata into the partial-stale index, and update the partial index with the
- new metadata.
+2. After uploading a file, update the partial revision.
\ No newline at end of file
diff --git a/rsync-gateway/README.md b/rsync-gateway/README.md
index 6a004c8..9b29529 100644
--- a/rsync-gateway/README.md
+++ b/rsync-gateway/README.md
@@ -1,13 +1,24 @@
 # rsync-gateway
 
-`rsync-gateway` serves the rsync repository on S3 over HTTP, using the metadata stored in redis.
+`rsync-gateway` serves the rsync repository on S3 over HTTP, using the metadata stored in the database.
 
 ## Implementation Details
 
-1. Connect to Redis.
-2. Spawn a watcher task to watch for the latest index.
-3. For each request, if the path ends with a trailing slash, it's a directory listing request. Otherwise, it's a file
- request.
-4. For directory listing requests, redirect to `/index.html` on S3.
-5. For file requests, check if the file exists in the index. If not, return 404. Otherwise, redirect to the file on S3.
- Symlinks are resolved on the gateway side.
\ No newline at end of file
+1. Connect to Postgres.
+2. Spawn a watcher task to watch for the latest revision.
+3. For each request, check if there's a cache hit. Return the cached response if there is.
+4. Otherwise, try to resolve the path in the revision. If the path is a directory, render the directory listing.
+If the path is a file, pre-sign the file on S3 and redirect to the pre-signed URL. Symlinks are followed.
+
+## More details on the cache
+
+There are two layers of cache: L1 and L2. Both of them are in-memory caches implemented using `moka`, a concurrent
+caching library.
+
+The L1 cache stores raw resolved entries, while the L2 cache stores compressed entries. The L2 cache is used to reduce
+memory usage, since the raw resolved entries can be quite large when there are many files in a directory.
+
+The size of the L1 cache is 32MB, and the size of the L2 cache is 128MB. There's a TTL for pre-signed URLs in both
+caches, which is half the pre-signed URL expiration time.
+
+It's a NINE (non-inclusive, non-exclusive) cache: eviction from the L1 and L2 caches is independent and asynchronous.
\ No newline at end of file
diff --git a/rsync-gc/README.md b/rsync-gc/README.md
index 2128504..cb31fc5 100644
--- a/rsync-gc/README.md
+++ b/rsync-gc/README.md
@@ -4,14 +4,10 @@
 
 ## Implementation Details
 
-1. Connect to Redis and S3, check if there's already another instance (fetcher, gc) running.
-2. Enumerate all production indexes and filter out ones to be removed.
-3. Rename the indexes to be removed to `stale:`.
-4. Delete all listing files in S3 belonging to the indexes to be removed.
-5. Delete object files that are not referenced by any live index.
+1. Connect to Postgres and S3, check if there's already another instance (fetcher, gc) running.
+2. Enumerate all production revisions and filter out ones to be removed. Mark them as stale.
+3. Delete object files that are not referenced by any live or partial revision.
 
 > Note that this is calculated by
 >
- > Sigma_(stale) (key.hash) - Sigma_(alive) (key.hash)
- >
- > Because we don't have a way to get the universe set of all keys in S3.
-6. Remove stale indexes from Redis.
\ No newline at end of file
+ > Sigma_(stale) (key.hash) - Sigma_(live) (key.hash) - Sigma_(partial) (key.hash)
+4. Remove stale revisions from Postgres.
\ No newline at end of file
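As a footnote to the gc hunk above: the delete set is a plain set difference over object hashes, since S3 cannot be enumerated cheaply and only hashes referenced solely by stale revisions are safe to remove. A minimal sketch in Rust, assuming the per-state hash sets have already been loaded from Postgres (the function and variable names are invented for this example; the real gc works directly against the database):

```rust
use std::collections::HashSet;

/// Sketch of the gc candidate computation: a hash may be deleted only if it is
/// referenced by some stale revision and by no live or partial revision.
fn gc_candidates(
    stale: &HashSet<&str>,
    live: &HashSet<&str>,
    partial: &HashSet<&str>,
) -> HashSet<String> {
    stale
        .difference(live)                      // drop hashes still live
        .filter(|h| !partial.contains(*h))     // drop hashes in partial revisions
        .map(|h| h.to_string())
        .collect()
}

fn main() {
    // Hypothetical hashes: "b" is still live, "c" is in a partial revision.
    let stale = HashSet::from(["a", "b", "c"]);
    let live = HashSet::from(["b"]);
    let partial = HashSet::from(["c"]);
    let candidates = gc_candidates(&stale, &live, &partial);
    // Only "a" is unreferenced by live/partial revisions and safe to delete.
    assert_eq!(candidates, HashSet::from(["a".to_string()]));
}
```

This mirrors the quoted `Sigma_(stale) - Sigma_(live) - Sigma_(partial)` formula: subtracting live and partial references is what keeps shared, deduplicated objects from being deleted out from under a serving gateway or a running fetcher.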