docs: update README
PhotonQuantum committed Aug 21, 2023
1 parent 4551a3d commit 1a0c7f1
Showing 4 changed files with 85 additions and 61 deletions.
82 changes: 50 additions & 32 deletions README.md
* **rsync-fetcher** - fetches the repository from the remote server, and uploads it to s3.
* **rsync-gateway** - serves the mirrored repository from s3 over the **http** protocol.
* **rsync-gc** - periodically removes old versions of files from s3.
* **rsync-migration** - see the [Migration](#migration) section for more details.

## Example

1. Fetch the repository from the remote server and upload it to s3.
```bash
$ RUST_LOG=info RUST_BACKTRACE=1 AWS_ACCESS_KEY_ID=<ID> AWS_SECRET_ACCESS_KEY=<KEY> \
rsync-fetcher \
--src rsync://upstream/path \
--s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix prefix \
--pg-url postgres://user@localhost/db \
--namespace repo_name
```
2. Serve the repository over HTTP.
```bash
$ cat > config.toml <<-EOF
bind = ["localhost:8081"]
s3_url = "https://s3_api_endpoint"
s3_region = "region"
[endpoints."out"]
namespace = "repo_name"
s3_bucket = "bucket"
s3_prefix = "prefix"
EOF
$ RUST_LOG=info RUST_BACKTRACE=1 rsync-gateway <optional config file>
```
3. GC old versions of files manually.
```bash
$ RUST_LOG=info RUST_BACKTRACE=1 AWS_ACCESS_KEY_ID=<ID> AWS_SECRET_ACCESS_KEY=<KEY> \
rsync-gc \
--s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix prefix \
--pg-url postgres://user@localhost/db \
--keep 2
```
> It's recommended to keep at least 2 revisions in case a gateway is still using an old revision.
## Design
File data and their metadata are stored separately.
### Data
Files are stored in S3 storage, named by their blake2b-160 hash (`<prefix>/<namespace>/<hash>`).
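For illustration, a minimal sketch of deriving an object key under this scheme, assuming the `blake2` and `hex` crates and a hex-encoded digest (the encoding actually used by the fetcher may differ):

```rust
use blake2::digest::consts::U20;
use blake2::{Blake2b, Digest};

// blake2b with a 160-bit (20-byte) digest, matching the naming scheme above.
type Blake2b160 = Blake2b<U20>;

// Hypothetical helper: compute the S3 key `<prefix>/<namespace>/<hash>`.
fn object_key(prefix: &str, namespace: &str, contents: &[u8]) -> String {
    let hash = Blake2b160::digest(contents);
    format!("{prefix}/{namespace}/{}", hex::encode(hash))
}
```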
### Metadata
Metadata is stored in Postgres.
An object is the smallest unit of metadata. There are three types of objects:
- **File** - a regular file, with its hash, size and mtime
- **Directory** - a directory, with its size and mtime
- **Symlink** - a symlink, with its size, mtime and target
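A rough sketch of the three object kinds as a Rust enum; the field layout is illustrative only, not the actual Postgres schema:

```rust
enum Object {
    /// A regular file: blake2b-160 hash, size in bytes, and mtime.
    File { hash: [u8; 20], size: u64, mtime: i64 },
    /// A directory, with its size and mtime.
    Directory { size: u64, mtime: i64 },
    /// A symlink, with its size, mtime, and target path.
    Symlink { size: u64, mtime: i64, target: String },
}
```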
Objects (files, directories and symlinks) are organized into revisions, which are immutable. Each revision has a unique
id, while an object may appear in multiple revisions. Revisions are further organized into repositories (namespaces),
like `debian`, `ubuntu`, etc. Repositories are mutable.
A revision can be in one of the following states:
- **Live** - a live revision is a revision in production, which is ready to be served. There can be multiple live
revisions, but only the latest one is served by the gateway.
- **Partial** - a partial revision is a revision that is still being updated. It's not ready to be served yet.
- **Stale** - a stale revision is a revision that is no longer in production, and is ready to be garbage collected.
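The lifecycle and serving rule can be sketched as follows (types and fields are illustrative, not the actual schema):

```rust
#[derive(Clone, Copy, PartialEq)]
enum RevisionStatus {
    Live,    // in production, ready to be served
    Partial, // still being updated by a fetcher
    Stale,   // out of production, awaiting garbage collection
}

struct Revision {
    id: i64,
    status: RevisionStatus,
    created_at: i64, // e.g. a unix timestamp
}

/// Of all live revisions in a repository, the gateway serves only the latest.
fn serving_revision(revisions: &[Revision]) -> Option<&Revision> {
    revisions
        .iter()
        .filter(|r| r.status == RevisionStatus::Live)
        .max_by_key(|r| r.created_at)
}
```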
## Migration
### Migrating from v0.3.x to v0.4.x
v0.4.x switched from Redis to Postgres for storing metadata, greatly improving the performance of many operations and
reducing storage usage.
Use `rsync-migration redis-to-pg` to migrate old metadata to the new database. Note that you can only migrate from
v0.3.x to v0.4.x, and you can't migrate from v0.2.x to v0.4.x directly.
The old Redis database is not modified.
### Migrating from v0.2.x to v0.3.x
v0.3.x uses a new encoding for file metadata, which is incompatible with v0.2.x. Trying to use v0.3.x on old data will
fail.
Use `rsync-migration upgrade-encoding` to upgrade the encoding.
This is a destructive operation, so make sure you have a backup of the database before running it. It does nothing
without the `--do` flag.
After v0.3.0, all commands use the new encoding. You can still use this tool to migrate old data to the new encoding;
trying to use the new commands on old data will fail.
The new encoding was actually introduced in v0.2.12 by accident. `rsync-gateway` between v0.2.12 and v0.3.0 can't parse
old metadata correctly and returns garbage data. No data is lost though, so if you used any version between v0.2.12 and
v0.3.0, you can still use `rsync-migration` to migrate to the new encoding.
23 changes: 11 additions & 12 deletions rsync-fetcher/README.md
# rsync-fetcher

This is a rsync receiver implementation. Simply put, it's much like rsync, but saves the files to s3 and metadata to
the database instead of to a filesystem.

## Features


## Implementation Details

1. Connect to Postgres and S3, check if there's already another instance (fetcher, gc) running.
2. Fetch file list from rsync server.
3. Calculate the delta between the remote file list and the local file set, which is the union of files in all live
   and partial revisions.
4. Create a new partial revision.
5. Start the generator and receiver tasks.
6. After both tasks complete, update some metadata (parent links) to speed up directory listings.
7. Commit the partial revision to production (sketched below).
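Step 7 amounts to an atomic status flip in Postgres. A minimal sketch using `sqlx`, with a hypothetical `revisions` table and column names (the real schema and query differ):

```rust
use sqlx::PgPool;

/// Hypothetical commit step: promote a partial revision to live in a single
/// statement, so the gateway's watcher observes the change atomically.
async fn commit_revision(pool: &PgPool, revision_id: i64) -> sqlx::Result<()> {
    sqlx::query("UPDATE revisions SET status = 'live' WHERE id = $1 AND status = 'partial'")
        .bind(revision_id)
        .execute(pool)
        .await?;
    Ok(())
}
```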

Generator task:

1. Generates a list of files to be fetched, and sends them to the rsync server.
2. If a file exists in an existing live or partial revision, it downloads the file, calculates the rolling checksum,
   and additionally sends the checksum to the rsync server (see the sketch below).
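For reference, the classic rsync weak rolling checksum looks roughly like this. This is a sketch of the textbook algorithm from the rsync paper, not this crate's actual implementation:

```rust
/// Weak rolling checksum from the rsync algorithm: two 16-bit sums packed
/// into a u32. The "rolling" property lets a window slide one byte at a
/// time without rescanning the whole block.
fn weak_checksum(block: &[u8]) -> u32 {
    let mut a: u32 = 0; // plain sum of bytes (mod 2^16)
    let mut b: u32 = 0; // position-weighted sum (mod 2^16)
    for (i, &byte) in block.iter().enumerate() {
        a = a.wrapping_add(byte as u32);
        b = b.wrapping_add(((block.len() - i) as u32).wrapping_mul(byte as u32));
    }
    (a & 0xffff) | ((b & 0xffff) << 16)
}
```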

Receiver task:

Uploader task:

1. Takes files downloaded by the receiver task and uploads them to S3.
2. After uploading a file, updates the partial revision.
27 changes: 19 additions & 8 deletions rsync-gateway/README.md
# rsync-gateway

`rsync-gateway` serves the rsync repository on S3 over HTTP, using the metadata stored in the database.

## Implementation Details

1. Connect to Postgres.
2. Spawn a watcher task to watch for the latest revision.
3. For each request, check if there's a cache hit. Return the cached response if there is.
4. Otherwise, try to resolve the path in the revision. If the path is a directory, render the directory listing. If
   the path is a file, pre-sign the file on S3 and redirect to the pre-signed URL. Symlinks are followed (see the
   sketch below).
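A sketch of the resolution logic in step 4, using an in-memory map as a stand-in for the Postgres metadata. The real gateway queries the database, and relative symlink targets need path normalization that is omitted here:

```rust
use std::collections::HashMap;

enum Entry {
    Directory,
    File { hash: String },
    Symlink { target: String },
}

enum Resolved {
    Listing(String),  // render a directory listing for this path
    Redirect(String), // pre-sign this S3 key and redirect the client
    NotFound,
}

fn resolve(index: &HashMap<String, Entry>, path: &str) -> Resolved {
    let mut current = path.to_owned();
    // Bound the number of symlink hops so cycles can't loop forever.
    for _ in 0..40 {
        match index.get(&current) {
            Some(Entry::Directory) => return Resolved::Listing(current),
            Some(Entry::File { hash }) => {
                return Resolved::Redirect(format!("prefix/namespace/{hash}"))
            }
            Some(Entry::Symlink { target }) => current = target.clone(),
            None => return Resolved::NotFound,
        }
    }
    Resolved::NotFound
}
```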

## More details on the cache

There are two layers of cache: L1 and L2. Both are in-memory caches implemented using `moka`, a concurrent LRU cache.

The L1 cache holds raw resolved entries, while the L2 cache holds compressed entries. The L2 cache reduces memory
usage, since raw resolved entries can be quite large when a directory contains many files.

The L1 cache is 32 MB and the L2 cache is 128 MB. Entries holding pre-signed URLs carry a TTL in both caches, set to
half the pre-signed URL expiration time.

It's a NINE (non-inclusive, non-exclusive) cache: eviction from the L1 and L2 caches is independent and asynchronous.
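A sketch of how two such `moka` layers could be built with the sizes and TTL rule above; the key/value types are simplified stand-ins for the gateway's actual entry types:

```rust
use std::time::Duration;

use moka::sync::Cache;

fn build_caches(presign_expiry: Duration) -> (Cache<String, Vec<u8>>, Cache<String, Vec<u8>>) {
    // TTL is half the pre-signed URL expiration time, per the rule above.
    let ttl = presign_expiry / 2;
    let l1 = Cache::builder()
        .max_capacity(32 * 1024 * 1024) // 32 MB of raw resolved entries
        .weigher(|_key: &String, value: &Vec<u8>| value.len() as u32) // size-aware eviction
        .time_to_live(ttl)
        .build();
    let l2 = Cache::builder()
        .max_capacity(128 * 1024 * 1024) // 128 MB of compressed entries
        .weigher(|_key: &String, value: &Vec<u8>| value.len() as u32)
        .time_to_live(ttl)
        .build();
    (l1, l2)
}
```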
14 changes: 5 additions & 9 deletions rsync-gc/README.md

## Implementation Details

1. Connect to Postgres and S3, check if there's already another instance (fetcher, gc) running.
2. Enumerate all production revisions and filter out ones to be removed. Mark them as stale.
3. Delete object files that are not referenced by any live or partial revision.
   > Note that this is calculated by
   >
   > Sigma_stale(key.hash) - Sigma_live(key.hash) - Sigma_partial(key.hash)
   >
   > because we don't have a way to get the universe set of all keys in S3. The set arithmetic is sketched below.
4. Remove stale revisions from Postgres.
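The set arithmetic from the note, sketched in Rust. Hash sets stand in for the per-revision key listings pulled from Postgres:

```rust
use std::collections::HashSet;

/// Hashes referenced only by stale revisions are safe to delete from S3;
/// anything also referenced by a live or partial revision must be kept.
fn hashes_to_delete(
    stale: &HashSet<String>,
    live: &HashSet<String>,
    partial: &HashSet<String>,
) -> HashSet<String> {
    stale
        .iter()
        .filter(|hash| !live.contains(*hash) && !partial.contains(*hash))
        .cloned()
        .collect()
}
```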
