This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

Backup to common cloud storage #89

Open · 7 of 9 tasks
tennix opened this issue Dec 8, 2019 · 12 comments
Labels: difficulty/3-hard (Hard issue), Priority/P1 (High priority issue. Must have an associated milestone)

Comments

@tennix (Member) commented Dec 8, 2019

BR support for common cloud storage

Overview

Integrate BR with common cloud object storage (S3, GCS, Azure Blob Storage, etc.).

Problem statement

Currently, BR supports local storage, where backup files are stored in a local directory on each node. The backup files then need to be collected together and copied to every TiKV node, which is difficult to use in practice. One workaround is to mount an NFS-like filesystem on every TiKV node and the BR node, but mounting NFS on every node is difficult to set up and error-prone.

Object storage is a better fit for this scenario, especially since it is quite common to back up to S3/GCS on public cloud.

TODO list

  • S3 Support (2400 points)
  • GCS Support (1800 points)
  • TiDB Operator integration (900 points)
  • Test (2100 points)

Mentors

Recommended skills

  • Go language
  • Rust language
  • Familiarity with S3/GCS/Azure Blob Storage
@yiwu-arbug (Contributor)

I want to join in on this task!

@SunRunAway (Contributor)

@francis0407 and I are also interested in this issue.

@gregwebs (Contributor) commented Dec 9, 2019

Another task that can be added to this is to ensure proper streaming to object storage without copying data in memory.

@tennix (Member, Author) commented Dec 10, 2019

> Another task that can be added to this is to ensure proper streaming to object storage without copying data in memory.

What do you mean by "without copying data in memory"?

@gregwebs (Contributor)

I guess if streaming is working well then perhaps it isn't necessary to state this. But we should avoid copying wherever possible. The current implementation uses read_to_end, which I presume copies the data every time it resizes its vector.
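
For illustration, a minimal sketch of the difference (the file name and the destination are hypothetical; a real uploader would stand in for the output file):

    use std::fs::File;
    use std::io::{self, Read};

    fn example() -> io::Result<()> {
        // read_to_end grows the Vec geometrically; each reallocation
        // copies every byte read so far, on top of holding the entire
        // file in memory at once.
        let mut file = File::open("backup.sst")?; // hypothetical file
        let mut buf = Vec::new();
        file.read_to_end(&mut buf)?;

        // Streaming alternative: io::copy pumps the data through a
        // small fixed-size buffer, so memory use stays constant
        // regardless of file size.
        let mut src = File::open("backup.sst")?;
        let mut dst = File::create("backup.sst.copy")?; // stand-in for an uploader
        io::copy(&mut src, &mut dst)?;
        Ok(())
    }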

@SunRunAway (Contributor)

> Another task that can be added to this is to ensure proper streaming to object storage without copying data in memory.

> What do you mean by "without copying data in memory"?

I think it means uploading a file from disk without buffering all of the file data in memory.

@gregwebs (Contributor)

> I think it means uploading a file from disk without buffering all of the file data in memory.

Even though it is an SST file, I think it starts out in memory?

@kennytm (Collaborator) commented Dec 10, 2019

The writer interface on TiKV is currently

    fn write(&self, name: &str, reader: &mut dyn Read) -> io::Result<()>;

so the cloud storage implementation decides how to read the SST file and upload it. For local storage we use Rust's built-in std::io::copy, which streams up to 8 KiB at a time.
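
For illustration, a streaming implementation of this interface for a local-directory backend could look roughly like the following sketch (LocalStorage and its base field are made up for the example):

    use std::fs::File;
    use std::io::{self, Read};
    use std::path::PathBuf;

    // Hypothetical local-directory backend, for illustration only.
    struct LocalStorage {
        base: PathBuf,
    }

    impl LocalStorage {
        fn write(&self, name: &str, reader: &mut dyn Read) -> io::Result<()> {
            let mut file = File::create(self.base.join(name))?;
            // io::copy streams through a small fixed-size buffer, so
            // the SST content is never fully buffered in memory.
            io::copy(reader, &mut file)?;
            file.sync_all()
        }
    }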

@yiwu-arbug (Contributor)

> I think it means uploading a file from disk without buffering all of the file data in memory.

> Even though it is an SST file, I think it starts out in memory?

Just checked, and the file content is in memory before being uploaded.

@gregwebs (Contributor)

@yiwu-arbug when the file is created in memory, can it be created in a streaming fashion?

@kennytm (Collaborator) commented Dec 13, 2019

@gregwebs The SST writer currently uses in-memory storage to speed up SST generation. This is the first copy.

https://github.com/tikv/tikv/blob/b147b38eeaae4129e7e3eb7c06e270b8e0682697/components/backup/src/writer.rs#L108-L117

We then write the content into a &mut Vec<u8> buffer. This is the second copy.

https://github.com/tikv/tikv/blob/b147b38eeaae4129e7e3eb7c06e270b8e0682697/components/engine_rocks/src/sst.rs#L209-L212

The buffer is turned into a rate-limited reader and passed into write(). In tikv/tikv#6209, for simplicity, read_to_end() is used to extract the entire content into a &mut Vec<u8> again. This is the third copy.

https://github.com/tikv/tikv/blob/b826af388050fc48de2ce36a7684d1309d00e83a/components/external_storage/src/s3.rs#L94

So in the current situation we have to deal with 3N bytes at a time. The second and third copies can be eliminated by streaming, which reduces the memory usage to N bytes (assuming buffer size ≪ N).

Typically N = region size = 96 MB, and the default concurrency is 4, so we're talking about ~1200 MB vs ~400 MB of memory here.
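
To make the idea concrete, here is a sketch of replacing the read_to_end() step with a fixed-size chunk loop (pump and the 1 MiB chunk size are illustrative, not the actual tikv/tikv#6209 code):

    use std::io::{self, Read, Write};

    const CHUNK: usize = 1024 * 1024; // 1 MiB; an arbitrary choice

    // Stream `reader` into `uploader` without ever holding more than
    // CHUNK bytes in memory, instead of read_to_end()-ing the whole SST.
    fn pump(reader: &mut dyn Read, uploader: &mut dyn Write) -> io::Result<u64> {
        let mut buf = vec![0u8; CHUNK];
        let mut total = 0u64;
        loop {
            let n = reader.read(&mut buf)?;
            if n == 0 {
                return Ok(total);
            }
            // Hand each chunk off as soon as it is read; nothing
            // beyond the CHUNK-sized buffer is retained.
            uploader.write_all(&buf[..n])?;
            total += n as u64;
        }
    }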


@yiwu-arbug That said, for tikv/tikv#6209, the advantage of streaming over read_to_end() isn't about the 800 MB of memory, but that the rate limit is really being respected.

Suppose we set the rate limit to 10 MB/s. With streaming, we will upload 1 MB every 0.1 s and attain the average speed of 10 MB/s uniformly. With read_to_end(), however, we will sleep for 9.6 s and then suddenly upload the entire 96 MB file at maximum speed. This doesn't sound like the desired behavior.
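
For illustration, a minimal sketch of a streaming copy that keeps the average speed at the limit (not TiKV's actual limiter; rate and the 1 MiB buffer are assumptions):

    use std::io::{self, Read, Write};
    use std::thread::sleep;
    use std::time::{Duration, Instant};

    // Copy `reader` into `writer` in 1 MiB chunks, sleeping as needed
    // so the average throughput never exceeds `rate` bytes per second.
    fn rate_limited_copy(
        reader: &mut dyn Read,
        writer: &mut dyn Write,
        rate: f64, // e.g. 10.0 * 1024.0 * 1024.0 for ~10 MB/s
    ) -> io::Result<()> {
        let mut buf = vec![0u8; 1024 * 1024];
        let start = Instant::now();
        let mut sent = 0u64;
        loop {
            let n = reader.read(&mut buf)?;
            if n == 0 {
                return Ok(());
            }
            writer.write_all(&buf[..n])?;
            sent += n as u64;
            // Sleep until elapsed time catches up with sent/rate, so
            // the transfer stays uniform instead of sleeping up front
            // and then bursting at maximum speed.
            let due = Duration::from_secs_f64(sent as f64 / rate);
            if let Some(wait) = due.checked_sub(start.elapsed()) {
                sleep(wait);
            }
        }
    }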

@gregwebs (Contributor)

Yes, please get rid of read_to_end! My understanding is that it resizes its buffer, which means copying memory around during the operation, slowing things down further.

From a quick look at the TiKV source code, following things back to the source, I see the SST file being generated by impl SstWriter for RocksSstWriter using read_to_end! I am not sure if this is the right code path. However, the point is that it does seem we should be able to stream data out of RocksDB straight into S3 without buffering all the data and copying.

As you said, besides making memory usage predictable, it also makes network usage predictable. Additionally, if more network bandwidth is available, it becomes possible to increase the concurrency, whereas if we are loading up memory and periodically maxing out network usage, the backup has to stay throttled. The backup operation can slow down the rest of the queries by delaying GC, so the sooner it finishes the better.

kennytm added the difficulty/3-hard (Hard issue) and Priority/P1 (High priority issue. Must have an associated milestone) labels on May 28, 2020
kennytm added this to the v4.0.1 milestone on May 28, 2020
3pointer removed this from the v4.0.2 milestone on Jun 19, 2020
DanielZhangQD removed their assignment on Feb 20, 2023