BulkDump Framework #11780
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,96 @@ | ||||||
############################## | ||||||
BulkDump (Dev) | ||||||
############################## | ||||||
|
||||||
| Author: Zhe Wang | ||||||
| Reviewer: Michael Stack, Jingyu Zhou | ||||||
| Audience: FDB developers, SREs and expert users. | ||||||
|
||||||
|
||||||
Overview | ||||||
======== | ||||||
In a FoundationDB (FDB) key-value cluster, every key-value pair is replicated across multiple storage servers. | ||||||
The BulkDump tool dumps all key-value pairs within the input range to files. | ||||||
Note that when the input range is large, the range is split into smaller ranges. | ||||||
The data of each subrange is dumped to a file at a single version, so all data within a file is at the same version. However, different files may have different versions. | ||||||
|
||||||
Input and output | ||||||
---------------- | ||||||
To start a bulkdump job, the user provides the range to dump and the path root where the data is written. | ||||||
The range can be any subrange within the user key space (i.e. "" ~ "\\xff"). | ||||||
Dumping the data of the system key space and special key space (i.e. "\\xff" ~ "\\xff\\xff\\xff") is not allowed. | ||||||
The path root can be either a blobstore URL (TBD) or a file system path. | ||||||
If the input range is large, it is split into smaller ranges. | ||||||
Each subrange is dumped at a version to a folder. In particular, the folder is organized as follows: | ||||||
|
||||||
1. <rootLocal>/<relativeFolder>/<dumpVersion>-manifest.txt (must have) | ||||||
Review comment: The format has some problem, as the rendered output looks like |
||||||
2. <rootLocal>/<relativeFolder>/<dumpVersion>-data.sst (omitted if the subrange is empty) | ||||||
3. <rootLocal>/<relativeFolder>/<dumpVersion>-sample.sst (omitted if data size is too small to have a sample) | ||||||
|
||||||
The <relativeFolder> is defined as <JobId>/<TaskId>/<BatchId>. | ||||||
At any time, an FDB cluster can have at most one bulkdump job. | ||||||
A bulkdump job is partitioned into tasks by range, according to the shard boundaries. | ||||||
When dumping the range of a task, the data is collected in batches. All key-value pairs of a batch are collected at the same version. | ||||||
Here, <JobId> is the unique ID of a job, <TaskId> is the unique ID of a task, and <BatchId> is the unique ID of a batch. | ||||||
All data files of the tasks of a job are located in the same job folder, which is named by the JobId. | ||||||
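The folder layout described above can be sketched as a small path-building helper. This is only an illustration, not the code from this PR: the names `BatchFolderPaths` and `buildBatchFolderPaths` are hypothetical, and the IDs are modeled as plain strings.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper type: the three files that may exist in one batch folder.
struct BatchFolderPaths {
    std::string manifest; // always present
    std::string data;     // omitted on disk when the subrange is empty
    std::string sample;   // omitted on disk when the data is too small to have a sample
};

// Builds <rootLocal>/<JobId>/<TaskId>/<BatchId>/<dumpVersion>-{manifest.txt,data.sst,sample.sst}.
// The function only computes names; it does not create any file.
BatchFolderPaths buildBatchFolderPaths(const std::string& rootLocal,
                                       const std::string& jobId,
                                       const std::string& taskId,
                                       const std::string& batchId,
                                       int64_t dumpVersion) {
    const std::string prefix =
        rootLocal + "/" + jobId + "/" + taskId + "/" + batchId + "/" + std::to_string(dumpVersion);
    return BatchFolderPaths{ prefix + "-manifest.txt", prefix + "-data.sst", prefix + "-sample.sst" };
}
```

For example, calling the helper with root `/tmp/dump`, job `job1`, task `task2`, batch `batch3`, and version 1000 yields the manifest path `/tmp/dump/job1/task2/batch3/1000-manifest.txt`.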
|
||||||
Each <relativeFolder> corresponds to exactly one subrange with exactly one manifest file. | ||||||
The manifest file includes all information necessary for loading the data from the folder into an FDB cluster. | ||||||
The manifest file content includes the following information: | ||||||
|
||||||
1. File paths (including the path root) | ||||||
2. Key Range of the dumped data in the folder | ||||||
3. Version when the data of the range is collected | ||||||
4. Checksum of the data | ||||||
5. Datasize of the data in bytes | ||||||
6. Byte sampling setting (when a cluster loads the folder and the setting does not match its own, the loading cluster does byte sampling by itself; otherwise, the loading cluster directly uses the sample file of the folder). | ||||||
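To make the manifest contents concrete, here is a minimal sketch of the six items as a plain struct, together with the decision in item 6. All type, field, and function names are hypothetical and are not taken from this PR. The two fields of `ByteSampleSetting` are placeholders, and treating a missing sample file as "do not reuse" is an assumption.

```cpp
#include <cstdint>
#include <string>

// Hypothetical byte-sampling parameters; the real setting may contain other fields.
struct ByteSampleSetting {
    int factor;
    int overhead;
    bool operator==(const ByteSampleSetting& other) const {
        return factor == other.factor && overhead == other.overhead;
    }
};

// Hypothetical in-memory view of one manifest file.
struct ManifestSketch {
    std::string manifestPath, dataPath, samplePath; // 1. file paths (empty string = file omitted)
    std::string beginKey, endKey;                   // 2. key range of the dumped data
    int64_t version;                                // 3. version at which the data was collected
    std::string checksum;                           // 4. checksum of the data
    int64_t bytes;                                  // 5. data size in bytes
    ByteSampleSetting sampleSetting;                // 6. byte sampling setting used by the dumper
};

// Returns true when the loading cluster can use the dumped sample file directly.
// Otherwise the loading cluster does byte sampling by itself.
bool canReuseSampleFile(const ManifestSketch& manifest, const ByteSampleSetting& loaderSetting) {
    return !manifest.samplePath.empty() && manifest.sampleSetting == loaderSetting;
}
```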
|
||||||
In the job folder, there is a global manifest file that lists all ranges and their corresponding manifest files. | ||||||
When loading a cluster, users can use this global manifest to rebuild the data. | ||||||
|
||||||
How to use? | ||||||
----------- | ||||||
Currently, low-level transactional APIs are provided to submit a job or clear a job. | ||||||
These operations are achieved by issuing transactions to update the bulkdump metadata. | ||||||
Submitting a job is achieved by writing the job metadata to the bulkdump metadata range of the job. | ||||||
When submitting a job, the API checks whether there is an ongoing bulkdump job. If so, it rejects the new job; otherwise, it accepts the job. | ||||||
Clearing a job is achieved by erasing the entire user range space of the bulkdump metadata range. When a job is cleared, all metadata is cleared and any ongoing task is stopped (with some latency). | ||||||
|
||||||
Currently, ManagementAPI provides the following interfaces for these operations: | ||||||
|
||||||
1. Submit a job: submitBulkDumpJob(BulkDumpState job); // For generating the input job metadata, see item 4. | ||||||
2. Clear a job: clearBulkDumpJob(); | ||||||
3. Enable the feature: setBulkDumpMode(int mode); // Set mode = 1 to enable; set mode = 0 to disable. | ||||||
4. Generate the BulkDump job metadata: newBulkDumpTaskLocalSST(KeyRange range, std::string remoteRoot); // More APIs to generate the metadata will be added as the functionality expands. | ||||||
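The submit and clear semantics described above (at most one job at a time, and clearing removes all metadata) can be modeled with a toy in-memory class. This is a sketch of the rule only. It does not call the real ManagementAPI, it does not use transactions, and the class name `BulkDumpJobSlotSketch` is hypothetical.

```cpp
#include <string>

// Toy model of the "at most one bulkdump job" rule. In the real system the
// check and the write happen inside a transaction on the bulkdump metadata.
class BulkDumpJobSlotSketch {
public:
    // Accepts the job only when no job is ongoing; returns false on rejection.
    bool submit(const std::string& jobId) {
        if (hasJob_) {
            return false;
        }
        jobId_ = jobId;
        hasJob_ = true;
        return true;
    }

    // Erases the job metadata so that a new job can be submitted.
    void clear() {
        jobId_.clear();
        hasJob_ = false;
    }

    bool hasJob() const { return hasJob_; }

private:
    bool hasJob_ = false;
    std::string jobId_;
};
```

In this model, a second `submit` call is rejected until `clear` has been called, which mirrors the behavior described for submitBulkDumpJob and clearBulkDumpJob.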
|
||||||
Mechanisms | ||||||
========== | ||||||
|
||||||
Workflow | ||||||
-------- | ||||||
- Users submit a range through a transaction, and this range is persisted to the bulkdump metadata (with the "\\xff/bulkDump/" prefix). | ||||||
- Bulkdump metadata is range-based. | ||||||
- DD (the data distributor) observes the range to dump by reading from the metadata. | ||||||
- DD partitions the range into smaller ranges according to the shard boundaries. | ||||||
- DD randomly chooses one storage server (SS) that owns the range as the agent to do the dump. DD holds an outstanding promise with this SS. The task assigned to an SS is stateless. | ||||||
- DD sends the range dump request to the storage server. DD spawns a dedicated actor waiting on the call. If any failure happens on the SS side, DD will be notified. | ||||||
- DD sends range dump requests within the max parallelism specified by the knob DD_BULKDUMP_PARALLELISM. | ||||||
- The SS receives the request and reads the data from local storage. If the range has been moved away or split, the SS replies with a failure to DD, and DD will retry the remaining range later. If the range is there, the SS reads the data and uploads it to external storage. This PR only implements dumping the data to local disk; a later PR will add dumping the data to S3. | ||||||
- When the SS completes, it marks this range as completed in the metadata. | ||||||
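The partitioning step in the workflow above can be sketched as follows. This is a simplified illustration, not the DD implementation: keys are modeled as `std::string`, ranges are half-open `[begin, end)` pairs, and the names `KeyRangeSketch` and `partitionByShardBoundaries` are hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

// Half-open key range [first, second).
using KeyRangeSketch = std::pair<std::string, std::string>;

// Splits `range` at every shard boundary that falls strictly inside it.
// `sortedBoundaries` must be sorted in ascending order.
std::vector<KeyRangeSketch> partitionByShardBoundaries(const KeyRangeSketch& range,
                                                       const std::vector<std::string>& sortedBoundaries) {
    std::vector<KeyRangeSketch> tasks;
    if (range.first >= range.second) {
        return tasks; // empty or inverted range: nothing to dump
    }
    std::string cursor = range.first;
    for (const std::string& boundary : sortedBoundaries) {
        if (boundary <= cursor) {
            continue; // boundary at or before the current position
        }
        if (boundary >= range.second) {
            break; // boundary at or past the end of the range
        }
        tasks.emplace_back(cursor, boundary);
        cursor = boundary;
    }
    tasks.emplace_back(cursor, range.second); // last piece up to the range end
    return tasks;
}
```

Each returned subrange corresponds to one task that DD would hand to a storage server owning that shard.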
|
||||||
Invariant | ||||||
--------- | ||||||
- At any time, an FDB cluster accepts at most one bulkdump job. | ||||||
- DD partitions the range into subranges according to the shard boundaries. The data of a subrange is guaranteed to be put into the same folder, which is named by the task ID. | ||||||
- Each data filename is prefixed with the version at which the SS read the data. | ||||||
- Each subrange always has one manifest file that records the metadata of the data, such as Range, Checksum (to be implemented later in a separate PR), and FilePath. | ||||||
|
||||||
- In an SS, we dump the files first and then write the metadata in the system key space. If either phase fails, DD will redo the range. Each time an SS writes the folder (locally or in BlobStore), the SS erases the folder first. | ||||||
- An SS handles at most one dump task at a time (the parallelism is limited by the knob SS_SERVE_BULKDUMP_PARALLELISM. In the current implementation, this knob is set to 1. However, we keep the flexibility of setting the bulkdump parallelism at an SS). | ||||||
- A subrange does not necessarily have a byteSample file and a data file; this depends on the data size. An SS may be assigned a range that is empty. | ||||||
- When a user issues a bulk dump task, the client checks whether there is an ongoing bulk load task. If so, the request is rejected. | ||||||
|
||||||
Failure handling | ||||||
---------------- | ||||||
- SS failure: DD will receive broken_promise. DD gives up working on the range at this time and reissues the request later until the range completes. | ||||||
|
||||||
- DD failure: It is possible that the same SS receives two requests to work on the same range. The SS uses a FlowLock to guarantee that it handles one request at a time, so there is no conflict. | ||||||
- S3 outage: Results in task failure. The failed task will be retried by DD. |
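The retry behavior described in the failure cases above can be sketched as a simple loop that re-queues failed tasks. This is an illustration under stated assumptions, not the DD actor code: `tryDump` is a hypothetical callback standing in for the request to a storage server, and the `maxAttempts` bound is an assumption added so that the sketch always terminates.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

// Repeatedly issues dump requests until every task succeeds or the attempt
// budget is used up. Returns the number of tasks that are still pending.
std::size_t dumpUntilComplete(std::deque<std::string> pending,
                              const std::function<bool(const std::string&)>& tryDump,
                              int maxAttempts) {
    int attempts = 0;
    while (!pending.empty() && attempts < maxAttempts) {
        const std::string task = pending.front();
        pending.pop_front();
        ++attempts;
        if (!tryDump(task)) {
            pending.push_back(task); // failed task: retry it later
        }
    }
    return pending.size();
}
```

Because the task assigned to an SS is stateless and the SS erases the folder before writing, reissuing the same task after a failure is safe in this model.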
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
/* | ||
* BulkDumping.cpp | ||
* | ||
* This source file is part of the FoundationDB open source project | ||
* | ||
* Copyright 2013-2024 Apple Inc. and the FoundationDB project authors | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
#include "fdbclient/BulkDumping.h" | ||
|
||
BulkDumpState newBulkDumpTaskLocalSST(const KeyRange& range, const std::string& remoteRoot) { | ||
return BulkDumpState( | ||
range, BulkDumpFileType::SST, BulkDumpTransportMethod::CP, BulkDumpExportMethod::File, remoteRoot); | ||
} |