diff --git a/keps/sig-storage/3314-csi-changed-block-tracking/README.md b/keps/sig-storage/3314-csi-changed-block-tracking/README.md
index 37dcc2ba52a..7de6c2c9019 100644
--- a/keps/sig-storage/3314-csi-changed-block-tracking/README.md
+++ b/keps/sig-storage/3314-csi-changed-block-tracking/README.md
@@ -58,7 +58,7 @@ If none of those approvers are still appropriate, then changes to that list
should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
-# KEP-3314: CSI Changed Block Tracking
+# KEP-3314: CSI Changed Block Tracking
This KEP proposes new CSI API that can be used to identify the list of changed
-blocks between pairs of CSI volume snapshots. CSI drivers can implement this API
+blocks between pairs of CSI volume snapshots. CSI drivers can implement this API
to expose their changed block tracking (CBT) services to enable efficient and
reliable differential backup of data stored in CSI volumes.
+Kubernetes backup applications directly use this API to stream changed
+block information, bypassing and posing no additional load on the Kubernetes
+API server.
+The mechanism that enables this direct access utilizes a proxy service sidecar
+to shield the CSI drivers from managing the individual Kubernetes clients.
+
## Motivation
-Changed block tracking (CBT) techniques have been used by commercial backup
-systems to efficiently back up large amount of data in block volumes. They
-identify block-level changes between two arbitrary pair of snapshots of the
-same block volume, and selectively back up what has changed between the two
-checkpoints. This type of differential backup approach is a lot more efficient
-than backing up the entire volume.
+Changed block tracking (CBT) techniques have been used by commercial backup
+systems to efficiently back up large amounts of data in block volumes. They
+identify block-level changes between an arbitrary pair of snapshots of the
+same block volume, and selectively back up what has changed between the two
+checkpoints. This type of differential backup approach is a lot more efficient
+than backing up the entire volume.
-This KEP proposes a design to extend the Kubernetes CSI framework to utilize
-these CBT features to bring efficient, cloud-native data protection to
+This KEP proposes a design to extend the Kubernetes CSI framework to utilize
+these CBT features to bring efficient, cloud-native data protection to
Kubernetes users.
### Goals
@@ -207,13 +226,13 @@ List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->
-* Provide a secure, idiomatic CSI API to efficiently identify changes between
-two arbitrary pairs of CSI volume snapshots of the same block volume.
-* Relay large amount of CBT data from the storage provider back to the user
-without exhausting cluster resources, nor introducing flaky resource spikes.
+* Provide a secure, idiomatic CSI API to efficiently identify
+the allocated blocks of a CSI volume snapshot, and
+the changed blocks between an arbitrary pair of CSI volume snapshots
+of the same block volume.
+* Relay large amounts of snapshot metadata from the storage provider without
+overloading the Kubernetes API server.
* This API is an optional component of the CSI framework.
-* Provide CBT support for both block as well as file system mode (backed by block
-volume) persistent volumes.
### Non-Goals
@@ -222,10 +241,20 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->
-* Retrieval of the actual changed data blocks is outside the scope of this KEP.
-The proposed API only returns the metadata that helps to identify the actual data
-blocks
-* Support of file changed list tracking for network file shares is out-of-scope.
+* Specify how data is written to the block volume in the first place.
+ > The volume could be attached to a pod with either `Block` or `Filesystem`
+ [volume modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-mode).
+* Provide an API to retrieve the data blocks of a snapshot.
+ > The **snapshot session** mechanism proposed here could conceivably
+ support a future companion **snapshot data retrieval service**, but
+ this is not in the scope of this KEP.
+ It is assumed that a snapshot's data blocks can be retrieved by creating a
+ PersistentVolume for the snapshot, launching a pod with this volume
+ attached in `Block` volume mode, and then reading the individual
+ blocks from the raw block device.
+
+* Support of file changed list tracking for network file shares is not
+addressed by this proposal.
## Proposal
@@ -238,7 +267,88 @@ The "Design Details" section below is for the real
nitty-gritty.
-->
-### User Stories (Optional)
+The proposal extends the CSI specification with a new
+[SnapshotMetadata](#the-snapshotmetadata-service-api)
+gRPC service that is used
+to retrieve metadata on the allocated blocks of a single snapshot,
+or the changed blocks between a pair of snapshots of the same block volume.
+A number of custom resources are proposed to enable a Kubernetes backup application
+to create a **snapshot session** with which to ***directly connect***
+to such a service.
+This direct connection results in a minimal load on the Kubernetes API server,
+unrelated to the amount of metadata transferred
+or the sizes of the volumes and snapshots involved.
+
+
+
+A Kubernetes backup application establishes a snapshot session by
+creating an instance of a [SnapshotSessionRequest](#snapshotsessionrequest)
+custom resource, specifying a set of VolumeSnapshot objects in some Namespace.
+The application must poll the CR until it reaches a terminal state of
+`Ready` or `Failed`.
+
+The creation and use of the snapshot session is illustrated by the following figure:
+
+
+
+The [SnapshotSessionRequest](#snapshotsessionrequest) CR
+will validate its creator's authority to create the CR and to access the set
+of VolumeSnapshots. It will then
+search for a [SnapshotMetadata](#the-snapshotmetadata-service-api) service
+in the CSI driver for these VolumeSnapshots.
+On success, the TCP endpoint and CA certificate of the
+[SnapshotMetadata](#the-snapshotmetadata-service-api)
+service and an opaque **snapshot session token** are set in its result.
+
+The backup application will establish trust with the specified CA, and
+then use the specified TCP endpoint to directly make TLS gRPC calls to the CSI
+[SnapshotMetadata](#the-snapshotmetadata-service-api) service.
+All RPC calls in the service require that the snapshot session token and the
+names of the Kubernetes VolumeSnapshot objects involved be specified,
+along with other optional parameters.
+The RPC calls each return a gRPC stream through which the metadata can be recovered.
+
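The trust step described above can be sketched with the Python standard library alone. This is only an illustration of trusting a single CA: a real client would hand the same PEM data to its gRPC library's channel credentials. The function name `client_tls_context` and the assumption that the CA certificate arrives as PEM text are hypothetical, since the CR fields are not yet defined.

```python
import ssl


def client_tls_context(ca_pem: str) -> ssl.SSLContext:
    """Build a client-side TLS context that trusts only the given CA.

    ca_pem is assumed to be the PEM text of the CA certificate taken from
    the SnapshotSessionRequest result (the field name is not defined here).
    """
    # Supplying cadata means the system trust store is not loaded, so the
    # session's CA is the only trust anchor.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cadata=ca_pem)
    # Hostname checking and certificate verification are already the
    # defaults for SERVER_AUTH; restate them so the intent is explicit.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context
```

Data that does not contain a PEM certificate is rejected with `ssl.SSLError` when the context is built, so a malformed CA value fails before any connection is attempted.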
+The CSI driver is not involved in the setup or management of the snapshot session.
+The TCP endpoint returned is actually directed to an
+[external-snapshot-session sidecar](#the-external-snapshot-session-sidecar)
+that communicates over a private UNIX domain socket with the CSI driver's
+implementation of the [SnapshotMetadata](#the-snapshotmetadata-service-api)
+service.
+The sidecar is responsible for validating the opaque snapshot session token
+and the parameters of the RPC calls.
+It forwards the RPC call to the CSI driver service over the UNIX domain socket,
+translating the Kubernetes object names into SP object names in the process,
+and then re-streams the results back to its client.
+The CSI driver provided service only focuses on the generation of the metadata
+requested.
+
+The [SnapshotSessionRequest](#snapshotsessionrequest) CR is animated by a
+[Snapshot Session Manager](#the-snapshot-session-manager),
+which provides a validating webhook
+for authorization and a controller to set up the snapshot session
+and manage the lifecycle of the CR, including deleting it when it expires.
+The manager will handle snapshot sessions involving snapshots from all CSI drivers.
+
+Additional simple CRs that do not involve a controller are also used:
+the [SnapshotServiceConfiguration](#snapshotserviceconfiguration) CR is used to advertise the
+existence of an [external-snapshot-session sidecar](#the-external-snapshot-session-sidecar)
+in a CSI driver,
+and the [SnapshotSessionData](#snapshotsessiondata) CR is created for each
+active snapshot session and used for validation.
+
+[Kubernetes Role-Based Access Control](https://kubernetes.io/docs/reference/access-authn-authz/rbac/)
+is used to secure access to the custom resources.
+It restricts a backup application's visibility to
+[SnapshotSessionRequest](#snapshotsessionrequest) CR
+objects to the Namespace containing VolumeSnapshots that the application
+is authorized to access.
+It also provides the ability to isolate CSI
+drivers from each other by limiting the visibility of
+[SnapshotServiceConfiguration](#snapshotserviceconfiguration) and
+[SnapshotSessionData](#snapshotsessiondata) CRs to the
+individual driver Namespace.
+
+### User Stories
-#### Story 1
-#### Story 2
+#### Full snapshot backup
+
+@TODO Prasad
+
+#### Incremental snapshot backup
+
+@TODO Prasad
-### Notes/Constraints/Caveats (Optional)
+### Notes/Constraints/Caveats
+- This proposal requires a backup application to directly connect to a CSI
+[SnapshotMetadata](#the-snapshotmetadata-service-api)
+service.
+This is necessary to avoid placing a load on the Kubernetes API server
+that would be proportional to the number of allocated blocks in a volume
+snapshot.
+
+- The CSI
+[SnapshotMetadata](#the-snapshotmetadata-service-api)
+service RPC calls allow an application to ***restart*** an interrupted
+stream from where it previously failed
+by reissuing the RPC call with a starting byte offset.
+
+- The CSI
+[SnapshotMetadata](#the-snapshotmetadata-service-api)
+service permits metadata to be returned in either an ***extent***
+or a ***block*** based format, at the discretion of the CSI driver.
+A portable backup application is expected to handle both such formats.
+
+- All the volumes in a given snapshot session must have the same CSI provisioner.
+ The backup application must use separate snapshot sessions for volumes
+ from different CSI provisioners.
+
+- A snapshot session has a finite lifetime and will expire eventually.
+
+- The CSI driver's [Snapshot Session Service](#the-sp-snapshot-session-service)
+must be capable of serving metadata on a VolumeSnapshot
+concurrently with the backup application's use of a PersistentVolume
+created on that same VolumeSnapshot.
+This is because a backup application would likely mount the PersistentVolume
+in `Block` mode in a Pod in order to read and archive the raw snapshot data blocks,
+and this read/archive loop will be driven by the stream of snapshot block metadata.
+
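The restart behaviour noted above might look roughly like the following on the client side. This is a minimal sketch under stated assumptions: `open_stream` is a hypothetical stand-in for issuing `GetAllocated` or `GetDelta` with a starting byte offset, entries are assumed to arrive in ascending offset order, and a dropped stream is assumed to surface as `ConnectionError`. Both the extent and the block formats use the same `(byte_offset, size_bytes)` shape, so the loop does not need to distinguish them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class BlockMetadata:
    """Mirrors the BlockMetadata message: one extent or one fixed-size block."""
    byte_offset: int
    size_bytes: int


def stream_with_restart(
    open_stream: Callable[[int], Iterable[BlockMetadata]],
    max_attempts: int = 3,
) -> Iterator[BlockMetadata]:
    """Yield metadata entries, reopening the stream after a failure.

    open_stream(starting_offset) is a hypothetical callable that issues the
    RPC and iterates over the entries at or beyond starting_offset.
    """
    next_offset = 0
    attempts = 0
    while True:
        try:
            for entry in open_stream(next_offset):
                yield entry
                # Resume just past the last entry that was fully received.
                next_offset = entry.byte_offset + entry.size_bytes
            return
        except ConnectionError:
            attempts += 1
            if attempts >= max_attempts:
                raise
```

Because the resume point advances only after an entry has been handed to the caller, a reissued call never skips metadata that the backup loop has not yet seen.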
### Risks and Mitigations
+The main vulnerabilities of this proposal are:
+- That the snapshot session indirectly gives a principal that has the
+ authority to create a
+ [SnapshotSessionRequest](#snapshotsessionrequest) CR
+ access to data in otherwise inaccessible VolumeSnapshots, simply by naming
+ them in the CR.
+- That the opaque snapshot session token returned by a
+ [SnapshotSessionRequest](#snapshotsessionrequest) CR
+ could be spoofed by a malicious actor.
+
+This proposal relies on the following Kubernetes security mechanisms to mitigate
+the issues above:
+- A validating webhook is used during the creation of the
+ [SnapshotSessionRequest](#snapshotsessionrequest) CR to
+ ensure that the invoker has access rights to VolumeSnapshot
+ objects in the Namespace of the CR.
+
+- [Role-Based Access Controls](https://kubernetes.io/docs/reference/access-authn-authz/rbac/)
+ to restrict access to the
+ [SnapshotSessionRequest](#snapshotsessionrequest), the
+ [SnapshotSessionData](#snapshotsessiondata) and the
+ [SnapshotServiceConfiguration](#snapshotserviceconfiguration) CRs.
+
+The backup application obtains a
+[SnapshotMetadata](#the-snapshotmetadata-service-api) service
+server's CA certificate and endpoint address from the
+[SnapshotSessionRequest](#snapshotsessionrequest) CR.
+The CA certificate and the endpoint were sourced from the
+[SnapshotServiceConfiguration](#snapshotserviceconfiguration) CR
+created by the CSI driver; they contain
+public information and are not particularly vulnerable.
+
+The direct gRPC call made by the backup application client will encrypt
+all data exchanged with the server.
+
+The gRPC client is required to establish trust with the server's CA.
+Mutual authentication, however, is not performed, as to do so would require the
+server to trust the certificate authorities used by its clients and no
+standardized mechanism exists for this purpose.
+
+Instead, the backup application passes the opaque snapshot session token
+returned in the [SnapshotSessionRequest](#snapshotsessionrequest) CR
+as a parameter in each RPC.
+This token is actually the name of a
+[SnapshotSessionData](#snapshotsessiondata) CR created by the
+[Snapshot Session Manager](#the-snapshot-session-manager)
+in the Namespace of the CSI driver, and only accessible to
+the manager and the server.
+The server in this case is the
+[external-snapshot-session-sidecar](#the-external-snapshot-session-sidecar),
+and it will validate the caller's use of the RPC by retrieving the
+CR with the name of the token.
+
+To mitigate the possibility that the token is spoofed:
+- The session token is composed of a long random string of valid
+ Kubernetes object name characters.
+- The visibility of [SnapshotSessionData](#snapshotsessiondata) CRs
+ is restricted to the [Snapshot Session Manager](#the-snapshot-session-manager)
+ and the server in the CSI driver Namespace.
+ A CSI driver installed in a private Namespace would only be able to
+ view the [SnapshotSessionData](#snapshotsessiondata) CRs in
+ its own Namespace.
+- A [SnapshotSessionData](#snapshotsessiondata) CR has a finite
+ lifespan and will be rejected by the sidecar,
+ and eventually deleted by the manager, when its
+ expiry time has passed.
+
+The proposal defines the following ClusterRoles
+to implement the necessary security as illustrated in the following figure:
+
+>@TODO
+> - Decide on namespaced vs. global SnapshotServiceConfiguration. Global will
+> require a new role.
+
+
+
+
+- The **SnapshotSessionClient** ClusterRole should be used in a
+ ClusterRoleBinding to grant a backup application's ServiceAccount
+ global access to CREATE, GET, DELETE or LIST
+ [SnapshotSessionRequest](#snapshotsessionrequest) CRs
+ in any namespace and to GET VolumeSnapshot
+ objects in any namespace.
+- The **SnapshotSessionService** ClusterRole should be used in a
+ RoleBinding to grant the ServiceAccount used by the
+ [external-snapshot-session-sidecar](#the-external-snapshot-session-sidecar)
+ of the CSI driver all access in the CSI driver Namespace only.
+- The **SnapshotSessionManager** ClusterRole is used in a
+ ClusterRoleBinding to grant the
+ [Snapshot Session Manager](#the-snapshot-session-manager)
+ the permissions it needs to access all the
+ [custom resources defined by this proposal](#custom-resources).
+
+It is recommended that the security design be reviewed by SIG Security.
## Design Details
@@ -283,6 +525,110 @@ required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
+
+### The SnapshotMetadata Service API
+
+The CSI specification will be extended with the addition of the following new, optional
+**SnapshotMetadata** [gRPC service](https://grpc.io/docs/what-is-grpc/core-concepts/#service-definition).
+The [external-snapshot-session sidecar](#the-external-snapshot-session-sidecar)
+and the [SP snapshot-session-service](#the-sp-snapshot-session-service) plugins
+must both implement this service.
+
+The service is defined as follows, and will be described in the sub-sections below:
+```
+service SnapshotMetadata {
+ rpc GetAllocated(GetAllocatedRequest)
+ returns (stream GetAllocatedResponse) {}
+ rpc GetDelta(GetDeltaRequest)
+ returns (stream GetDeltaResponse) {}
+}
+
+enum BlockMetadataType {
+ FIXED_LENGTH=0;
+ VARIABLE_LENGTH=1;
+}
+
+message BlockMetadata {
+ uint64 byte_offset = 1;
+ uint64 size_bytes = 2;
+}
+
+message GetAllocatedRequest {
+ string session_token = 1;
+ string volume_id = 2;
+ string snapshot = 3;
+ uint64 starting_offset = 4;
+ uint32 max_results = 5;
+}
+
+
+message GetAllocatedResponse {
+ BlockMetadataType block_metadata_type = 1;
+ uint64 volume_size_bytes = 2;
+ repeated BlockMetadata block_metadata = 3;
+}
+
+message GetDeltaRequest {
+ string session_token = 1;
+ string volume_id = 2;
+ string base_snapshot = 3;
+ string target_snapshot = 4;
+ uint64 starting_byte_offset = 5;
+ uint32 max_results = 6;
+}
+
+message GetDeltaResponse {
+ BlockMetadataType block_metadata_type = 1;
+ uint64 volume_size_bytes = 2;
+ repeated BlockMetadata block_metadata = 3;
+}
+```
+
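Since a response may carry either `FIXED_LENGTH` or `VARIABLE_LENGTH` entries, a backup application may want to normalise both into contiguous read ranges before reading data. The following is one possible approach, not part of the proposed API: the helper name `coalesce` is hypothetical, and entries are assumed not to overlap.

```python
from typing import Iterable, List, Tuple

Extent = Tuple[int, int]  # (byte_offset, size_bytes), as in BlockMetadata


def coalesce(entries: Iterable[Extent]) -> List[Extent]:
    """Merge entries that touch end to start into larger extents.

    Entries are assumed not to overlap; they are sorted by offset first so
    the input order does not matter.
    """
    merged: List[Extent] = []
    for offset, size in sorted(entries):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This entry starts exactly where the previous one ends.
            last_offset, last_size = merged[-1]
            merged[-1] = (last_offset, last_size + size)
        else:
            merged.append((offset, size))
    return merged
```

For example, two adjacent 4096-byte blocks at offsets 0 and 4096 become a single 8192-byte extent, while a block at offset 16384 stays separate.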
+
+### Kubernetes Components
+The following Kubernetes components are involved at runtime:
+
+- A community provided
+ [Snapshot Session Manager](#the-snapshot-session-manager)
+ that uses a Kubernetes CustomResource (CR) based mechanism to
+ establish a **snapshot session** that provides a backup
+ application with an endpoint for secure TLS gRPC to a
+ [SnapshotMetadata](#the-snapshotmetadata-service-api) service.
+ The manager is independently deployed and serves all
+ CSI drivers that provide a
+ [SnapshotMetadata](#the-snapshotmetadata-service-api) service.
+- A CSI driver provided implementation of the
+ [SnapshotMetadata](#the-snapshotmetadata-service-api) service
+ that is accessible over a UNIX domain transport.
+- A [community provided sidecar](#the-external-snapshot-session-sidecar)
+ that implements the service side of the snapshot session protocol
+ and **proxies** TCP TLS gRPC requests from authorized client applications to the
+ CSI driver's service over the UNIX domain transport.
+
+### Custom Resources
+
+@TODO Prasad to provide description and definitions of the CRs
+#### SnapshotSessionRequest
+
+#### SnapshotServiceConfiguration
+
+@TODO NOT NAMESPACED
+
+#### SnapshotSessionData
+
+@TODO NEED TO DECIDE WHETHER TO EMBED SP IDs OR NOT
+
+### The Snapshot Session Manager
+
+@TODO CARL
+
+### The External Snapshot Session Sidecar
+
+@TODO CARL
+### The SP Snapshot Session Service
+
+@TODO ?
+
### Test Plan
-All unit tests will be included in the out-of-tree CSI repositories, with no
+All unit tests will be included in the out-of-tree CSI repositories, with no
impact on the test coverage of the core packages.
##### Integration tests
@@ -367,13 +713,13 @@ Test setup:
* A sample client to initiate the CBT session and the subsequent CBT GRPC
requests.
* A mock backend snapshot service generates mock responses with CBT payloads to
-be returned to the client.
+be returned to the client.
Test scenarios:
* Verify the CBT request/response flow from the client to the CSI driver.
* Verify that the CBT controller can discover the CBT-enabled CSI driver.
-* Verify the mutating webhook's ability to ensure authorized access to the
+* Verify the mutating webhook's ability to ensure authorized access to the
volume snapshots.
* Token management: TBD
@@ -513,7 +859,7 @@ well as the [existing list] of feature gates.
- [x] Other
- Describe the mechanism: The new components will be implemented as part of the
out-of-tree CSI framework. Storage providers can embed the CBT sidecar component
-in their CSI drivers, if they choose to support this feature. Users will also
+in their CSI drivers, if they choose to support this feature. Users will also
need to install the CBT controller and mutating webhook.
- Will enabling / disabling the feature require downtime of the control
plane? No.
@@ -547,7 +893,7 @@ cluster and remove the CBT sidecar from the CSI driver.
###### What happens if we reenable the feature if it was previously rolled back?
-No effects as all custom resources would have been removed when the CBT
+No effects as all custom resources would have been removed when the CBT
controller was previously uninstalled.
###### Are there any tests for feature enablement/disablement?
@@ -635,10 +981,10 @@ Recall that end users cannot usually observe component logs or access metrics.
-->
- [ ] Events
- - Event Reason:
+ - Event Reason:
- [ ] API .status
- - Condition name:
- - Other field:
+ - Condition name:
+ - Other field:
- [ ] Other (treat as last resort)
- Details:
@@ -850,18 +1196,18 @@ information to express the idea and why it was not acceptable.
-->
The aggregated API server solution described in [#3367][0] was deemed unsuitable
-because of the potentially large amount of CBT payloads that will be proxied
+because of the potentially large amount of CBT payloads that will be proxied
through the K8s API server. Further discussion can be found in this [thread][1].
-An approach based on using volume populator to store the CBT payloads on-disk,
-instead of sending them over the network was also considered. But the amount of
-pod creation/deletion churns and latency incurred made this solution
+An approach based on using volume populator to store the CBT payloads on-disk,
+instead of sending them over the network was also considered. But the amount of
+pod creation/deletion churns and latency incurred made this solution
inappropriate.
-The previous design which involved generating and returning a RESTful callback
-endpoint to the caller, to serve CBT payloads was superceded by the aggregation
-extension mechanism as described in [#3367][0], due to the requirement for more
-structured request and response payloads.
+The previous design which involved generating and returning a RESTful callback
+endpoint to the caller, to serve CBT payloads was superseded by the aggregation
+extension mechanism as described in [#3367][0], due to the requirement for more
+structured request and response payloads.
## Infrastructure Needed (Optional)
diff --git a/keps/sig-storage/3314-csi-changed-block-tracking/roles.drawio.svg b/keps/sig-storage/3314-csi-changed-block-tracking/roles.drawio.svg
new file mode 100644
index 00000000000..80bf2e8e8ee
--- /dev/null
+++ b/keps/sig-storage/3314-csi-changed-block-tracking/roles.drawio.svg
@@ -0,0 +1,538 @@
+
\ No newline at end of file
diff --git a/keps/sig-storage/3314-csi-changed-block-tracking/session.drawio.svg b/keps/sig-storage/3314-csi-changed-block-tracking/session.drawio.svg
new file mode 100644
index 00000000000..31a853cc30c
--- /dev/null
+++ b/keps/sig-storage/3314-csi-changed-block-tracking/session.drawio.svg
@@ -0,0 +1,1501 @@
+
\ No newline at end of file