RFC-0078: Zone-Aware Replica Reads #136

Open
wants to merge 13 commits into base: master
2 changes: 1 addition & 1 deletion README.md
@@ -44,6 +44,7 @@ Coding happens all the time and is encouraged. We just recognize there is a poin
| 64 | [SDK3 Field-Level Encryption](rfc/0064-sdk3-field-level-encryption.md) | David N. | ACCEPTED |
| 69 | [KV Error Map V2](rfc/0069-kv-error-map-v2.md) | Brett L. | ACCEPTED |
| 76 | [KV Subdoc Replica Reads](rfc/0076-subdoc-replica-reads.md) | Charles D. | ACCEPTED |
| 78 | [Zone-Aware Replica Reads](rfc/0078-zone-aware-replica-reads.md) | Sergey | ACCEPTED |

### Draft & Review RFCs

@@ -68,7 +69,6 @@ Coding happens all the time and is encouraged. We just recognize there is a poin
| 74 | Configuration Profiles [\[doc\]](https://docs.google.com/document/d/1LNCYgV2Eqymp3pGmA8WKPQOLSpcRyv0P7NpMYHVcUM0/) | Mike R. | DRAFT |
| 75 | [Faster Failover and Configuration Push](https://github.com/couchbaselabs/sdk-rfcs/pull/123) | Sergey | DRAFT |
| 77 | couchbase2 support [\[gdoc\]](https://docs.google.com/document/d/1BZe7m6cT5SqUO86W4si9gdNqBOGusYAOW0JLXoe47qU/edit) | Graham P. | DRAFT |
| 78 | [Zone-Aware Replica Reads](https://github.com/couchbaselabs/sdk-rfcs/pull/136) | Sergey | DRAFT |

### Identified RFCs

359 changes: 359 additions & 0 deletions rfc/0078-zone-aware-replica-reads.md
@@ -0,0 +1,359 @@
# Meta

| Field | Value |
|----------------|--------------------------|
| RFC Name | Zone-Aware Replica Reads |
| RFC ID | 78 |
| Start Date | 2024-04-25 |
| Owner | Sergey Avseyev |
| Current Status | DRAFT |
| Revision       | #2                       |

# Summary

This document describes changes that SDKs must implement to optimize the cost
of network traffic when the server is deployed across multiple availability
zones.

# Motivation

Existing SDKs do not take into account the server groups of the nodes in the
cluster. In particular, when performing `getAllReplicas` operations, the SDK
sends a `GET (0x00)` request to the active vBucket and `GET_REPLICA (0x83)` to
each of the replicas. This approach makes reading from replicas an expensive
operation if the nodes of the cluster are deployed in different availability
zones (AZ).

# Use Cases

## No Preference

This is the current behaviour, where the SDK does not take server groups into
account.

![No Preference](figures/0078-case-1-no-preference.svg)

## Selected Server Group

In this case, every read is sent to all nodes in the selected server group and
to no others. It is the cheapest solution, although it reduces the chances of
getting data from replicas (if the server groups are unbalanced).

![Selected Server Group](figures/0078-case-2-local-only.svg)

## Selected Server Group with Fallback

> Review comment (Contributor): @avsej this isn't in the ReadPreference enum
> below - which one is correct?

This strategy does not require extra SDK support, but the user must be able to
handle the case when no replica can be read because the selected group is
empty. The selected server group might be empty if the groups are not balanced
properly on the cluster, or if some nodes have been failed over. In this case
the previous strategy returns a `102 DocumentUnretrievable` error and refuses
to touch replicas from a non-local group.

![Selected Server Group or All Available](figures/0078-case-3-selected-server-group-or-all-available.svg)

In general, the user must expect that the document might not be available in
the local group, and fall back explicitly:

```
try {
  return collection.getAnyReplica(docId,
                                  options()
                                    .timeout("20ms")
                                    .readPreference(SELECTED_SERVER_GROUP))
} catch DocumentUnretrievableException | DocumentNotFoundException {
  return collection.get(docId)
  // or collection.getAnyReplica(docId)
  // or collection.getAllReplicas(docId)
}
```

> Review comment (Contributor) on the `catch` line: @avsej for my
> understanding - I thought the replica methods would never raise
> DocumentNotFoundException?

# Changes

## Updates in Configuration Parser

Each node in the `nodesExt` array of the cluster configuration will have a new
property, `serverGroup`, which contains the name of the group as it is shown in
the Admin UI. For example:

```jsonc
{
  // ...
  "vBucketServerMap": {
    "vBucketMap": [
      [ 0, 1 ]
      // ... other vBuckets
    ]
  },
  // ...
  "nodesExt": [
    {
      // ...
      "hostname": "172.17.0.2",
      "serverGroup": "group_1"
    },
    {
      // ...
      "hostname": "172.17.0.3",
      "serverGroup": "group_2"
    },
    {
      // ...
      "hostname": "172.17.0.4",
      "serverGroup": "group_2"
    },
    {
      // ...
      "hostname": "172.17.0.5",
      "serverGroup": "group_1"
    }
  ]
  // ...
}
```

Having this information, the SDK will be able to filter the list of node
indexes in `vBucketMap` to get the members of the local server group.

For example, suppose the SDK is configured to use `"group_1"` as its local
server group. If some key is mapped to vBucket `0`, the SDK can filter the
vector of indexes `[0, 1]` down to `[0, -1]` (using `-1` here for
illustration), because the server with index `1` belongs to `"group_2"`. As a
result, the SDK only needs to send a `GET` request to retrieve the active copy
of the document.
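
The filtering step can be sketched as follows. This is a minimal illustration
in Python, not SDK code: the helper names `server_groups_by_index` and
`filter_local` are hypothetical, the configuration is trimmed to the fields
shown in the example, and `-1` marks a filtered-out entry purely for
illustration. The sketch also assumes that the position of a node in `nodesExt`
matches the node index used in `vBucketMap`.

```python
import json

# Trimmed configuration mirroring the example (illustrative values only).
CONFIG = json.loads("""
{
  "vBucketServerMap": {"vBucketMap": [[0, 1]]},
  "nodesExt": [
    {"hostname": "172.17.0.2", "serverGroup": "group_1"},
    {"hostname": "172.17.0.3", "serverGroup": "group_2"},
    {"hostname": "172.17.0.4", "serverGroup": "group_2"},
    {"hostname": "172.17.0.5", "serverGroup": "group_1"}
  ]
}
""")


def server_groups_by_index(config):
    """Return the server group of each node, or None if it is not announced."""
    return [node.get("serverGroup") for node in config["nodesExt"]]


def filter_local(vbucket_row, groups, local_group):
    """Replace node indexes outside local_group with -1 (illustration only)."""
    return [
        idx if idx >= 0 and groups[idx] == local_group else -1
        for idx in vbucket_row
    ]


groups = server_groups_by_index(CONFIG)
row = CONFIG["vBucketServerMap"]["vBucketMap"][0]  # vBucket 0 -> [0, 1]
print(filter_local(row, groups, "group_1"))  # [0, -1]
print(filter_local(row, groups, "group_2"))  # [-1, 1]
```

With `"group_1"` selected, only index `0` (the active copy) survives, which
matches the example in the text.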

## Selecting Server Group for Connection

To allow the user to select a preferred server group, a new setter/property
must be introduced for `ClusterOptions`:

```
class ClusterOptions {
  // ...
  ClusterOptions preferredServerGroup(String serverGroupName);
  // ...
}
```

> Review comment (Contributor) on `preferredServerGroup`: Should we have a
> standard name for a corresponding connection string property?
>
> If so, would that nudge us towards calling this just serverGroup which is
> ~50% easier to type? :-)

Where `serverGroupName` matches the name of the group as it is shown in the
Admin UI of the server and in the cluster configuration JSON.

## Selecting Read Preference for Operations

A new enumeration should be defined to express the different strategies for
the replica read APIs:

```
enum ReadPreference {
  NO_PREFERENCE,
  SELECTED_SERVER_GROUP,
};
```

> Review comment (Contributor, @dnault, Aug 7, 2024) on
> `SELECTED_SERVER_GROUP`: Should this be PREFERRED_SERVER_GROUP (or
> SERVER_GROUP) to match the name of the cluster option?

Each of the operations should get a new option, with a default value of
`ReadPreference::NO_PREFERENCE`.

```
class GetAllReplicasOptions {
  // ...
  ReadPreference readPreference { ReadPreference::NO_PREFERENCE };
  // ...
};
```

```
class GetAnyReplicaOptions {
  // ...
  ReadPreference readPreference { ReadPreference::NO_PREFERENCE };
  // ...
};
```

```
class LookupInAllReplicasOptions {
  // ...
  ReadPreference readPreference { ReadPreference::NO_PREFERENCE };
  // ...
};
```

```
class LookupInAnyReplicaOptions {
  // ...
  ReadPreference readPreference { ReadPreference::NO_PREFERENCE };
  // ...
};
```

In all failure cases, when the SDK cannot handle the operation, it must return
`102 DocumentUnretrievable`, with a human-readable explanation of the details
whenever possible.
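
One way to picture how the option affects a `getAllReplicas` call is the sketch
below. It is an illustration under stated assumptions, not a reference
implementation: `plan_requests`, the `DocumentUnretrievable` class, and the
`(opcode, node_index)` tuple layout are hypothetical names; the first index in
a vBucket row is taken to be the active copy and the rest replicas; and the
opcodes are the `GET (0x00)` and `GET_REPLICA (0x83)` named in the Motivation
section.

```python
from enum import Enum


class ReadPreference(Enum):
    NO_PREFERENCE = "no_preference"
    SELECTED_SERVER_GROUP = "selected_server_group"


class DocumentUnretrievable(Exception):
    """Stand-in for the SDK's 102 DocumentUnretrievable error (hypothetical)."""


GET, GET_REPLICA = 0x00, 0x83


def plan_requests(vbucket_row, groups, preference, preferred_group=None):
    """Return (opcode, node_index) pairs to send for one key."""
    selected = preference is ReadPreference.SELECTED_SERVER_GROUP
    if selected and (preferred_group is None or preferred_group not in groups):
        raise DocumentUnretrievable(
            f"server group {preferred_group!r} is not present in the configuration")
    plan = []
    for position, idx in enumerate(vbucket_row):
        if idx < 0:
            continue  # no node is assigned to this copy
        if selected and groups[idx] != preferred_group:
            continue  # node is outside the selected server group
        # Position 0 is assumed to be the active copy, the rest are replicas.
        plan.append((GET if position == 0 else GET_REPLICA, idx))
    if not plan:
        raise DocumentUnretrievable("no node qualifies for this read")
    return plan


groups = ["group_1", "group_2", "group_2", "group_1"]
print(plan_requests([0, 1], groups, ReadPreference.NO_PREFERENCE))
print(plan_requests([0, 1], groups,
                    ReadPreference.SELECTED_SERVER_GROUP, "group_2"))
```

With `NO_PREFERENCE` both copies are queried; with `"group_2"` selected only
the replica on node `1` is. A row such as `[1, 2]` with `"group_1"` selected
leaves nothing to query, so the sketch raises `DocumentUnretrievable`.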

## Transactions

Inside a transaction closure, the replica read API takes a simplified form and
always reads from the local group. This implies that the SDK will always throw
a `102 DocumentUnretrievable` error for servers that do not expose the
`serverGroup` property in the cluster configuration.

```
class TransactionAttemptContext {
  // ...
  public TransactionGetResult
  getReplicaFromPreferredServerGroup(Collection collection, String id,
                                     TransactionGetReplicaOptions options);
  // ...
}
```

`TransactionGetReplicaOptions` here has no options beyond those it shares with
the `get()` method. The user cannot select a read preference.

```
class TransactionGetReplicaOptions {
  Transcoder transcoder;
}
```

The method might throw the following errors:

* `101 DocumentNotFound`, when the server returns KV code `0x01 ENOENT` for
  *all* requests.

* `102 DocumentUnretrievable`, when the SDK finds that there are no nodes in
  the local group, or there is no group available with the name selected in
  the connection options.
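
The error rules above could be resolved roughly as in the following sketch. It
is only an illustration: `resolve_replica_read`, the exception classes, and the
`(kv_status, value)` response pairs are hypothetical, and mapping statuses
other than success and `0x01 ENOENT` to `DocumentUnretrievable` is an
assumption that this RFC does not spell out.

```python
KV_SUCCESS, KV_ENOENT = 0x00, 0x01


class DocumentNotFound(Exception):
    """Stand-in for error 101 (hypothetical class)."""


class DocumentUnretrievable(Exception):
    """Stand-in for error 102 (hypothetical class)."""


def resolve_replica_read(local_nodes, responses):
    """Pick a result for a read restricted to the preferred server group.

    local_nodes: node indexes in the preferred group that hold a copy.
    responses: list of (kv_status, value) pairs, one per request sent.
    """
    if not local_nodes:
        raise DocumentUnretrievable("no copies in the preferred server group")
    for status, value in responses:
        if status == KV_SUCCESS:
            return value  # the first successful copy wins
    if responses and all(status == KV_ENOENT for status, _ in responses):
        raise DocumentNotFound("every queried node returned ENOENT")
    raise DocumentUnretrievable("no node returned the document")


print(resolve_replica_read([0], [(KV_ENOENT, None), (KV_SUCCESS, b"doc")]))
```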

# Caveats

The user must be aware of the following when dealing with Zone-Aware Replica
Reads:

- Although Couchbase Server has supported server groups for a long time, the
  SDK can see them in the configuration only since the 7.6.2 release, which
  makes Zone-Aware Replica Reads available only with the most recent releases.

- It is crucial to perform a rebalance after setting up server groups, as the
  data is physically moved only during this process.

- The user must consult the server documentation on setting up a balanced
  cluster to use the feature efficiently. The number of replicas and the number
  of groups have to be selected carefully.

# Open Questions

## Q1

> What should the SDK do if the local group does not have any replica
> configured for the key, but the user requests the `Selected-Server-Group`
> strategy? Consider the following cases:
>
> - number of replicas is 1, number of zones is 3. Neither the active nor the
>   replica is in the local group.
> - number of replicas is 1, number of zones is 3. Only the active node is in
>   the local group.
> - number of replicas is 1, number of zones is 3. Only the replica node is in
>   the local group.
>
> Shall we warn users when the locality restriction turns the request into just
> a GET, because only the active node is in the local group?

The SDK should just implement the two strategies described in the body of the
RFC. Future editions might add more strategies, such as retrying with the full
replica set or performing extra validations.

## Q2

> Do we need to implement a management API to create/retrieve server groups and
> change which group a node belongs to?

No, we do not need to implement a management API. If necessary, the REST API
should be used
(https://docs.couchbase.com/server/current/manage/manage-groups/manage-groups.html).
`cbdinocluster` has also supported server groups since version 0.0.43.

## Q3

> All currently supported server versions allow configuring server groups, but
> only 7.6.2 and later announce this information in the configuration
> ([MB-60835](https://issues.couchbase.com/browse/MB-60835)). How should the
> SDK behave with older servers? Do we need to fail fast or silently fall back
> to the current behavior?

The behavior of this feature with older servers (pre 7.6.2) is not defined. The
user can handle it by falling back to a get from the active node or to any of
the old APIs (see the fallback snippet in the RFC body).


## Q4

> Is there an explicit statement from the server team that a mixed-version
> cluster will not announce the feature until all nodes are migrated to 7.6.2?

No expectation of mixed mode support. The feature requires all nodes to be on
server 7.6.2+; otherwise the behavior is left undefined. In 7.7, when we have
capability flags, we could support mixed mode.

## Q5

> Moving nodes between groups generates a new configuration, but does not
> actually move data. The user might forget to trigger a rebalance, and the SDK
> will apply the `Selected-Server-Group` strategy while the operations are
> still expensive.

The server generates a notice in the UI warning the user that a rebalance is
necessary. The SDK should not detect whether the vBucket has moved or not, and
should just trust the vBucketMap in the configuration.

## Q6

> How should the SDK behave during rebalance? It cannot guarantee cheap traffic
> based on availability. Should we retry operations if the configuration says
> that a replica has moved to a different group?

Rebalance should be considered a transient state, and the SDK cannot make any
guarantees about the physical location of the vBuckets in the cluster. We just
wait until the cluster returns to a balanced state and meanwhile trust the
vBucketMap in the configuration.

## Q7

> If more than two server groups are configured, do we need to distinguish
> between non-local groups? Right now there are no weights associated with the
> groups, so we don't have enough information to reason about the cost of
> communication with non-local groups.

The SDK does not reason about any groups other than the one that is selected
as preferred.

## Q8

> Do we want to update `GetResult` and `LookupInResult` with a property that
> tells the user the name of the availability zone where the serving node was
> deployed when the response was generated?

Right now the structure of responses remains unchanged.

# Changelog
* April 25 2024 - Revision #1
  * Initial Draft

* June 10 2024 - Revision #2
  * Clarification that the SDK reads from all nodes that qualify for the server
    group selection.
  * Updated the title of the third strategy to highlight the handling of empty
    groups.

# Signoff

| Language | Team Member | Signoff Date | Revision |
|-------------|----------------|--------------|----------|
| .NET | | 2024-MM-DD | |
| Go | | 2024-MM-DD | |
| C/C++ | Sergey Avseyev | 2024-06-10 | #2 |
| Node.js | | 2024-MM-DD | |
| PHP | Sergey Avseyev | 2024-06-10 | #2 |
| Python | | 2024-MM-DD | |
| Ruby | Sergey Avseyev | 2024-06-10 | #2 |
| Java | | 2024-MM-DD | |
| Kotlin | | 2024-MM-DD | |
| Scala | | 2024-MM-DD | |