New file: hadoop-hdds/docs/content/design/s3-performance.md
---
title: Proposed persistent OM connection for S3 gateway
summary: Proposal to use per-request authentication and persistent connections between S3g and OM
date: 2020-11-09
jira: HDDS-4440
status: accepted
author: Márton Elek
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

# Overview

* Hadoop RPC authenticates calls at the beginning of the connection. All subsequent messages on the same connection use the existing, initialized authentication.
* S3 gateway sends the authentication information as a Hadoop RPC delegation token for **each request**.
* To authenticate each S3 REST request, Ozone creates a new `OzoneClient` for each HTTP request, which introduces problems with performance and error handling.
* This proposal suggests creating a new transport (**in addition** to the existing Hadoop RPC) for the OMClientProtocol where the requests can be authenticated per-request.

# Authentication with S3 gateway

AWS S3 request authentication is based on [signing the REST messages](https://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html). Each HTTP request must include an authentication header which contains the *access key id* and a signature created with the help of the *secret key*.

```
Authorization: AWS AWSAccessKeyId:Signature
```
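
For illustration only, here is a minimal sketch of how such a (version 2 style) signature is computed on the client side. The layout of the string-to-sign follows the AWS documentation linked above; the class name and the sample credentials are made up for this example:

```
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class S3V2SignatureSketch {

  // Base64(HMAC-SHA1(secretKey, stringToSign)), as defined by the AWS
  // signature version 2 scheme (illustrative helper, not Ozone code).
  static String sign(String secretKey, String stringToSign) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
    return Base64.getEncoder().encodeToString(
        mac.doFinal(stringToSign.getBytes("UTF-8")));
  }

  public static void main(String[] args) throws Exception {
    // HTTP verb, Content-MD5, Content-Type, Date and the canonicalized resource.
    String stringToSign =
        "GET\n\n\nTue, 27 Mar 2007 19:36:42 +0000\n/examplebucket/examplekey";
    System.out.println("Authorization: AWS AKIAIOSFODNN7EXAMPLE:"
        + sign("exampleSecretKey", stringToSign));
  }
}
```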

Ozone S3g is a REST gateway for Ozone which receives AWS compatible HTTP calls and forwards the requests to the Ozone Manager and Datanode services. Ozone S3g is **stateless**: it cannot check any authentication information stored on the Ozone Manager side, only the format of the signature.

For the authentication, S3g parses the HTTP header and sends all the relevant (and required) information to Ozone Manager, which can check the signature with the help of the stored *secret key*.

This is implemented with the help of the delegation token mechanism of Hadoop RPC. Hadoop RPC supports Kerberos and token based authentication where tokens can be customized. The Ozone specific implementation `OzoneTokenIdentifier` contains a `type` field which can be `DELEGATION_TOKEN` or `S3AUTHINFO`. The latter is used to authenticate the request based on the S3 REST header (signature + required information).

Both token and Kerberos based authentication are checked by Hadoop RPC during the connection initialization phase using the SASL standard. SASL defines the initial handshake of the connection where the server can check the authentication information with a challenge-response mechanism.

As a result, Ozone S3g has to create a new Hadoop RPC client for each HTTP request, as each request may have different AWS authentication information / signature. Ozone S3g creates a new `OzoneClient` for each request, which includes the creation of a Hadoop RPC client.

There are two problems with this approach:

1. **performance**: Creating a new `OzoneClient` requires creating a new connection, performing the SASL handshake and sending the initial discovery call to the OzoneManager to get the list of available services. It makes S3 performance very slow.
2. **error handling:** Creating a new `OzoneClient` for each request makes the propagation of error codes harder with CDI.

[CDI](http://cdi-spec.org/) is the specification of *Contexts and Dependency Injection* for Java. It can be used for both JavaEE and JavaSE and it's integrated with most web frameworks. Ozone S3g uses this specification to inject different services into REST handlers using the `@Inject` annotation.

`OzoneClient` is created by the `OzoneClientProducer`:

```
@RequestScoped
public class OzoneClientProducer {

  private OzoneClient client;

  @Inject
  private SignatureProcessor signatureParser;

  @Inject
  private OzoneConfiguration ozoneConfiguration;

  @Inject
  private Text omService;

  @Inject
  private String omServiceID;

  @Produces
  public OzoneClient createClient() throws OS3Exception, IOException {
    client = getClient(ozoneConfiguration);
    return client;
  }
  ...
}
```

As we can see here, the producer is *request* scoped (see the annotation on the class), which means that the `OzoneClient` bean will be created for each request. If the client cannot be created, a specific exception is thrown by the CDI framework (!) as one bean could not be injected with CDI. This error is different from the regular business exceptions, therefore the normal exception handler (`OS3ExceptionMapper`, which implements `javax.ws.rs.ext.ExceptionMapper` and can transform exceptions to HTTP error codes) does not apply. It can cause a strange 500 error instead of an authentication error.
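
For reference, the normal error path mentioned above follows the standard JAX-RS pattern shown below (a simplified sketch, not the actual `OS3ExceptionMapper` source; the `getHttpCode()` and `toXml()` accessors on `OS3Exception` are assumed for the example):

```
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import javax.ws.rs.ext.ExceptionMapper;
import javax.ws.rs.ext.Provider;

@Provider
public class SimplifiedOS3ExceptionMapper
    implements ExceptionMapper<OS3Exception> {

  // Translates a business exception thrown by a REST handler into an
  // AWS style XML error response with the proper HTTP status code.
  // Exceptions thrown during CDI bean creation never reach this mapper,
  // which is why a generic 500 error can be returned instead.
  @Override
  public Response toResponse(OS3Exception exception) {
    return Response.status(exception.getHttpCode())   // assumed accessor
        .entity(exception.toXml())                    // assumed accessor
        .type(MediaType.APPLICATION_XML)
        .build();
  }
}
```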

## Caching

Hadoop RPC has a very specific caching layer which is **not used** by Ozone S3g. This section describes the caching of Hadoop RPC, but it is safe to skip (it explains how the caching is bypassed).

As creating a new Hadoop RPC connection is an expensive operation, Hadoop RPC has an internal caching mechanism to cache clients and connections (!). This caching is hard-coded and based on static fields (it cannot be adjusted easily).

The Hadoop RPC client is usually created by `RPC.getProtocolProxy`. For example:

```
HelloWorldServicePB proxy = RPC.getProtocolProxy(
    HelloWorldServicePB.class,
    scmVersion,
    new InetSocketAddress(1234),
    UserGroupInformation.getCurrentUser(),
    configuration,
    new StandardSocketFactory(),
    Client.getRpcTimeout(configuration),
    retryPolicy).getProxy();
```

This code fragment creates a new client which can be used from application code, and multiple caches are involved in the client creation.

1. Protocol engines are cached by the `RPC.PROTOCOL_ENGINES` static field, but it's safe to assume that `ProtobufRpcEngine` is used for most of the current applications.

2. `ProtobufRpcEngine` has a static `ClientCache` field which caches the client instances with the `socketFactory` and `protocol` as the key.

3. Finally, the `Client.getConnection` method uses a cache to cache the connections:

```
connection = connections.computeIfAbsent(remoteId,
    id -> new Connection(id, serviceClass, removeMethod));
```

The key for the cache is the `remoteId` which includes all the configuration, connection parameters (like destination host) and `UserGroupInformation` (UGI).

The caching of the connections can cause very interesting cases. As an example, let's assume that a delegation token is invalidated with an RPC call. The workflow can be something like this:

1. create protocol proxy (with token authentication)
2. invalidate token (RPC call)
3. close protocol proxy (the connection may not be closed, depending on the cache)
4. create a new protocol proxy
5. if the connection is cached (same UGI), services can be used even if the token was invalidated earlier (as the token is only checked during the initialization of the connection)

Fortunately this behavior doesn't cause any problem in the case of Ozone and S3g. UGI instances (which are part of the cache key of the connection cache) are equal if (and only if) the underlying `Subject` is the same.
> **Reviewer (Contributor):** Just to clarify my understanding, we don't have an invalidate-token method for S3g, right? As this token is generated from the client auth header fields, the token is generated per request. We are sending the required info to validate the auth header with the secret which OM has.
>
> **Author:** You are right. This example is independent of S3g, it just explains how the cache works. I tried to describe the problem with the simple delegation token. (I can add this as a note.)


```
public class UserGroupInformation {

  ...

  @Override
  public boolean equals(Object o) {
    if (o == this) {
      return true;
    } else if (o == null || getClass() != o.getClass()) {
      return false;
    } else {
      return subject == ((UserGroupInformation) o).subject;
    }
  }
}
```

But the UGI initialization of Ozone always creates a new `Subject` instance for each request (even if the subject name is the same). In `OzoneClientProducer`:

```
UserGroupInformation remoteUser =
    UserGroupInformation.createRemoteUser(awsAccessId); // <-- new Subject is created

if (OzoneSecurityUtil.isSecurityEnabled(config)) {
  try {
    OzoneTokenIdentifier identifier = new OzoneTokenIdentifier();
    //setup identifier

    Token<OzoneTokenIdentifier> token = new Token(identifier.getBytes(),
        identifier.getSignature().getBytes(UTF_8),
        identifier.getKind(),
        omService);
    remoteUser.addToken(token);
    ....
```

**As a result, Hadoop RPC caching doesn't apply to Ozone S3g.** This is good news because it is secure, but bad news because the performance suffers.
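
A quick sketch demonstrates why the cache key never matches: two `createRemoteUser` calls with the same user name still wrap different `Subject` instances, so the connection cache sees two different users (small standalone example, not Ozone code):

```
import org.apache.hadoop.security.UserGroupInformation;

public class UgiCacheKeyExample {

  public static void main(String[] args) {
    // Each call wraps a brand new javax.security.auth.Subject.
    UserGroupInformation first = UserGroupInformation.createRemoteUser("s3user");
    UserGroupInformation second = UserGroupInformation.createRemoteUser("s3user");

    // Same user name, but equals() compares Subject references, so the
    // Hadoop RPC connection cache treats the two requests as different users.
    System.out.println(first.getUserName().equals(second.getUserName())); // true
    System.out.println(first.equals(second));                             // false
  }
}
```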

# Proposed change

We need an RPC mechanism between the Ozone S3g service and the Ozone Manager service which can support per-request authentication over persistent connections.

The Ozone Manager client already has a pluggable transport interface: `OmTransport` is a simple interface which can deliver `OMRequest` messages:

```
public interface OmTransport {

  /**
   * The main method to send out the request on the defined transport.
   */
  OMResponse submitRequest(OMRequest payload) throws IOException;
  ...
```

The proposal is to create a new **additional** transport, based on GRPC, which can do the per-request authentication. **Existing Hadoop clients will use the well-known Hadoop RPC client**, but S3g can start to use this specific transport to achieve better performance.

As this is nothing more than a transport, exactly the same messages (`OMRequest`) will be used; it is not a new RPC interface.

Only one modification is required in the RPC interface: a new, optional per-request `token` field should be introduced in `OMRequest`.
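
To illustrate the direction (this is only a sketch, not the proposed implementation), the new transport can be a thin layer that stamps the per-request authentication information on every outgoing message before handing it to the underlying GRPC channel. The `setS3Token` builder call below is hypothetical (it stands for the optional field proposed above), and a plain `Function` stands in for the real GRPC stub:

```
import java.io.IOException;
import java.util.function.Function;
import java.util.function.Supplier;

import org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos.OMRequest;
import org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos.OMResponse;

/**
 * Illustrative sketch of a GRPC based OmTransport: exactly the same OMRequest
 * messages, with the per-request S3 authentication information attached as
 * the proposed optional token field.
 */
public class GrpcOmTransportSketch {

  // Stand-in for the real GRPC stub / channel of the long-living connection.
  private final Function<OMRequest, OMResponse> grpcChannel;

  // Resolves the authentication header of the HTTP request being served
  // (for example from the request context); hypothetical in this sketch.
  private final Supplier<String> currentS3AuthInfo;

  public GrpcOmTransportSketch(Function<OMRequest, OMResponse> grpcChannel,
      Supplier<String> currentS3AuthInfo) {
    this.grpcChannel = grpcChannel;
    this.currentS3AuthInfo = currentS3AuthInfo;
  }

  /** Mirrors OmTransport#submitRequest: same payload plus the token field. */
  public OMResponse submitRequest(OMRequest payload) throws IOException {
    OMRequest authenticated = payload.toBuilder()
        .setS3Token(currentS3AuthInfo.get())  // hypothetical setter, see above
        .build();
    return grpcChannel.apply(authenticated);
  }
}
```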
> **Reviewer (Contributor):** To protect the token from being stolen, TLS must be enabled for GRPC. To set up TLS for GRPC, the client must get the CA cert via service discovery.
>
> **Author:** On enabling TLS: yes, 100% agree. On getting the CA cert via service discovery: correct me if I am wrong, but the CA certificate is also downloaded during the datanode initialization and can be used. But anyway: as I suggest using OzoneClient (with the new OM transport), the serviceDiscovery call will be executed as before, but instead of being called once for each S3 HTTP request, it will be called only once, when a connection is added to the connection pool.


A new GRPC service should be started in Ozone Manager which receives `OMRequest` messages; for each request, the Hadoop `UserGroupInformation` is set based on the new token field (after authentication).
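
A rough sketch of that server side flow (names like `validateSignature` and `OmRequestHandler` are placeholders for the existing OM logic, and `getS3Token` stands for the proposed field; only the use of `UserGroupInformation` is the point here):

```
import java.io.IOException;
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos.OMRequest;
import org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos.OMResponse;
import org.apache.hadoop.security.UserGroupInformation;

public class PerRequestAuthSketch {

  /** Placeholder for the existing, unchanged OM request dispatcher. */
  interface OmRequestHandler {
    OMResponse submit(OMRequest request) throws IOException;
  }

  OMResponse handle(OMRequest request, OmRequestHandler handler)
      throws IOException, InterruptedException {
    // 1. Validate the S3 signature with the secret stored in OM
    //    (the same check that S3AUTHINFO tokens trigger today).
    String accessKeyId = validateSignature(request.getS3Token()); // hypothetical

    // 2. Build the caller identity for this single request only.
    UserGroupInformation ugi =
        UserGroupInformation.createRemoteUser(accessKeyId);

    // 3. Run the unchanged OM request handling logic as that user.
    return ugi.doAs(
        (PrivilegedExceptionAction<OMResponse>) () -> handler.submit(request));
  }

  private String validateSignature(String s3Token) throws IOException {
    // Placeholder: re-create the signature from the string-to-sign with the
    // stored secret key and compare it with the received signature.
    throw new UnsupportedOperationException("sketch only");
  }
}
```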

`OzoneTokenIdentifier` can be simplified (after a deprecation period) by removing the S3 specific part, as it won't be required any more.

With this approach the `OzoneClient` instances can be cached on the S3g side (with persistent GRPC connections), as the authentication information is no longer part of the `OzoneClient`: it is added by the `OmTransport` implementation per request (in case of GRPC) or per connection (in case of Hadoop RPC).
> **Reviewer (Contributor):** Does that mean that with this approach we need only one OzoneClient instantiated, as the token is part of the OMRequest?
>
> **Reviewer (Contributor):** A few questions. This means S3g does not use the Hadoop RPC client, it will use a GrpcClient -- so how will the OM HA retry handling logic work, does all that logic need to be implemented in this new GrpcClient? And once the token is validated, will it go through the normal flow of execution in OzoneManager? A few minor questions, as I don't have much expertise on the GRPC implementation:
>
> 1. Will the GRPC server also have RPC handler threads where requests can be handled in parallel on OM?
>
> **Author:** *Does that mean with this approach we need one OzoneClient instantiated?* Yes.
>
> *And once the token is validated, will it go through the normal flow of execution in OzoneManager?* Yes, exactly the same logic.
>
> *Will the GRPC server also have RPC handler threads where requests can be handled in parallel on OM?* Yes. As far as I understood from the documentation, the new thread is created by the async IO handler thread. But if we need more freedom, we can always introduce a simple Executor. It's a very good question, though. Thinking about this, I have new ideas: by separating S3g and client side traffic we can monitor the two in different ways (for example compare the queue time of client and S3g calls, or set priorities). Not in this step, but something which will be possible.
>
> *How will the OM HA retry handling logic work, does all that logic need to be implemented in this new GrpcClient?* Yes, we need to take care of the retry logic. My initial suggestion is to create three persistent connections, one to each OM, and in case of a not-leader exception try to send the message on a different connection. For a normal client this can be expensive (always creating three different connections to the three OM HA instances), but in case of S3g and persistent connections it seems more effective, as the connections are persistent.
>
> **Reviewer (Contributor):** So it is like a new retry logic should be implemented for the GrpcClient.
>
> **Reviewer (Contributor):** Thank you for the detailed answers on the other points. The idea of separating S3g and client side metrics looks interesting, but in the end both are coming from end clients, so the additional metrics help to understand the calls from S3 and other interfaces better, but I am not sure in which scenarios this will help.
>
> **Author:** For example to understand / compare the cluster usage: which interface is used more, HCFS or S3? What is the source of small files, S3 or HCFS? This (different metrics) is not something to do right now but an interesting option to think about further if this approach is accepted.
>
> **Author:** *So it is like a new retry logic should be implemented for the GrpcClient.* Yes. And I argue that this logic can be optimized for servers (connections to different OM instances can be cached long-term) and not only optimized for clients (which open a second connection to the right OM only in case of leader election).


To make it easier to understand the implementation of this approach, let's compare the old (existing) and the new (proposed) approaches.

### Old approach

1. OzoneClientProducer creates a new OzoneClient for each HTTP request (`@RequestScoped`)

2. An OzoneToken is created based on the authentication header (signature, string-to-sign)

3. OzoneClient creates a new OM connection (Hadoop RPC) with the new OzoneToken

4. OM extracts the information from the OzoneToken and validates the signature

5. OzoneClient is injected into the REST endpoints

6. OzoneClient is used for any client calls

7. When the HTTP request has been handled, the OzoneClient is closed

![old approach](s3-performance-old.png)

### New approach

1. OzoneClientProducer creates the client with `@ApplicationScoped`. The connection to the OM is always open (clients can be pooled)
2. OM doesn't authenticate the connection; each request is authenticated one-by-one
3. OzoneClients are injected into the REST endpoints as before
4. For each new HTTP request the authorization header is extracted and added to the outgoing GRPC request as a token
5. OM authenticates each request and calls the same request handler as before
6. OzoneClient can stay open long-term

![new approach](s3-performance-new.png)
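
A minimal sketch of what the application scoped wiring could look like (names and the exact factory call are illustrative, assuming the client is created through `OzoneClientFactory`; the real change would also plug in the new GRPC transport):

```
import java.io.IOException;

import javax.annotation.PreDestroy;
import javax.enterprise.context.ApplicationScoped;
import javax.enterprise.inject.Produces;
import javax.inject.Inject;

import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;

@ApplicationScoped
public class CachedOzoneClientProducer {

  @Inject
  private OzoneConfiguration ozoneConfiguration;

  private OzoneClient client;

  // Created once per S3g instance; the transport underneath keeps persistent
  // connections to OM and attaches the auth token per request.
  @Produces
  public synchronized OzoneClient createClient() throws IOException {
    if (client == null) {
      client = OzoneClientFactory.getRpcClient(ozoneConfiguration);
    }
    return client;
  }

  @PreDestroy
  public void close() throws IOException {
    if (client != null) {
      client.close();
    }
  }
}
```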

# Possible alternatives

* It's possible to use a pure Hadoop RPC client instead of OzoneClient, which would make client creation slightly cheaper (the service discovery call is not required), but it would still require creating new connections for each request (and downloading data without OzoneClient may have its own challenges).
> **Reviewer (Contributor):** I did not understand this point: what is meant by "the service discovery call is not required", and why may not using OzoneClient have its own challenges? So, can we use one single client even with Hadoop RPC? More information on what is meant by this alternative would help.
>
> **Author:** When you use OzoneClient, an initial service discovery call is executed at the beginning, but after that you can use it easily. Both the OM client connection and the datanode connections are managed by OzoneClient. We could try to use a pure OM client call (without OzoneClient, just the Hadoop RPC client API) to avoid service discovery, but in that case we couldn't use OzoneClient. As OzoneClient contains the client logic for datanodes, without OzoneClient the OM client calls can be simpler, but in the end the solution can be more complex as we would have to use a lower level datanode client API, too.
>
> **Reviewer (Contributor):** Understood, so for the OM APIs we want to use a direct OM client instead of going via OzoneClient to save the service discovery, but for datanodes we still need OzoneClient. But how will the token authentication happen?
>
> **Author:** Yes, that is added as a possible alternative (using the pure OM client API + OzoneClient for datanodes), but I don't like it:
>
> 1. Authentication (as you asked) is not solved here; you still need a per-request connection.
> 2. Using the OM client + OzoneClient for datanodes is not straightforward and requires more work.
>
> Therefore, I suggested a different approach (use OzoneClient but create a new OmTransport implementation based on GRPC).
>
> **Reviewer (Contributor):** I still feel we can reuse the Hadoop RPC connection here. I don't remember exactly why we have to use a token user and do the token validation at OM. But another solution I would like to propose is to use a proxy user at S3g: instead of wrapping the token to create a new Hadoop RPC connection per call, S3g can validate the OM token similar to the way the DN validates the OM block token. After the validation succeeds, S3g can create a proxy user to connect to OM. If it is the same client, the proxy user can be reused.
>
> **Author:** Thanks for the comments @xiaoyuyao. We had an offline discussion and I'll try to summarize what we discussed:
>
> 1. We couldn't move the authentication from OM to S3g as it's based on asymmetric encryption (requests are signed with a private key and OM re-produces the signature with the stored secret). The private access key shouldn't be moved out of OM. Therefore, S3g couldn't do the authentication.
> 2. PROXY_USER itself is per-connection (AFAIK), so it doesn't fully solve the problem. We could do per-user Hadoop RPC connection caching, but despite the complexity it's not a full solution in an environment where we have thousands of users.
> 3. Also, the per-user Hadoop RPC connection caching on the S3g side has some difficulties. The current caching logic is hard coded in static fields, and to cache connections per user, S3g would have to trust the user information, which is not possible: a request with a signature in a valid format, but created with a fake access key, must not be able to re-use the cached and authenticated connection of that user.

* CDI error handling can be improved by using another dependency injection framework (e.g. Guice, which is already used by Recon) or some additional wrappers and manual connection creation, but it wouldn't solve the performance problem.


