Skip to content

Conversation

@xBis7
Copy link
Contributor

@xBis7 xBis7 commented Oct 11, 2022

What changes were proposed in this pull request?

After discussions and testing, submitting this patch to enable GRPC server by default between S3G and OM, to improve performance.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7309

How was this patch tested?

This patch was tested manually.

@kerneltime
Copy link
Contributor

I would like to see OM peak performance numbers with Hadoop RPC vs grpc. I understand that per client, the performance of grpc is better, but we need to check the server side of the performance as well. Should be a quick test to run.

@kerneltime
Copy link
Contributor

cc @duongkame

@michelsumbul
Copy link

@kerneltime As we said at the apacheCon, we will look at having this number when we have a bit more time, however this should not be a prerequisite to enable that feature by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr

@duongkame
Copy link
Contributor

e by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr

+1. I also think that the full gRpc scalability test is not required to enable the protocol on the server-side by default.

That may be needed when we indeed start using gRpc as the default protocol on the client sides, like S3G or OFS.

@kerneltime
Copy link
Contributor

e by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr

+1. I also think that the full gRpc scalability test is not required to enable the protocol on the server-side by default.

That may be needed when we indeed start using gRpc as the default protocol on the client sides, like S3G or OFS.

Makes sense. I think we should rename the Jira to indicate that this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.

@xBis7 xBis7 changed the title HDDS-7309. Enable by default GRPC between S3G and OM HDDS-7309. Enable by default GRPC between S3G and OM, on OM server Oct 14, 2022
@adoroszlai
Copy link
Contributor

this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.

Maybe I'm missing something, but it seems to me this patch does switch S3 Gateway to use gRPC for talking to OM by default.

OmTransportFactory#createFactory creates the actual factory as an instance of the (first) class specified in the service descriptor, which is GrpcOmTransportFactory for S3 Gateway:

if the value configured for ozone.om.transport.class does not match the value of OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT, which was and still is:

public static final String OZONE_OM_TRANSPORT_CLASS_DEFAULT =
"org.apache.hadoop.ozone.om.protocolPB"
+ ".Hadoop3OmTransportFactory";

Prior to this patch default value of ozone.om.transport.class and OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT did match, so S3 Gateway used the default transport despite its service descriptor. Since the value in ozone-default.xml is being changed, this condition will now be true by default:

// if configured transport class is different than the default
// OmTransportFactory (Hadoop3OmTransportFactory), then
// check service loader for transport class and instantiate it
if (conf
.get(OZONE_OM_TRANSPORT_CLASS,
OZONE_OM_TRANSPORT_CLASS_DEFAULT) !=
OZONE_OM_TRANSPORT_CLASS_DEFAULT) {

So GrpcOmTransportFactory will be created by default for S3 Gateway, i.e. it will switch to the gRPC client.

From ozone compose environment with default configs:

s3g_1       | 2022-10-14 19:32:35,747 [qtp661119548-86] INFO protocolPB.OmTransportFactory: ZZZ created custom OmTransportFactory: org.apache.hadoop.ozone.om.protocolPB.GrpcOmTransportFactory@77d1c0db
s3g_1       | 2022-10-14 19:32:35,920 [qtp661119548-86] INFO protocolPB.GrpcOmTransport: GrpcOmTransport: started

The first log message is one I added in OmTransportFactory.

Also, enabling it only in OM, but not in S3 Gateway, would not have the following benefit:

Enabling it by default will allow to have more people to try it

@xBis7
Copy link
Contributor Author

xBis7 commented Oct 17, 2022

@adoroszlai is right. We are enabling gRPC for S3G here.

@xBis7 xBis7 changed the title HDDS-7309. Enable by default GRPC between S3G and OM, on OM server HDDS-7309. Enable by default GRPC between S3G and OM Oct 17, 2022
@kerneltime
Copy link
Contributor

this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.

Maybe I'm missing something, but it seems to me this patch does switch S3 Gateway to use gRPC for talking to OM by default.

OmTransportFactory#createFactory creates the actual factory as an instance of the (first) class specified in the service descriptor, which is GrpcOmTransportFactory for S3 Gateway:

if the value configured for ozone.om.transport.class does not match the value of OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT, which was and still is:

public static final String OZONE_OM_TRANSPORT_CLASS_DEFAULT =
"org.apache.hadoop.ozone.om.protocolPB"
+ ".Hadoop3OmTransportFactory";

Prior to this patch default value of ozone.om.transport.class and OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT did match, so S3 Gateway used the default transport despite its service descriptor. Since the value in ozone-default.xml is being changed, this condition will now be true by default:

// if configured transport class is different than the default
// OmTransportFactory (Hadoop3OmTransportFactory), then
// check service loader for transport class and instantiate it
if (conf
.get(OZONE_OM_TRANSPORT_CLASS,
OZONE_OM_TRANSPORT_CLASS_DEFAULT) !=
OZONE_OM_TRANSPORT_CLASS_DEFAULT) {

So GrpcOmTransportFactory will be created by default for S3 Gateway, i.e. it will switch to the gRPC client.

From ozone compose environment with default configs:

s3g_1       | 2022-10-14 19:32:35,747 [qtp661119548-86] INFO protocolPB.OmTransportFactory: ZZZ created custom OmTransportFactory: org.apache.hadoop.ozone.om.protocolPB.GrpcOmTransportFactory@77d1c0db
s3g_1       | 2022-10-14 19:32:35,920 [qtp661119548-86] INFO protocolPB.GrpcOmTransport: GrpcOmTransport: started

The first log message is one I added in OmTransportFactory.

Also, enabling it only in OM, but not in S3 Gateway, would not have the following benefit:

Enabling it by default will allow to have more people to try it

Thank you @adoroszlai.

@xBis7 would be possible to get peak OM performance numbers, I would like to validate that the peak OM performance stays the same or is better with GRPC. I understand that each client is seeing better performance but it would be prudent to check the peak performance on OM as well.

@captainzmc
Copy link
Member

Hi @adoroszlai @smengcl , In dev email communication, this PR is required by Ozone 1.3. Currently the release of Ozone-1.3 is blocked on this PR. Could you help continue to review this PR? cc @neils-dev

@adoroszlai
Copy link
Contributor

adoroszlai commented Nov 9, 2022

@captainzmc, there is no need for further review of this PR, it is good for master, can be merged (based on the assumption that @smengcl's only concern was integration test coverage). Backporting it to 1.3 is a different question.

I don't see dev communication that this is blocking the release. I see a request to include it in the release if possible.

I'm not in favor of making such a change so late in the 1.3 release cycle. Enabling OM gRPC is a matter of simple config change, anyone can do it manually. I think it makes sense to first ship it disabled by default, let users enable on an as-needed basis and provide feedback, then enable in the next release based on that.

@captainzmc
Copy link
Member

I'm not in favor of making such a change so late in the 1.3 release cycle. Enabling OM gRPC is a matter of simple config change, anyone can do it manually.

@adoroszlai Agree with you. In fact, users can change the configuration to decide whether to use grpc. I will remove this issue from the block list in 1.3. Then we can start preparing for 1.3.0-rc0.

@neils-dev
Copy link
Contributor

Thanks @captainzmc for following up for this PR on the 1.3.0-rc branch and @adoroszlai for comments on that. I did request that this PR be included in the 1.3.0-rc but the comment from @adoroszlai is great, with it merged into the master as the default s3g config and allowing the 1.3.0-rc preparations without it,

@captainzmc, there is no need for further review of this PR, it is good for master, can be merged (based on the assumption that @smengcl's only concern was integration test coverage).
...remove this issue from the block list in 1.3. Then we can start preparing for 1.3.0-rc0.

@neils-dev neils-dev added the gr label Jan 26, 2023
@DaveTeng0
Copy link
Contributor

Hello guys~ looks like this PR is ready to be merged. Maybe we could go ahead to merge?

@kerneltime
Copy link
Contributor

@DaveTeng0 There is one bit of information we are waiting for, we need to measure the peak OM performance for Hadoop RPC vs GRPC and then this can be merged if there is no regression.

@kerneltime
Copy link
Contributor

@xBis7 we did some initial tests and OM peak performance is about the same between the 2.
Can you rebase this change?

@adoroszlai
Copy link
Contributor

@xBis7 this test error seems to be related, failed in both fork and PR runs:

Error:  Tests run: 17, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 166.658 s <<< FAILURE! - in org.apache.hadoop.ozone.TestSecureOzoneCluster
Error:  org.apache.hadoop.ozone.TestSecureOzoneCluster.testOMGrpcServerCertificateRenew  Time elapsed: 15.596 s  <<< ERROR!
java.io.IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: scm/fv-az210-921.jgpbknbp5ljuxny3rrcz4t2bbd.bx.internal.cloudapp.net@EXAMPLE.COM, expecting: om/fv-az210-921.jgpbknbp5ljuxny3rrcz4t2bbd.bx.internal.cloudapp.net@EXAMPLE.COM

https://github.com/xBis7/ozone/actions/runs/4534441887/jobs/7988627717#step:5:3707
https://github.com/apache/ozone/actions/runs/4534442441/jobs/7988632110#step:5:3707

@xBis7
Copy link
Contributor Author

xBis7 commented Mar 27, 2023

@adoroszlai Thanks for updating the PR and pointing out the test failure. I'll take a look.

@xBis7
Copy link
Contributor Author

xBis7 commented Mar 28, 2023

We added this file /hadoop-ozone/integration-test/src/test/resources/META-INF/services/org.apache.hadoop.ozone.om.protocolPB.OmTransportFactory under integration-test which was forcing tests to use Hadoop RPC despite their configurations. The test that was failing, needed to use gRPC and had been configured to do so but this file was overriding the setting.

I've removed it and run a workflow on a dummy branch successfully.
https://github.com/xBis7/ozone/actions/runs/4544882681

@adoroszlai adoroszlai merged commit e61f348 into apache:master Apr 19, 2023
@adoroszlai
Copy link
Contributor

Thanks @xBis7 for the patch, @duongkame, @kerneltime, @smengcl for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Apr 20, 2023
* master: (440 commits)
  HDDS-8445. Move PlacementPolicy back to SCM (apache#4588)
  HDDS-8335. ReplicationManager: EC Mis and Under replication handlers should handle overloaded exceptions (apache#4593)
  HDDS-8355. Intermittent failure in TestOMRatisSnapshots#testInstallSnapshot (apache#4592)
  HDDS-8444. Increase timeout of CI build (apache#4586)
  HDDS-8446. Selective checks: handle change in ci.yaml (apache#4587)
  HDDS-8440. Ozone Manager crashed with ClassCastException when deleting FSO bucket. (apache#4582)
  HDDS-7309. Enable by default GRPC between S3G and OM (apache#3820)
  HDDS-8458. Mark TestBlockDeletion#testBlockDeletion as flaky
  HDDS-8385. Ozone can't process snapshot when service UID > 2097151 (apache#4580)
  HDDS-8424: Preserve legacy bucket getKeyInfo behavior (apache#4576)
  HDDS-8453. Mark TestDirectoryDeletingServiceWithFSO#testDirDeletedTableCleanUpForSnapshot as flaky
  HDDS-8137. [Snapshot] SnapDiff to use tombstone entries in SST files (apache#4376)
  HDDS-8270. Measure checkAccess latency for Ozone objects (apache#4467)
  HDDS-8109. Seperate Ratis and EC MisReplication Handling (apache#4577)
  HDDS-8429. Checkpoint is not closed properly in OMDBCheckpointServlet (apache#4575)
  HDDS-8253. Set ozone.metadata.dirs to temporary dir if not defined in S3 Gateway (apache#4455)
  HDDS-8400. Expose rocksdb last sequence number through metrics (apache#4557)
  HDDS-8333. ReplicationManager: Allow partial EC reconstruction if insufficient nodes available (apache#4579)
  HDDS-8147. Introduce latency metrics for S3 Gateway operations (apache#4383)
  HDDS-7908. Support OM Metadata operation Generator in `Ozone freon` (apache#4251)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants