Conversation

@xBis7 (Contributor) commented Oct 24, 2022

What changes were proposed in this pull request?

On the Prometheus endpoint for the OM, in the DecayRpcScheduler summary for users, the username is exposed in the metric name. This makes it almost impossible to monitor these values, because every time a new user shows up we need to register a new metric name.

The metric name org_apache_hadoop_ipc_decay_rpc_scheduler_volume becomes org_apache_hadoop_ipc_decay_rpc_scheduler_caller_hadoop_volume for a user with the username hadoop.

The proposed solution is to filter the metric, remove the username from the metric name, and add it as a username tag instead.

This metric comes from hadoop-common-3.3.4.jar/DecayRpcScheduler, more specifically:

// DecayRpcScheduler embeds the caller's username in the counter name,
// producing per-user metric names such as "Caller(hadoop).Volume".
Metrics2Util.NameValuePair entry = (Metrics2Util.NameValuePair) topNCallers.poll();
String topCaller = "Caller(" + entry.getName() + ")";
String topCallerVolume = topCaller + ".Volume";
String topCallerPriority = topCaller + ".Priority";
rb.addCounter(Interns.info(topCallerVolume, topCallerVolume), entry.getValue());

The name is in the format Caller(username).MetricType, e.g. Caller(hadoop).Volume. The username might exist in the metric name for a purpose, so we don't want to change the way the metric works, only the way it's presented. The cleanest way to deal with this seems to be filtering the metric in PrometheusMetricsSink. We are not making any functional changes; we are only filtering what is presented to the user.
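A minimal sketch of that filtering idea, assuming a name of the form Caller(username).MetricType (the class and method names below are illustrative, not the actual PrometheusMetricsSink code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: split "Caller(hadoop).Volume" into a stable base name
// and the username, so the sink can emit the base name plus a username tag.
public class CallerMetricFilter {
  private static final Pattern CALLER =
      Pattern.compile("Caller\\((.+)\\)\\.(Volume|Priority)");

  /** Returns {baseName, username} if the name matches, null otherwise. */
  static String[] split(String metricName) {
    Matcher m = CALLER.matcher(metricName);
    if (!m.matches()) {
      return null;
    }
    return new String[] {m.group(2), m.group(1)};
  }

  public static void main(String[] args) {
    String[] parts = split("Caller(hadoop).Volume");
    // prints: Volume username=hadoop
    System.out.println(parts[0] + " username=" + parts[1]);
  }
}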

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7394

How was this patch tested?

New unit tests were added for this patch. It was also tested manually, with a docker cluster and the OM /prom endpoint.

To test it in a docker environment:

In compose/ozone, add the following to docker-config:

CORE-SITE.XML_ipc.9862.callqueue.impl=org.apache.hadoop.ipc.FairCallQueue
CORE-SITE.XML_ipc.9862.scheduler.impl=org.apache.hadoop.ipc.DecayRpcScheduler
CORE-SITE.XML_ipc.9862.scheduler.priority.levels=2
CORE-SITE.XML_ipc.9862.backoff.enable=true
CORE-SITE.XML_ipc.9862.faircallqueue.multiplexer.weights=99,1
CORE-SITE.XML_ipc.9862.decay-scheduler.thresholds=90
OZONE-SITE.XML_ozone.om.address=0.0.0.0:9862

then run:

$ export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml
$ docker-compose up --scale datanode=3 -d
$ docker exec -it ozone_s3g_1 bash
bash-4.2$ export AWS_ACCESS_KEY=test AWS_SECRET_KEY=pass
bash-4.2$ ozone freon s3bg -t 1 -n 10

In your browser, go to http://localhost:9874/prom and you should see:

# TYPE org_apache_hadoop_ipc_decay_rpc_scheduler_priority counter
org_apache_hadoop_ipc_decay_rpc_scheduler_priority{context="ipc.9862",hostname="e32f2e3bddb9",username="hadoop"} 1
...
...
...
# TYPE org_apache_hadoop_ipc_decay_rpc_scheduler_volume counter
org_apache_hadoop_ipc_decay_rpc_scheduler_volume{context="ipc.9862",hostname="e32f2e3bddb9",username="hadoop"} 21
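With the username in a label instead of the metric name, one stable series name covers all users, so a single aggregation query is enough, e.g. this PromQL (shown only as a usage illustration):

sum by (username) (org_apache_hadoop_ipc_decay_rpc_scheduler_volume)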

@kerneltime (Contributor) commented:

CC @duongkame

@xBis7 (Contributor, Author) commented Nov 2, 2022

@kerneltime any updates on this?

@duongkame (Contributor) commented:

Thanks @xBis7 for the patch.

The name is in the format Caller(username).MetricType, e.g. Caller(hadoop).Volume. The username might exist in the metric name for a purpose, so we don't want to change the way the metric works, only the way it's presented. The cleanest way to deal with this seems to be filtering the metric in PrometheusMetricsSink.

I still think this should rather be a change in hadoop-common. To keep it backward compatible, we may just introduce a new monitorable metric without changing the existing one. Capturing and modifying metrics in Ozone PrometheusMetricsSink looks pretty hacky, but if we have to do so, we'd rather copy the metrics to a new one without changing the original metric.
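A self-contained sketch of that "copy, don't mutate" idea (the class, method, and sample values below are illustrative assumptions, not the actual sink code):

import java.io.PrintWriter;

public class CopyNotMutate {
  // Writes one Prometheus-style sample line.
  static void writeSample(PrintWriter out, String name, String labels, long value) {
    out.println(name + "{" + labels + "} " + value);
  }

  public static void main(String[] args) {
    PrintWriter out = new PrintWriter(System.out, true);
    // The original caller-specific metric is emitted untouched...
    writeSample(out,
        "org_apache_hadoop_ipc_decay_rpc_scheduler_caller_hadoop_volume",
        "context=\"ipc.9862\"", 21);
    // ...and a second, stable-named copy carries the user as a label.
    writeSample(out,
        "org_apache_hadoop_ipc_decay_rpc_scheduler_volume",
        "context=\"ipc.9862\",username=\"hadoop\"", 21);
  }
}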

@kerneltime (Contributor) commented:

@xBis7 I understand the need, but I think there was intent in the original metric, and this comes from being able to diagnose which application/user is seeing what metric.
From: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FairCallQueue.html

The implementation of RpcScheduler used with FairCallQueue by default is DecayRpcScheduler, which maintains a count of requests received for each user.

Is this for ofs use cases or s3 based access?

@xBis7 (Contributor, Author) commented Nov 4, 2022

@kerneltime @duongkame Thanks for taking a look at the patch.

I think there was intent in the original metric, and this comes from being able to diagnose which application/user is seeing what metric.

I agree, that's why there are no functional changes. The metric that gets passed back and forth is exactly the same, and we are only filtering how the endpoint presents it, similar to what is already done here with the RocksDB metrics.

Is this for ofs use cases or s3 based access?

It was used with S3 but it could be for both use cases.

I still think this should rather be a change in hadoop-common.

That was my first thought, but it seems too complex for what we need. We don't want to change the metric or reduce the number of times it gets registered. We need the name to be consistent, so that it's easier to track with a wildcard search or something like that, but we still want to be able to see the user for every entry.
If you think that's best, I'm happy to move forward with it. We don't have to make the change in hadoop-common; instead we can create a class in Ozone that overrides the one from the jar.

@neils-dev (Contributor) commented Nov 4, 2022

Is this for ofs use cases or s3 based access?

This applies to both, @kerneltime. This Prometheus metric, org_apache_hadoop_ipc_decay_rpc_scheduler_caller_*, includes the user in the metric name on all RPC metric events.

@kerneltime merged commit 4de7d56 into apache:master on Nov 30, 2022
@neils-dev (Contributor) commented:

Thanks @xBis7 for this important patch. Thanks @duongkame and @kerneltime for reviewing this PR and for your comments.
