diff --git a/hadoop-hdds/docs/content/feature/Observability.md b/hadoop-hdds/docs/content/feature/Observability.md
index cab68780912e..1ee95d8ade9a 100644
--- a/hadoop-hdds/docs/content/feature/Observability.md
+++ b/hadoop-hdds/docs/content/feature/Observability.md
@@ -69,7 +69,7 @@ Tracing is turned off by default, but can be turned on with `hdds.tracing.enable
 </property>
 ```
 
-Jager client can be configured with environment variables as documented [here](https://github.com/jaegertracing/jaeger-client-java/blob/master/jaeger-core/README.md):
+Jaeger client can be configured with environment variables as documented [here](https://github.com/jaegertracing/jaeger-client-java/blob/master/jaeger-core/README.md):
 
 For example:
 
diff --git a/hadoop-hdds/docs/content/feature/Observability.zh.md b/hadoop-hdds/docs/content/feature/Observability.zh.md
new file mode 100644
index 000000000000..7a5c67b4cdd4
--- /dev/null
+++ b/hadoop-hdds/docs/content/feature/Observability.zh.md
@@ -0,0 +1,217 @@
+---
+title: "Observability"
+weight: 8
+menu:
+   main:
+      parent: Features
+summary: Tools in Ozone that improve observability
+---
+
+
+Ozone provides multiple tools for getting more information about the current state of the cluster.
+
+## Prometheus
+Ozone natively supports Prometheus integration. All internal metrics (collected by the Hadoop metrics framework) are published under the `/prom` HTTP endpoint (for example, http://localhost:9876/prom for SCM).
+
+The Prometheus endpoint is turned on by default, but can be turned off with the `hdds.prometheus.endpoint.enabled` configuration variable.
+
+In a secure environment the page is protected by SPNEGO authentication, which Prometheus does not support. To enable monitoring in a secure environment, a dedicated authentication token can be configured.
+
+Example `ozone-site.xml` configuration:
+
+```XML
+<property>
+   <name>hdds.prometheus.endpoint.token</name>
+   <value>putyourtokenhere</value>
+</property>
+```
+
+Example Prometheus configuration:
+```YAML
+scrape_configs:
+  - job_name: ozone
+    bearer_token: <putyourtokenhere>
+    metrics_path: /prom
+    static_configs:
+      - targets:
+          - "127.0.0.1:9876"
+```
+
+## Distributed tracing
+Distributed tracing helps identify performance bottlenecks by visualizing end-to-end performance.
+
+Ozone uses the [jaeger](https://jaegertracing.io) tracing library to collect traces, and it can send the tracing data to any compatible backend (Zipkin, …).
+
+Tracing is turned off by default, but can be turned on with the `hdds.tracing.enabled` configuration variable in `ozone-site.xml`.
+
+```XML
+<property>
+   <name>hdds.tracing.enabled</name>
+   <value>true</value>
+</property>
+```
+
+The Jaeger client can be configured with environment variables, as described in [this document](https://github.com/jaegertracing/jaeger-client-java/blob/master/jaeger-core/README.md).
+
+For example:
+
+```shell
+JAEGER_SAMPLER_PARAM=0.01
+JAEGER_SAMPLER_TYPE=probabilistic
+JAEGER_AGENT_HOST=jaeger
+```
+
+This configuration records 1% of the requests to limit the performance overhead. For more information about Jaeger sampling, see the [documentation](https://www.jaegertracing.io/docs/1.18/sampling/#client-sampling-configuration).
+
+## Ozone Insight
+Ozone Insight is a tool for checking the current state of an Ozone cluster. It can display the logging, metrics, and configuration of a particular component.
+
+Use the `ozone insight list` command to check the available components:
+
+```shell
+> ozone insight list
+
+Available insight points:
+
+  scm.node-manager                   SCM Datanode management related information.
+  scm.replica-manager                SCM closed container replication manager
+  scm.event-queue                    Information about the internal async event delivery
+  scm.protocol.block-location        SCM Block location protocol endpoint
+  scm.protocol.container-location    SCM Container location protocol endpoint
+  scm.protocol.security              SCM Block location protocol endpoint
+  om.key-manager                     OM Key Manager
+  om.protocol.client                 Ozone Manager RPC endpoint
+  datanode.pipeline                  More information about one ratis datanode ring.
+```
+
+## Configuration
+
+`ozone insight config` shows the configuration related to a specific component (supported only for selected components).
+
+```shell
+> ozone insight config scm.replica-manager
+
+Configuration for `scm.replica-manager` (SCM closed container replication manager)
+
+>>> hdds.scm.replication.thread.interval
+       default: 300s
+       current: 300s
+
+There is a replication monitor thread running inside SCM which takes care of replicating the containers in the cluster. This property is used to configure the interval in which that thread runs.
+
+
+>>> hdds.scm.replication.event.timeout
+       default: 30m
+       current: 30m
+
+Timeout for the container replication/deletion commands sent to datanodes. After this timeout the command will be retried.
+
+```
+
+## Metrics
+`ozone insight metrics` shows the metrics related to a specific component (supported only for selected components).
+```shell
+> ozone insight metrics scm.protocol.block-location
+Metrics for `scm.protocol.block-location` (SCM Block location protocol endpoint)
+
+RPC connections
+
+  Open connections: 0
+  Dropped connections: 0
+  Received bytes: 1267
+  Sent bytes: 2420
+
+
+RPC queue
+
+  RPC average queue time: 0.0
+  RPC call queue length: 0
+
+
+RPC performance
+
+  RPC processing time average: 0.0
+  Number of slow calls: 0
+
+
+Message type counters
+
+  Number of AllocateScmBlock: ???
+  Number of DeleteScmKeyBlocks: ???
+  Number of GetScmInfo: ???
+  Number of SortDatanodes: ???
+```
+
+## Logs
+
+`ozone insight logs` connects to the required service and shows the DEBUG/TRACE logs related to one specific component. For example, to display RPC messages:
+
+```shell
+>ozone insight logs om.protocol.client
+
+[OM] 2020-07-28 12:31:49,988 [DEBUG|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] OzoneProtocol ServiceList request is received
+[OM] 2020-07-28 12:31:50,095 [DEBUG|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] OzoneProtocol CreateVolume request is received
+```
+
+With the `-v` flag, the content of the protobuf messages is shown as well (TRACE-level logs):
+
+```shell
+ozone insight logs -v om.protocol.client
+
+[OM] 2020-07-28 12:33:28,463 [TRACE|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] [service=OzoneProtocol] [type=CreateVolume] request is received:
+cmdType: CreateVolume
+traceID: ""
+clientId: "client-A31DF5C6ECF2"
+createVolumeRequest {
+  volumeInfo {
+    adminName: "hadoop"
+    ownerName: "hadoop"
+    volume: "vol1"
+    quotaInBytes: 1152921504606846976
+    volumeAcls {
+      type: USER
+      name: "hadoop"
+      rights: "200"
+      aclScope: ACCESS
+    }
+    volumeAcls {
+      type: GROUP
+      name: "users"
+      rights: "200"
+      aclScope: ACCESS
+    }
+    creationTime: 1595939608460
+    objectID: 0
+    updateID: 0
+    modificationTime: 0
+  }
+}
+
+[OM] 2020-07-28 12:33:28,474 [TRACE|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] [service=OzoneProtocol] [type=CreateVolume] request is processed. Response:
+cmdType: CreateVolume
+traceID: ""
+success: false
+message: "Volume already exists"
+status: VOLUME_ALREADY_EXISTS
+```
+
+
\ No newline at end of file