Describe the bug
We collect ~400-500k LogRecords/s with the otel-collector and send them to our Loki system via the lokiexporter today. With that exporter, we include between 15 and 20 stream labels, depending on the kind of log record. We tried switching to native OTLP ingestion of the LogRecords - with the resource attributes ingested as structured metadata - but it blew up our memory footprint by over 10x across both the distributor and ingester pods:
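For context, the change on the collector side is essentially just swapping the exporter. A minimal sketch of the two paths (endpoints, receivers, and processors here are placeholders, not our real config):

exporters:
  # old path: Loki push API via the lokiexporter
  loki:
    endpoint: https://loki.example.internal/loki/api/v1/push
  # new path: native OTLP ingestion, pointed at Loki's OTLP endpoint
  otlphttp:
    endpoint: https://loki.example.internal/otlp

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch]
      exporters: [otlphttp]   # previously: [loki]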
For example, here's the Distributor CPU and Memory Usage
(the experiment started at 10am and ended at ~9PM)

Here is the Ingester CPU and Memory Usage

To Reproduce
Hard to say exactly - but here is the #1 log line in our system, by roughly 50x, so it really makes up the majority of our data and cardinality:
Here's what a standard istio-proxy log line looks like with the standard lokiexporter, with us explicitly picking out stream labels:
{
"requested_server_name":null,
"route_name":null,
"upstream_local_address":"100.64.176.55:41486",
"connection_termination_details":null,
"protocol":"HTTP/1.1",
"upstream_service_time":"9",
"user_agent":null,
"duration":9,
"grpc_status":null,
"authority":"xx.xx.com",
"start_time":"2024-06-10T11:20:28.687Z",
"response_code_details":"via_upstream",
"upstream_cluster":"outbound|80||xxx.xxx.svc.cluster.local",
"response_code":404,
"response_flags":"-",
"upstream_transport_failure_reason":null,
"downstream_remote_address":"yy.yy.yy.yy:0",
"authorization":null,
"upstream_host":"100.64.185.241:80",
"downstream_local_address":"xx.xx.xx.xx:443",
"method":"POST",
"path":"/v2/xxx",
"bytes_received":179,
"x_forwarded_for":"85.115.208.156,xx.xx.xx.xx",
"request_id":"bbe7c561-xxxx-xxxx-xxxx-adfc9f18c45c",
"bytes_sent":151
}
We turn that into these stream labels (the collector-side config that selects them is sketched after this list):
env: production
group: xxx
cloud_availability_zone: us-west-2c
cluster: us1
exporter: OTLP
istio_authority: yyy.xxx.com
istio_method: POST
istio_protocol: HTTP/1.1
istio_response_code: 404
istio_response_flags: -
k8s_container_name: istio-proxy
k8s_container_restart_count: 0
k8s_deployment_name: istio-public-gateway
k8s_namespace_name: istio-gateways
k8s_node_name: ip-xx-xx-xx-xx.us-west-2.compute.internal
k8s_pod_name: istio-public-gateway-xxx-mxzrv
k8s_pod_uid: e5d0266b-2f24-xxx-xxx-f4e0b5398e7d
k8s_replicaset_name: istio-public-gateway-xxx
level: info
log_file_path: /hostfs/var/log/pods/istio-gateways_istio-public-gateway-xxx-mxzrv_e5d0266b-2f24-xxx-xxx-f4e0b5398e7d/istio-proxy/0.log
log_iostream: stdout
node_name: ip-xx-xx-xx-xx.us-west-2.compute.internal
service_name: unknown_service
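The selection above is driven on the collector side. Roughly how we do it, assuming the lokiexporter's label hints (the attribute list here is abbreviated and the keys are illustrative):

processors:
  resource:
    attributes:
      # hint read by the lokiexporter: promote these resource attributes to stream labels
      - action: insert
        key: loki.resource.labels
        value: env, group, cluster, k8s.namespace.name, k8s.container.name, k8s.pod.name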
When we run the same logs through the otlphttpexporter, though, and let Loki pick out the resource attributes in the distributor, we see:
__stream_shard__: 1
env: production
group: xx
cloud_account_id: xxxx
cloud_availability_zone: us-west-2b
cloud_platform: aws_eks
cloud_provider: aws
cloud_region: us-west-2
cluster: us1
container_id: af2ce7cdxxxx83d526fbf7953d23711de71b7e51a1ad5f3c6
istio_authority: xx.xx.com
istio_method: POST
istio_protocol: HTTP/1.1
istio_response_code: 200
istio_response_flags: -
k8s_container_name: istio-proxy
k8s_container_restart_count: 0
k8s_deployment_name: istio-public-gateway
k8s_namespace_name: istio-gateways
k8s_node_name: ip-xx-xx-xx-xx.us-west-2.compute.internal
k8s_node_uid: e7e6e32b-xxxx-xxxx-xxxx-a0eb6c0e69a9
k8s_pod_name: istio-public-gateway-8544b5f59d-qdgfm
k8s_pod_uid: 865903b8-xxxx-xxxx-xxxx-2699ef0a847c
k8s_replicaset_name: istio-public-gateway-8544b5f59d
level: info
log_file_path: /hostfs/var/log/pods/istio-gateways_istio-public-gateway-8544b5f59d-qdgfm_865903b8-xxxx-xxxx-xxxx-2699ef0a847c/istio-proxy/0.log
log_iostream: stdout
logtag: F
node_name: ip-100-xx-xx-xx.us-west-2.compute.internal
observed_timestamp: 1717979321339901431
os_type: linux
service_name: unknown_service
time: 2024-06-10T00:28:41.326710114Z
Expected behavior
I certainly understand that Loki is doing more work now to process the data, and I expected memory to go up - but I did not expect the aggregate memory usage of the distributors to go from ~6-8Gi to ~80-150Gi:
The ingesters are worse - we went from ~350-500Gi to 8+TB of memory usage, and they still couldn't keep up:

Environment:
- Infrastructure: Kubernetes on AWS EKS using BottleRocket
- Deployment tool: ArgoCD / Helm
We would love to take advantage of the new system, but it seems there's some critical tuning to do. Any suggestions?
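Our working assumption is that the knob to tune is the per-tenant otlp_config under limits_config, which controls which resource attributes become index labels versus structured metadata. A hedged sketch of what we think we'd try, based on our reading of the Loki 3.x docs (attribute names taken from our data; we have not validated the exact schema):

limits_config:
  otlp_config:
    resource_attributes:
      # don't apply the default promotion list wholesale (assumption)
      ignore_defaults: true
      attributes_config:
        # promote only a small, low-cardinality set to index labels
        - action: index_label
          attributes:
            - k8s.namespace.name
            - k8s.container.name
            - cluster
        # keep high-cardinality identifiers out of the index
        - action: structured_metadata
          attributes:
            - k8s.pod.name
            - k8s.pod.uid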