
[otlp/logrecords]: Ingestion of native OTLP LogRecords blew up memory usage in Distributors and Ingesters. #13185

Open
diranged opened this issue Jun 10, 2024 · 3 comments
Labels
OTEL/OLTP type/bug Something is not working as expected

Comments


diranged commented Jun 10, 2024

Describe the bug
We collect ~400-500k LogRecords/s with the otel-collector and today send them to our Loki system via the lokiexporter. With that exporter, we include between 15 and 20 stream labels, depending on the kind of log record. We tried switching to native OTLP ingestion of the LogRecords, along with ingestion of the resource attributes as structured metadata, but it absolutely blew up our memory footprint by over 10x across both distributor and ingester pods.
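
For context, the exporter switch on the collector side looked roughly like this (a sketch, not our exact config; the endpoint URLs, receiver, and processor names here are placeholders):

exporters:
  # Current setup: the lokiexporter pushing to Loki's classic push API,
  # with stream labels we pick out explicitly (see below).
  loki:
    endpoint: https://loki.example.com/loki/api/v1/push   # placeholder URL

  # Experiment: native OTLP over HTTP into Loki's OTLP endpoint.
  otlphttp:
    endpoint: https://loki.example.com/otlp   # placeholder URL; /v1/logs is appended

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes, resource, batch]
      exporters: [otlphttp]   # was: [loki]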

For example, here is the Distributor CPU and memory usage (the experiment started at ~10 AM and ended at ~9 PM):

[screenshot: Distributor CPU and memory usage]

Here is the Ingester CPU and memory usage:

[screenshot: Ingester CPU and memory usage]

To Reproduce
Hard to say exactly, but the #1 log line in our system (by roughly 50x) accounts for the majority of our data and cardinality.

Here is what a standard istio-proxy log line looks like with the standard lokiexporter, where we explicitly pick out the stream labels:

{
   "requested_server_name":null,
   "route_name":null,
   "upstream_local_address":"100.64.176.55:41486",
   "connection_termination_details":null,
   "protocol":"HTTP/1.1",
   "upstream_service_time":"9",
   "user_agent":null,
   "duration":9,
   "grpc_status":null,
   "authority":"xx.xx.com",
   "start_time":"2024-06-10T11:20:28.687Z",
   "response_code_details":"via_upstream",
   "upstream_cluster":"outbound|80||xxx.xxx.svc.cluster.local",
   "response_code":404,
   "response_flags":"-",
   "upstream_transport_failure_reason":null,
   "downstream_remote_address":"yy.yy.yy.yy:0",
   "authorization":null,
   "upstream_host":"100.64.185.241:80",
   "downstream_local_address":"xx.xx.xx.xx:443",
   "method":"POST",
   "path":"/v2/xxx",
   "bytes_received":179,
   "x_forwarded_for":"85.115.208.156,xx.xx.xx.xx",
   "request_id":"bbe7c561-xxxx-xxxx-xxxx-adfc9f18c45c",
   "bytes_sent":151
}

We turn that into the following stream labels (the lokiexporter hint config that selects them is sketched after the list):

env: production
group: xxx
cloud_availability_zone: us-west-2c
cluster: us1
exporter: OTLP
istio_authority: yyy.xxx.com
istio_method: POST
istio_protocol: HTTP/1.1
istio_response_code: 404
istio_response_flags: -
k8s_container_name: istio-proxy
k8s_container_restart_count: 0
k8s_deployment_name: istio-public-gateway
k8s_namespace_name: istio-gateways
k8s_node_name: ip-xx-xx-xx-xx.us-west-2.compute.internal
k8s_pod_name: istio-public-gateway-xxx-mxzrv
k8s_pod_uid: e5d0266b-2f24-xxx-xxx-f4e0b5398e7d
k8s_replicaset_name: istio-public-gateway-xxx
level: info
log_file_path: /hostfs/var/log/pods/istio-gateways_istio-public-gateway-xxx-mxzrv_e5d0266b-2f24-xxx-xxx-f4e0b5398e7d/istio-proxy/0.log
log_iostream: stdout
node_name: ip-xx-xx-xx-xx.us-west-2.compute.internal
service_name: unknown_service
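
For reference, with the lokiexporter those labels are selected via its label hints (loki.resource.labels / loki.attribute.labels), roughly like this (a sketch; the attribute names are illustrative, not our full list of 15-20):

processors:
  resource:
    attributes:
      # Hint telling the lokiexporter which resource attributes to promote to stream labels.
      - action: insert
        key: loki.resource.labels
        value: k8s.namespace.name, k8s.container.name, k8s.deployment.name, cloud.availability_zone
  attributes:
    actions:
      # Same hint for log-record attributes (e.g. the parsed istio access-log fields).
      - action: insert
        key: loki.attribute.labels
        value: level, istio.method, istio.response_code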

When we run the same logs through the otlphttpexporter instead and let Loki pick out the resource attributes in the distributor, we see:

__stream_shard__: 1
env: production
group: xx
cloud_account_id: xxxx
cloud_availability_zone: us-west-2b
cloud_platform: aws_eks
cloud_provider: aws
cloud_region: us-west-2
cluster: us1
container_id: af2ce7cdxxxx83d526fbf7953d23711de71b7e51a1ad5f3c6
istio_authority: xx.xx.com
istio_method: POST
istio_protocol: HTTP/1.1
istio_response_code: 200
istio_response_flags: -
k8s_container_name: istio-proxy
k8s_container_restart_count: 0
k8s_deployment_name: istio-public-gateway
k8s_namespace_name: istio-gateways
k8s_node_name: ip-xx-xx-xx-xx.us-west-2.compute.internal
k8s_node_uid: e7e6e32b-xxxx-xxxx-xxxx-a0eb6c0e69a9
k8s_pod_name: istio-public-gateway-8544b5f59d-qdgfm
k8s_pod_uid: 865903b8-xxxx-xxxx-xxxx-2699ef0a847c
k8s_replicaset_name: istio-public-gateway-8544b5f59d
level: info
log_file_path: /hostfs/var/log/pods/istio-gateways_istio-public-gateway-8544b5f59d-qdgfm_865903b8-xxxx-xxxx-xxxx-2699ef0a847c/istio-proxy/0.log
log_iostream: stdout
logtag: F
node_name: ip-100-xx-xx-xx.us-west-2.compute.internal
observed_timestamp: 1717979321339901431
os_type: linux
service_name: unknown_service
time: 2024-06-10T00:28:41.326710114Z

Expected behavior
I certainly understand that Loki is doing more work now to process the data, and I expect memory to go up, but I did not expect the aggregate memory usage of the distributors to go from ~6-8Gi to ~80-150Gi:

[screenshot: Distributor memory usage]

The ingesters are worse: we went from ~350-500Gi to 8+ TB of memory usage, and they still couldn't keep up:

[screenshot: Ingester memory usage]

Environment:

  • Infrastructure: Kubernetes on AWS EKS using BottleRocket
  • Deployment tool: ArgoCD / Helm

We would love to take advantage of the new system, but it seems there is some critical tuning to do. Any suggestions?
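
The direction we are assuming we need to go is Loki's per-tenant otlp_config in limits_config, which controls which resource attributes become index labels, which are kept as structured metadata, and which get dropped (a sketch assuming the Loki 3.x schema; the attribute lists are illustrative, not a recommendation):

limits_config:
  otlp_config:
    resource_attributes:
      attributes_config:
        # Keep only a small set of low-cardinality resource attributes as index labels...
        - action: index_label
          attributes:
            - k8s.namespace.name
            - k8s.container.name
            - cloud.availability_zone
        # ...keep high-cardinality identifiers as structured metadata instead of labels...
        - action: structured_metadata
          attributes:
            - k8s.pod.uid
            - k8s.node.uid
            - container.id
        # ...and drop attributes we never query on.
        - action: drop
          attributes:
            - os.type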

JohanLindvall (Contributor) commented

Possibly related to #13123?

mveitas (Contributor) commented Oct 22, 2024

@diranged Were you able to resolve the issues with the OTLP ingestion, or did you stick with the lokiexporter?

diranged (Author) commented

@diranged Were you able to resolve the issues with the OTLP ingestion, or did you stick with the lokiexporter?

We are stuck on the Loki Exporter for the time being.
