Distribution of traces/spans among collectors #1678
Comments
If you are using the Jaeger Agent, you can configure them to use gRPC instead of Thrift ( If your tracers are connecting directly to the collector, only TChannel is supported at the moment, and it's not possible to load balance individual requests. |
We are using the agent and will check the configuration. For now, we manage the Jaeger setup, and several other teams are bombarding it with traces and spans. Is there anything we can do at the collector level? |
@prana24 at Uber we recommend that all teams use an internal wrapper for the Jaeger client libraries, which makes sure that production services are always using If you have no control over the clients, the brute-force solution is to implement downsampling in the collector (which we do at Uber, but at this point more as a safety measure). Downsampling is consistently based on the trace ID hash, so you don't get partial traces, but it affects all users equally, not just the offending service. Another approach is throttling clients doing sampling, but it's not currently implemented (#1676). The best solution imo is tail-based sampling, which Jaeger does not yet support directly, but you can get it with OpenCensus Service. |
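To illustrate the trace-ID-based downsampling mentioned above, here is a minimal Python sketch. It is not Jaeger's actual implementation: the function name `keep_trace`, the `ratio` parameter, and the choice of SHA-256 are assumptions made purely for illustration. What it shows is that the keep/drop decision depends only on the trace ID, so every collector reaches the same decision for all spans of a trace and no partial traces are produced.

```python
import hashlib


def keep_trace(trace_id: str, ratio: float) -> bool:
    """Return True if all spans of this trace should be kept.

    Hypothetical helper: the decision is a pure function of the trace ID,
    so it is identical no matter which collector receives the span.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace if its bucket falls below the configured share.
    return bucket < int(ratio * 2**64)
```

Calling `keep_trace(trace_id, 0.1)` on every incoming span would keep roughly 10% of traces. As noted in the comment above, such a filter applies to all services equally, not just the one sending too much.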
We were using jaeger-agent 1.8.x, and I see that gRPC was probably not enabled in that version. I am upgrading the agent to the latest (1.13.x). My collector is still on 1.9.x. Is this version OK, or should I upgrade that as well? |
If you can, keep both the collector and the agent at the same version. |
Thank you @jpkrohling, I have done that. I have a basic question about dns:///<service_name>:14250: what is <service_name>? Is it the same name that we get from the command |
It's the DNS name under which the service can be reached. In Kubernetes, this is typically If you are using Kubernetes, I recommend taking a look at the jaeger-operator. Even if you decide not to use it for production, you might benefit from seeing how it deploys Jaeger. |
Sure, thank you. I am taking a look. |
Hi,
I have also pasted agent.yaml here.
Any idea what is wrong here? |
Nothing seems wrong there: gRPC tried to load some extra configuration via DNS but couldn't find anything "extra". As you can see in the following log entries, the connection with the collector was established and is ready:
So, looks like it's working ;-) |
It is working, but I was expecting the agent to send traces/spans to both collectors, and it is currently sending to only one. In other words, it is not load balanced. Am I missing something here? |
You might not see round-robin load balancing, as gRPC will reuse the same pipe for multiple requests, but one easy way to check that it's working as expected is by killing one of the collectors. If the agent switches over to the remaining collector, the load balancing is working. |
I just checked the gRPC docs, and it seems that it should indeed be doing round-robin balancing:
Source: https://github.com/grpc/grpc/blob/master/doc/load-balancing.md
What do you mean here? The communication between Agent and Collector should be via gRPC, not via TChannel. |
@jkandasa, @kevinearls I think one of you ran some tests for this behavior in the past. Can you spot if there's anything missing here? |
Just to clear up the confusion: the Grafana image that I posted shows the production problem I want to solve (agent and collector running on 1.8.x with TChannel). |
I'm confused now: you are seeing load balanced traffic in production, but not on your dev environment? |
My production version is 1.8, communicating over TChannel, and the Grafana images are from the production environment. They show that the load is unbalanced and that there are drops. I want to check whether moving to 1.13.x with gRPC would solve the problem in production, which is why I am trying 1.13.1 + gRPC in dev (the agent.yaml and log that I shared). |
@jpkrohling In openshift, we create an additional service( |
Wow! I can't wait. @jkandasa, can you give me more information? Where do I get it? Is there any reference or topology? |
@prana24 AFAIK, there is no specific example for creating a collector headless service. @objectiser can guide you better here. I just copied and modified the collector service YAML from the service file generated by jaeger-operator:
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-collector-headless
    labels:
      app: jaeger
      jaeger-infra: collector-service
  spec:
    clusterIP: None
    ports:
    - name: jaeger-collector-grpc
      port: 14250
      protocol: TCP
      targetPort: 14250
    selector:
      jaeger-infra: collector-pod
    type: ClusterIP
|
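A headless Service such as the one above is useful here because its DNS name resolves to the individual collector pod IPs rather than to a single cluster IP. A quick way to check what a name resolves to is sketched below in Python. The helper name `resolve_backends` is hypothetical, and the sketch only uses the standard library resolver; it says nothing about how the agent's gRPC client performs resolution internally.

```python
import socket


def resolve_backends(host: str, port: int) -> list[str]:
    """Return the sorted, de-duplicated IP addresses that `host` resolves to.

    Hypothetical helper for checking a headless Service from inside the
    cluster; it relies only on the system resolver.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({info[4][0] for info in infos})
```

Run from a pod in the same namespace, `resolve_backends("jaeger-collector-headless", 14250)` would be expected to return one address per ready collector pod. If it returns a single address, the name most likely points at a regular ClusterIP Service instead of a headless one.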
Thanks @jkandasa, I will give it a shot today. |
I have the same problem! The error is but the network was fine: Name: my-jaeger-collector-headless.kube-system |
@pujunYang could you please share what's your |
@jpkrohling yes
|
@jpkrohling here is the Jaeger.yaml:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger
  namespace: kube-system
spec:
  strategy: production # <1>
  allInOne:
    image: jaegertracing/all-in-one:latest # <2>
    options: # <3>
      log-level: debug # <4>
  storage:
    type: elasticsearch # <5>
    options: # <6>
      es: # <7>
        server-urls: http://elasticsearch-logging:9200
        tls:
          skip-host-verify: true
  ingress:
    enabled: false # <8>
  agent:
    strategy: DaemonSet # <9>
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: "" # <10>
|
I have another concern about load balancing. We use The problem: it seems it only resolves the list when the agent starts? |
Which version of Jaeger are you using, @parberge? We've bumped the gRPC client version in v1.20.0 which was recently released, and I know it has some improvements in this area, although I'm not 100% sure this case is covered. When fixing #2443, I remember reading that the gRPC client will get a new list of clients only when it runs out of healthy connections, but hopefully this newest gRPC client is smarter. |
I'm closing, as I think this has been answered some time ago, but feel free to reopen if there are still questions. |
Not 1.20 that's for sure. Will test and create an issue if the problem remains. Thanks. |
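Hey guys - I am seeing some of the same issues. On v1.19 right now, so will try to do the upgrade. But much like some of the folks are seeing, I am using the jaeger-operator, have the HPA set up for min/max of 2/10, and when CPU gets hammered during our bot/soak tests the collectors scale up as expected, but the agent connections continue to fire spans down their already existing connections. So effectively, it feels like more of a fault tolerance setup than a high availability one. @parberge before I go super deep, did you see any positive changes with v1.20.0? |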
Haven't upgraded yet 😑
On Thu, 15 Oct 2020 at 02:27, Josh Kierpiec <[email protected]> wrote:
… Hey guys - I am seeing some of the same issues. On v.1.19 right now, so
will try to do the upgrade. But much like some of the folks are seeing. I
am using the jaeger-operator, have the HPA setup for min/max of 2/10, and
when CPU gets hammered during our bot/soak tests the collectors scale up as
expected, but the agent connections continue to fire spans down their
already existing connections. So effectively, feels like more of a fault
tolerance setup than a high availability one. @parberge
<https://github.com/parberge> before I go super deep, did you see any
positive changes with v1.20.0?
|
Feel free to reopen this issue if you see the same problem happening on v1.20. |
Hi folks, unfortunately I see the same behavior as before. I am running the 1.20.0 container and agent (see below) via jaeger-operator. We execute our bots, which fire off spans/traces, and can observe the following in our graphs. You can see below that we scaled to 4 collector instances, but the agents have no knowledge that they should reconnect, so they continue to saturate the collectors they are already connected to. The situation makes sense. Am I missing a configuration that would let the collectors notify the agents that they should reconnect when spans are being dropped?
@jpkrohling will try to reopen, need to figure out how =) |
I'll check what we can do, but I think the gRPC client might need some time to update the list of backends. In earlier versions, it would update only if all known backends were failing. |
OK - I am letting this soak. This may also be something unique to how our bots are running: they are spun up asynchronously in a single service, so it would make sense that they send traffic to a single agent and thus overload the collector it's connected to. Is the agent designed to have a single connection to a collector at a given point in time? If that's the case, this MAY be OK for us in production, when the bots are replaced with real traffic that is load balanced across our edge service and thus distributed across the agents more naturally. |
I'd have to double-check with the gRPC client load balancer documentation, but I think that's indeed the case. The agent has a list of backends, but will only failover once its "current" backend fails. |
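The behavior described above (a list of backends, with failover only once the "current" one fails) can be pictured with a toy Python model. This is only an illustration of the described behavior, not the gRPC client's actual code; the class `PickFirst` and its methods are hypothetical names.

```python
class PickFirst:
    """Toy model of a client that sticks to one backend until it fails."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self.current = None

    def mark_down(self, backend):
        """Record that a backend is no longer reachable."""
        self.healthy.discard(backend)

    def pick(self):
        # Keep using the current backend for as long as it is healthy.
        if self.current in self.healthy:
            return self.current
        # Otherwise fail over to the first healthy backend in list order.
        for backend in self.backends:
            if backend in self.healthy:
                self.current = backend
                return backend
        raise RuntimeError("no healthy backends")
```

In this model, an agent created with two collectors sends every batch to the first one and switches to the second only after `mark_down` is called for the first. That matches the "fault tolerance rather than high availability" observation made earlier in the thread.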
@jkandasa do you remember from your load-tests what's the expected behavior here? |
I encountered the same issue when using the OpenTelemetry Collector with a headless Jaeger gRPC collector service in Kubernetes. Here are my configurations:
OpenTelemetry Collector:
exporters:
  jaeger:
    endpoint: "dns:///jaeger-collector-svc.monitoring.svc.cluster.local:14250"
    balancer_name: "round_robin"
    insecure: true
  logging:
    loglevel: info
extensions:
  health_check: {}
processors:
  batch: {}
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
  jaeger:
    protocols:
      thrift_compact: {}
      thrift_http: {}
extensions:
  health_check: {}
service:
  extensions:
  - health_check
  pipelines:
    traces:
      exporters:
      - logging
      - jaeger
      processors:
      - batch
      receivers:
      - otlp
      - jaeger
Jaeger collector headless service in Kubernetes:
apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app: jaeger-collector
  name: jaeger-collector-svc
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: admin
    port: 14269
    protocol: TCP
    targetPort: 14269
  - name: receive-span-from-jaeger-agent
    port: 14250
    protocol: TCP
    targetPort: 14250
  - name: receive-span-from-jaeger-client
    port: 14268
    protocol: TCP
    targetPort: 14268
  selector:
    app: jaeger-collector
  sessionAffinity: None
  type: ClusterIP
|
@JMCFTW, this is something to be checked and handled at the OpenTelemetry Collector side of things. I just created an issue there (open-telemetry/opentelemetry-collector#4274) and assigned it to myself. |
Hi @jpkrohling, thanks for referencing this issue in the OpenTelemetry Collector. I'm not sure whether this issue can be handled in the OpenTelemetry Collector, because it seems the gRPC client parameters don't have an option that lets the client (OpenTelemetry Collector) re-resolve the DNS name after the server (Jaeger collector) scales out or down. So in my opinion, a possible workaround is to let Since I have not investigated this issue for very long, please feel free to correct me if I'm wrong or have misunderstood something. |
That's a good hint, thanks! I think I faced a similar issue before, and if a fix is needed here on the Jaeger side of things, I'll fix it here. |
Just as a status update, I'm able to reproduce this. Reading some source code from gRPC Go, I was expecting the DNS resolution to happen every 30s, adding the new backends to the list and making them available as subchannels, but looks like it's not happening. I'll check a couple of things, and if they don't work, I'll give the The following screenshot shows a situation that started with 10 replicas and later scaled to 20 replicas, expecting the new ones to eventually start receiving traffic. The disparity of two of the numbers is because I had a wrong configuration. The remaining 8 similar numbers are after adjusting the config to take advantage of both. This is the config used:
|
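The expectation described in the status update (new replicas should start receiving traffic once they are added to the backend list) can be sketched with a small round-robin simulation in Python. This is a toy model under stated assumptions, not gRPC Go's balancer: `RoundRobinPool`, `refresh`, and `simulate` are hypothetical names, and `refresh` stands in for whatever mechanism hands a freshly resolved address list to the client.

```python
from collections import Counter


class RoundRobinPool:
    """Toy round-robin balancer whose backend list can be refreshed."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.next_index = 0

    def refresh(self, resolved):
        """Replace the backend list with a freshly resolved one."""
        if not resolved:
            raise ValueError("resolved list must not be empty")
        self.backends = sorted(set(resolved))
        self.next_index %= len(self.backends)

    def pick(self):
        """Return the next backend in rotation."""
        backend = self.backends[self.next_index]
        self.next_index = (self.next_index + 1) % len(self.backends)
        return backend


def simulate(pool, batches):
    """Count how many batches each backend receives."""
    return Counter(pool.pick() for _ in range(batches))
```

If `refresh` is never called after a scale-up, the counts stay concentrated on the original backends, which is the symptom reported in this thread. Once the new list is applied, subsequent batches spread evenly across all replicas.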
Hi @jpkrohling, thanks for adding the flags! So according to RELEASE.md, this change will be released on 5 January 2022, right? |
Correct. If you want to test this change before that, I can tag and generate a container image based on the current main. |
Requirement - what kind of business use case are you trying to solve?
Are collectors load balanced?
Problem - what in Jaeger blocks you from solving the requirement?
We have our Jaeger tracing setup working with Elasticsearch configured as the back end. Currently we have a two-replica collector setup. There are 5-10 services that send traces to the collectors (the number of services keeps changing). I see that the collectors are not evenly loaded with traffic. One collector reaches its maximum queue usage, whereas the other is hardly using 20-30% of its capacity. This causes drops at the collector that is loaded to capacity.
Can we load balance the traffic (spans) between both collectors? I am not sure whether there is a config for this that I am missing.
Proposal - what do you suggest to solve the problem or improve the existing situation?
Any open questions to address