[Bug]: Panic of jaeger collector #3885

Closed
Etienne-Carriere opened this issue Aug 24, 2022 · 8 comments · Fixed by #3886
@Etienne-Carriere

What happened?

The jaeger-collector panics (see the log output below).

Steps to reproduce

  1. Version of jaeger-collector: {"gitCommit":"fdc3ae5d914daa5c097bf18afbfa62e1c0d4de97","gitVersion":"v1.37.0","buildDate":"2022-08-03T17:38:30Z"}. I also reproduced the bug on my previous version (1.26.0).
  2. Use a configuration with both Kafka with protobuf encoding (there is no issue with JSON encoding) and the --collector.tags option.
  3. The bug is random (or at least I cannot pinpoint exactly which span triggers it).
  4. At 1500 spans/s, the panic typically occurs in less than 30 minutes.

Expected behavior

No panic

Relevant log output

panic: runtime error: slice bounds out of range [-6:]
goroutine 123 [running]:
github.com/jaegertracing/jaeger/model.(*Span).MarshalToSizedBuffer(0xc000a2aff0, {0xc000a858c0, 0x29e, 0x29e})
github.com/jaegertracing/jaeger/model/model.pb.go:1246 +0x8c5
github.com/jaegertracing/jaeger/model.(*Span).XXX_Marshal(0xc000433620?, {0xc000a858c0, 0xc000a2aff0?, 0x29e}, 0x90?)
github.com/jaegertracing/jaeger/model/model.pb.go:377 +0x37
github.com/gogo/protobuf/proto.Marshal({0x18e13c0, 0xc000a2aff0})
github.com/gogo/[email protected]/proto/table_marshal.go:2940 +0x144
github.com/jaegertracing/jaeger/plugin/storage/kafka.(*protobufMarshaller).Marshal(0xc00160e2e0?, 0xc000619a30?)
github.com/jaegertracing/jaeger/plugin/storage/kafka/marshaller.go:39 +0x25
github.com/jaegertracing/jaeger/plugin/storage/kafka.(*SpanWriter).WriteSpan(0xc00073af00, {0xc0017d84b0?, 0x0?}, 0xc000a2aff0)
github.com/jaegertracing/jaeger/plugin/storage/kafka/writer.go:77 +0x39
github.com/jaegertracing/jaeger/storage/spanstore.(*CompositeWriter).WriteSpan(0x18e5698?, {0x18e5708, 0xc0017d84b0}, 0x18d3e10?)
github.com/jaegertracing/jaeger/storage/spanstore/composite.go:41 +0x11e
github.com/jaegertracing/jaeger/cmd/collector/app.(*spanProcessor).saveSpan(0xc00039a510, 0xc000a2aff0, {0x0, 0x0})
github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:160 +0xfe
github.com/jaegertracing/jaeger/cmd/collector/app.ChainedProcessSpan.func1(0xc0002aca80?, {0x0, 0x0})
github.com/jaegertracing/jaeger/cmd/collector/app/model_consumer.go:35 +0x4b
github.com/jaegertracing/jaeger/cmd/collector/app.(*spanProcessor).processItemFromQueue(0xc00039a510, 0xc0012f9c80)
github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:191 +0x57
github.com/jaegertracing/jaeger/cmd/collector/app.NewSpanProcessor.func1({0x133d440?, 0xc0012f9c80?})
github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:78 +0x36
github.com/jaegertracing/jaeger/pkg/queue.ConsumerFunc.Consume(0xc000619fa8?, {0x133d440?, 0xc0012f9c80?})
github.com/jaegertracing/jaeger/pkg/queue/bounded_queue.go:105 +0x2c
github.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumersWithFactory.func1()
github.com/jaegertracing/jaeger/pkg/queue/bounded_queue.go:84 +0xd6
created by github.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumersWithFactory
github.com/jaegertracing/jaeger/pkg/queue/bounded_queue.go:74 +0x53

Screenshot

No response

Additional context

  • The --collector.tags option is the key factor:
    • Without --collector.tags, it doesn't panic.
    • With --collector.tags, it panics.

I dug a bit into the code (by adding some debug output) and it seems to be linked to the typedTags.Sort() call, as the sort appears to introduce corruption in the data structure.

For testing, I recompiled without the typedTags.Sort() call and the panic no longer occurred.
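
For readers, here is a minimal sketch of the pattern I mean (simplified, not the exact Jaeger source; the function name is illustrative): the collector tags are appended to the span's Process.Tags and the slice is then sorted in place, which rewrites the backing array that the Process points to.

package sketch

import (
	"sort"

	"github.com/jaegertracing/jaeger/model"
)

// addCollectorTagsSketch is a simplified stand-in for the real code path:
// it mutates the Process the span points to, which matters if other spans
// share the same Process object.
func addCollectorTagsSketch(span *model.Span, collectorTags map[string]string) {
	for k, v := range collectorTags {
		// model.String builds a string-typed KeyValue from the Jaeger model package.
		span.Process.Tags = append(span.Process.Tags, model.String(k, v))
	}
	// The real code calls typedTags.Sort(); sort.Slice has the same effect of
	// reordering the Process.Tags backing array in place.
	sort.Slice(span.Process.Tags, func(i, j int) bool {
		return span.Process.Tags[i].Key < span.Process.Tags[j].Key
	})
}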

Jaeger backend version

1.37.0

SDK

No response

Pipeline

OTEL Shim => OTEL SDK => Jaeger Agent => Jaeger collector => Kafka (protobuf)

Storage backend

ElasticSearch 7 + Kafka (protobuf)

Operating system

Linux

Deployment model

binaries on KVM VMs

Deployment configs

export SPAN_STORAGE_TYPE=elasticsearch,kafka
export ES_USERNAME
export ES_PASSWORD
export KAFKA_PRODUCER_PLAINTEXT_USERNAME
export KAFKA_PRODUCER_PLAINTEXT_PASSWORD
/home/cloud-user/local/bin/jaeger-collector \
    --sampling.strategies-file=$HOME/local/etc/jaeger/sampling_strategies.json \
    --collector.queue-size=100000 \
    --es.server-urls=$ES_URLS \
    --es.num-replicas=1 \
    --es.bulk.workers=10 \
    --es.tags-as-fields.all \
    --kafka.producer.brokers $KAFKA_BROKERS \
    --kafka.producer.encoding protobuf \
    --kafka.producer.topic jaeger-spans-protobuf \
    --kafka.producer.authentication plaintext \
    --kafka.producer.plaintext.mechanism SCRAM-SHA-512 \
    --collector.tags="collector_name=$COLLECTOR_NAME,collector_az=$AZ,collector_region=$REGION"
@yurishkuro
Member

Smells like a race condition, but addCollectorTags is called from enqueueSpan, which is called from ProcessSpans, which is called once per batch.

However, there is definitely an inefficiency in how collector tags are applied: they are added to Process.Tags for each span, but the Process is typically identical across all spans in a batch.
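
A rough sketch of what that could look like (purely illustrative, not the actual change that ended up in #3886): apply the collector tags once per distinct Process pointer in a batch rather than once per span.

package sketch

import "github.com/jaegertracing/jaeger/model"

// addCollectorTagsPerBatch is an illustrative sketch only, not the Jaeger
// implementation: it deduplicates by Process pointer so a Process shared
// across a batch is only mutated once.
func addCollectorTagsPerBatch(spans []*model.Span, collectorTags map[string]string) {
	seen := make(map[*model.Process]bool)
	for _, span := range spans {
		if span.Process == nil || seen[span.Process] {
			continue
		}
		seen[span.Process] = true
		for k, v := range collectorTags {
			span.Process.Tags = append(span.Process.Tags, model.String(k, v))
		}
	}
}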

@Etienne-Carriere
Author

@yurishkuro, I agree it looks like a race condition, but I dug a bit into the code and I can't find/understand where the parallel operations happen (the write by addCollectorTags versus the read by the protobuf serializer).

Concerning the inefficiency, I will look into the code to see whether it is easy to solve, but no guarantee.

@Etienne-Carriere
Author

Do you know why we decided to sort the process tags only when the collector.tags option is activated? I tested by removing the sort call and I can no longer reproduce the issue.

@yurishkuro
Member

Are you able to reliably reproduce the issue? It would be good to find the actual issue - applying Sort is just a symptom, I think.

@yurishkuro
Member

I don't find/understand where there is parallel operation (write by addCollectorTags / read by protobuf serializer)

Actually, I think that is it: the call to addCollectorTags happens inside enqueueSpan, so one span might already be in the queue (and picked up for processing / serialization) while another is still going through addCollectorTags. If they both share the Process object, then append and Sort can make changes to the underlying array.
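
A minimal sketch of that interleaving (assuming two spans in the batch share one *model.Process; the function name and structure here are illustrative, not the actual collector code):

package sketch

import (
	"sort"

	"github.com/gogo/protobuf/proto"
	"github.com/jaegertracing/jaeger/model"
)

// raceSketch shows the suspected interleaving: spanA is already in the queue
// and being serialized by a consumer goroutine while collector tags are still
// being appended and sorted for spanB, which shares the same Process.
func raceSketch(spanA, spanB *model.Span, collectorTags map[string]string) {
	spanB.Process = spanA.Process // both spans point at the same Process

	go func() {
		// Queue consumer: marshals spanA while its Process.Tags may still be
		// mutated below, so precomputed sizes and offsets can disagree and
		// MarshalToSizedBuffer can index out of range.
		_, _ = proto.Marshal(spanA)
	}()

	// Enqueue path: appending and sorting rewrite the shared Process.Tags
	// backing array concurrently with the serialization above.
	for k, v := range collectorTags {
		spanB.Process.Tags = append(spanB.Process.Tags, model.String(k, v))
	}
	sort.Slice(spanB.Process.Tags, func(i, j int) bool {
		return spanB.Process.Tags[i].Key < spanB.Process.Tags[j].Key
	})
}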

Can you try with #3886 ?

@Etienne-Carriere
Author

@yurishkuro, thanks for the quick reaction! I will test your patch tomorrow morning. Regarding how reliably I can reproduce it:

  • 1.37.0 without collector.tags: no panic for months
  • 1.37.0 + collector.tags: many panics per hour
    => I will test your patch, and if there is no panic over the weekend, that was probably the issue

@yurishkuro
Member

👍

@Etienne-Carriere
Author

Etienne-Carriere commented Aug 26, 2022

I installed the patched version 5 hours ago and so far so good!
