Skip to content

tracing: fix Datadog span name#29932

Merged
zuercher merged 7 commits intoenvoyproxy:mainfrom
DataDog:david.goffredo/tracing-fix-datadog-span-name
Oct 12, 2023
Merged

tracing: fix Datadog span name#29932
zuercher merged 7 commits intoenvoyproxy:mainfrom
DataDog:david.goffredo/tracing-fix-datadog-span-name

Conversation

@dgoffredo
Copy link
Copy Markdown
Contributor

@dgoffredo dgoffredo commented Oct 3, 2023

Commit Message

tracing: fix Datadog span name

Additional Description

In a series of pull requests beginning last year, I modified Envoy's Datadog tracing extension to use a new library instead of OpenTracing.

Those changes included a bug.

Previously, a Datadog span produced by Envoy had its "operation name" hard-coded to "envoy.proxy". The operation_name parameter to the OpenTracing adapter's startSpan then effectively became the span's "resource name," because of how operation name and resource name were managed by the OpenTracing-based Datadog tracing library.

My changes instead set the span's operation name to the operation_name parameter of startSpan.

Those changes were released with envoy v1.26.0 v1.27.0.

The result is that upgrading Envoy to v1.26 v1.27 changed spans that used to be:

operation name: envoy.proxy
resource name: egress localhost:8080

into instead:

operation name: egress localhost:8080
resource name:

To remedy this, this pull request contains changes that restore the old behavior.

If these changes are accepted, I would appreciate if they would be included in the current minor release series as a bug fix patch.

Risk Level

low

Testing

See the modified unit test.

Also, I ran the sources/extensions/tracers/datadog/demo both with and without these changes. Verified that the produced span's "operation name" before these changes is not as desired. Verified that the produced span's "operation name" after these changes is as desired.

Desired: "Operation name" is "envoy.proxy", and "resource name" is the operation_name passed to startSpan.

Docs Changes

none

Release Notes

Recommended wording:

Fixed a bug in the Datadog tracer where Datadog's "operation name" would contain what should be in the "resource name."

Platform Specific Features

none

Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
Copy link
Copy Markdown
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update changeslogs/current.yaml with your release note. Otherwise this looks good.

@zuercher zuercher self-assigned this Oct 3, 2023
@zuercher zuercher added the backport/review Request to backport to stable releases label Oct 3, 2023
Copy link
Copy Markdown
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please add a test in test/extensions/tracers/datadog. I think you could probably check this in the existing ExtractionSuccess test in tracer_test.cc.

Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
@dgoffredo
Copy link
Copy Markdown
Contributor Author

Thanks, @zuercher, that was quick!

I'll update changeslogs/current.yaml like you said. My recent commit should have fixed the unit test.

Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
@dgoffredo
Copy link
Copy Markdown
Contributor Author

Ok, changelog updated.

Once CI accepts this (it might need a poke) and it's approved, I'd like to revisit the question of onto which release series this change will appear. I don't know the process for requesting a patch to the current release series.

Co-authored-by: Stephan Zuercher <zuercher@gmail.com>
Signed-off-by: David Goffredo <dmgoffredo@gmail.com>
zuercher
zuercher previously approved these changes Oct 4, 2023
Copy link
Copy Markdown
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RBE is having a sad, so CI is busted but otherwise this looks good.

@zuercher
Copy link
Copy Markdown
Member

zuercher commented Oct 6, 2023

/retest

@dgoffredo
Copy link
Copy Markdown
Contributor Author

I just pushed some changes that merge recent additions to changelogs/current.yaml with these. Looks like Envoy just had a release.

@dgoffredo
Copy link
Copy Markdown
Contributor Author

dgoffredo commented Oct 11, 2023

I just pushed some changes that merge yet more additions to changelogs/current.yaml with these.

@zuercher Thanks again for your review. How do we move this PR forward?

@zuercher
Copy link
Copy Markdown
Member

/retest

@zuercher
Copy link
Copy Markdown
Member

We need the tests to pass. I restarted the TSAN test and hopefully it will pass.

@zuercher zuercher merged commit c3646f9 into envoyproxy:main Oct 12, 2023
@florianmutter
Copy link
Copy Markdown

Hey, I'm not familiar with envoy releases. Can I see somewhere in which version this fix is or will be available? We wanted to update to Istio 1.19 but the change in span names does block us.

@phlax
Copy link
Copy Markdown
Member

phlax commented Oct 13, 2023

if you look at the commit you can see which branches its on

c3646f9

currently its only on main which it means it hasnt been released anywhere but will be a part of 1.28+

@florianmutter if you want to raise backports for other release branches (1.24-7) i would be happy to review - it would then be included in any backported branches when they release, which is probably quite soon

florianmutter pushed a commit to florianmutter/envoy that referenced this pull request Oct 13, 2023
Backport c3646f9 to 1.26

Additional testing:
Also, I ran the sources/extensions/tracers/datadog/demo both with and without these changes. Verified that the produced span's "operation name" before these changes is not as desired. Verified that the produced span's "operation name" after these changes is as desired.

Desired: "Operation name" is "envoy.proxy", and "resource name" is the operation_name passed to startSpan.

Risk Level: low
Testing: See the modified unit test.
Docs Changes: n/a
Release Notes: updated
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
@florianmutter
Copy link
Copy Markdown

I created the first backport PR for 1.26 here: #30182

florianmutter pushed a commit to florianmutter/envoy that referenced this pull request Oct 13, 2023
Backport c3646f9 to 1.27

Additional testing:
Also, I ran the sources/extensions/tracers/datadog/demo both with and without these changes. Verified that the produced span's "operation name" before these changes is not as desired. Verified that the produced span's "operation name" after these changes is as desired.

Desired: "Operation name" is "envoy.proxy", and "resource name" is the operation_name passed to startSpan.

Risk Level: low
Testing: See the modified unit test.
Docs Changes: n/a
Release Notes: updated
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
phlax pushed a commit that referenced this pull request Oct 13, 2023
Backport c3646f9

Additional testing:
Also, I ran the sources/extensions/tracers/datadog/demo both with and without these changes. Verified that the produced span's "operation name" before these changes is not as desired. Verified that the produced span's "operation name" after these changes is as desired.

Desired: "Operation name" is "envoy.proxy", and "resource name" is the operation_name passed to startSpan.

Risk Level: low
Testing: See the modified unit test.
Docs Changes: n/a
Release Notes: updated

Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
phlax pushed a commit that referenced this pull request Oct 13, 2023
Backport c3646f9

Additional testing:
Also, I ran the sources/extensions/tracers/datadog/demo both with and without these changes. Verified that the produced span's "operation name" before these changes is not as desired. Verified that the produced span's "operation name" after these changes is as desired.

Desired: "Operation name" is "envoy.proxy", and "resource name" is the operation_name passed to startSpan.

Risk Level: low
Testing: See the modified unit test.
Docs Changes: n/a
Release Notes: updated

Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
@hiddewie
Copy link
Copy Markdown
Contributor

I tested this fix with Envoy 1.27.2, as part of the resolution of #30235. The operation name has been fixed from ingress to envoy.proxy, but the resource name is still not taken from the tag resource.name like it was in Envoy versions 1.26.3 and earlier.

@dgoffredo
Copy link
Copy Markdown
Contributor Author

@hiddewie Yes, the handling of special tag names was removed in the new Datadog tracing library because it is no longer based on OpenTracing.

However, as far as I can see, "resource.name" is not part of OpenTracing's semantic conventions, and so removing it was a change in Datadog behavior.

I can either restore the special meaning of the "resource.name" tag as another patch, or find another solution.

Out of curiosity, what is your particular use case for setting the "resource.name"?

@hiddewie
Copy link
Copy Markdown
Contributor

OK, thanks, interesting.

Out of curiosity, what is your particular use case for setting the "resource.name"?

We set an internal header in the Envoy configuration responsible for routing the requests (there is Lua involved), to a functional resource name (basically sets of similar paths that have a functional name). This internal header is used in the custom_tags tracing configuration to set the resource.name tag. The resource name is used in the Datadog UI to provide summarized information about similar resources, like request count, latency information and error rate. This information per functional resource gives valuable insights into a subset of all proxied requests.

When there is only a single ingress resource present as part of the envoy.proxy operation, these summaries per functional resource name are not generated by Datadog.

@dgoffredo
Copy link
Copy Markdown
Contributor Author

We set an internal header in the Envoy configuration [...]

Then making "resource.name" work with custom_tags in Envoy is probably the best way to fix this. The operation name can be set to a literal elsewhere in the Envoy configuration, but that's just a literal. Adding a dedicated concept of "resource name" to Envoy is probably not worth it, especially since the term "resource" is used in OpenTelemetry and has a different meaning there than it does here.

I'll prioritize restoring the "resource.name" tag behavior as a bug fix.

@dgoffredo dgoffredo deleted the david.goffredo/tracing-fix-datadog-span-name branch October 24, 2023 17:53
dgoffredo added a commit to DataDog/envoy that referenced this pull request Oct 27, 2023
- If a trace sampling decision is extracted from request headers,
  use it regardless of Envoy's sampling configuration.
- Refactor Tracer::startSpan to avoid the move assignment operator of
  datadog::tracing::Span, which is deleted in newer versions of
  dd-trace-cpp.
- Alter the comments above Tracer::startSpan to more accurately reflect
  changes made previously in envoyproxy#29932.

Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
@florianmutter
Copy link
Copy Markdown

I'm confused. We did update to Istio 1.20.0. Looking at this we do now use envoy 1.28. Looking at

span_config.name = "envoy.proxy";
this should contain the changed span name but we still don't see envoy.proxy being used as span name. Here is a screenshot:
Bildschirmfoto 2023-11-23 um 11 08 23
The same trace with Istio 1.17.5 that includes envoy 1.25 shows the same trace like this:
Bildschirmfoto 2023-11-23 um 11 18 11
Do you have any idea what that could be?

@njegosrailic
Copy link
Copy Markdown

Hi,

I believe the final patch was merged last week for both envoy versions 1.27 and 1.28. These patches should be incorporated into the upcoming Envoy patch builds. Ideally, they will be released prior to the next Istio patch build to ensure their inclusion. I asked yesterday whether this issue has been resolved, but it looks like my question only contributed to further confusion.

My main concern is that DataDog tracing remains partially broken in both Istio versions v1.18 and v1.19. This can be visually identified.

@florianmutter
Copy link
Copy Markdown

@njegosrailic Thank you very much. I was not aware of this.

@dgoffredo
Copy link
Copy Markdown
Contributor Author

dgoffredo commented Nov 28, 2023

Thanks, @njegosrailic.

@florianmutter, it is surprising that the operation name of the span you screenshotted is not as expected.

One way to verify the versions of things is to look at the trace in all of its undecorated glory via the JSON API, i.e. https://<datadog website>.com/api/v1/trace/3585287257395120597. The last part of the path is the trace ID, which is displayed in the UI. The JSON dump of the trace (an array of spans) will contain the _dd.tracer_version tag, as well as a tag added by either Envoy or Istio telling the exact version of Envoy (I forget the name of the tag). This can be helpful for debugging.

In the case of the operation name bug, the precise Envoy version is what is important (the fix did not involve Datadog's core tracing library).

@njegosrailic
Copy link
Copy Markdown

njegosrailic commented Nov 29, 2023

Thanks, @dgoffredo.

Today, we tested the latest 1-19 Istio patch, v1.19.4, and here are the important differences based on our research and the additional information you requested. The above is a trace from Istio 1-19, while below is Istio 1-16.

  1. Trace name
<   "name": "test_app_target_edgegateway_http.test_app.svc.cluster.local_80",
---
>   "name": "envoy.proxy",
  1. Trace resource
<   "resource": "egress istio-test-app-target-edgegateway-http-c3.service.testzone.internal",
<   "resource_hash": "423a0107c5aec597",
---
>   "resource": "test-app-target-edgegateway-http.test-app.svc.cluster.local:80/*",
>   "resource_hash": "647e5ccae10c6628",
  1. Envoy version
<     "version": "envoy f9707e29aa0a36b1430b373d95e6c9abb5deca75/1.27.3-dev/Clean/RELEASE/BoringSSL",
---
>     "version": "envoy ea8f1b7c160f265a3fcfaab33446b8f5de98ca11/1.24.8-dev/Clean/RELEASE/BoringSSL",
  1. Datadog tracer version
<     "_dd.p.dm": "-1",
<     "_dd.tracer_version": "v0.1.8",
---
>     "_dd.tracer_version": "v1.2.1",

@dgoffredo
Copy link
Copy Markdown
Contributor Author

@njegosrailic That's very helpful.

<     "version": "envoy f9707e29aa0a36b1430b373d95e6c9abb5deca75/1.27.3-dev/Clean/RELEASE/BoringSSL",
---
>     "version": "envoy ea8f1b7c160f265a3fcfaab33446b8f5de98ca11/1.24.8-dev/Clean/RELEASE/BoringSSL",

New Istio's Envoy is on top, and old Istio's Envoy is on bottom.

The "version" names a git tree in istio/proxy, not in the upstream Envoy. The tree named by the output of the newer Istio,
f9707e29aa0a36b1430b373d95e6c9abb5deca75 is this. Its WORKSPACE file names the commit of upstream Envoy on which it's based, c658fab86a174fda07961e196edb97f9da9b5a50. That's the commit that fixes the sampling issue, backported onto Envoy's v1.27 release branch. Looking at the tree at that commit, I see that my "operation name" fix is also in that code.

So, the change to operation name shown elsewhere in your diff is unexpected.

Please convey this issue to Datadog Support. They will collect configuration information, and we can take it from there.

@dgoffredo
Copy link
Copy Markdown
Contributor Author

dgoffredo commented Nov 29, 2023

Ah, I found a clue.

Here is some code from the Envoy that is part of the newer Istio (from mine and Njegos's comments above):

span_->set_name(operation);

I subsequently modified the behavior of that function as part of the "resource.name" fix, but that modification is not yet available to Istio.

Perhaps the unexpected "operation name" is due not to span creation, but due to a subsequent call to Span::setOperation performed by Istio. Fixing that would require consuming the "resource.name" fix in addition to the "operation name" fix.

This I could confirm with more specifics about your Istio setup. Please proceed with Datadog Support, and share the new case number with me when you have it.

@florianmutter
Copy link
Copy Markdown

florianmutter commented Nov 30, 2023

@florianmutter, it is surprising that the operation name of the span you screenshotted is not as expected.

One way to verify the versions of things is to look at the trace in all of its undecorated glory via the JSON API, i.e. https://<datadog website>.com/api/v1/trace/3585287257395120597. The last part of the path is the trace ID, which is displayed in the UI. The JSON dump of the trace (an array or spans) will contain the _dd.tracer_version tag, as well as a tag added by either Envoy or Istio telling the exact version of Envoy (I forget the name of the tag). This can be helpful for debugging.

In the case of the operation name bug, the precise Envoy version is what is important (the fix did not involve Datadog's core tracing library).

I attached two spans of a similar trace. I removed all other spans.

The first span 6214739122545778382 is created by the istio gateway. Here the version is envoy 580bf9cc561a012d4eb14af48d496d6748307e59/1.28.1-dev/Clean/RELEASE/BoringSSL and the _dd.tracer_version is 0.1.8.

The second span 1767944568465430022 is created by the istio sidecar of a service. The version is taken from the DD_VERSION variable so we do not see an envoy version. _dd.tracer_version is also 0.1.8. The sidecar runs the same proxyv2 image as the gateway. I don't see a reason why it should use a different version of envoy. We have one version of istio deployed only.

Istio gateway span:

{
  "trace_id": "6214739122545778382",
  "span_id": "6214739122545778382",
  "parent_id": "0",
  "start": 1701244677.028113,
  "end": 1701244677.184877,
  "duration": 0.156763803,
  "type": "web",
  "service": "istio-ingressgateway-iap.istio-system",
  "name": "gateway.instore.svc.cluster.local_8080_api",
  "resource": "egress instore.internal.example.com",
  "resource_hash": "f17c52a6c152467c",
  "host_id": 563956795,
  "hostname": "gke-development-pool3-41f15120-m6ns.europe-west4-a.c.puc-d-ngshop-gke-0228.internal",
  "env": "development",
  "host_groups": [
    "env:development"
  ],
  "meta": {
    "istio.mesh_id": "cluster.local",
    "version": "envoy 580bf9cc561a012d4eb14af48d496d6748307e59/1.28.1-dev/Clean/RELEASE/BoringSSL",
    "http.protocol": "HTTP/2",
    "http.status_code": "200",
    "runtime-id": "afd3478a-2e0a-413b-a8d1-74ae1360d99c",
    "response_flags": "-",
    "http.useragent": "axios/1.6.2",
    "zone": "europe-west4",
    "_dd.p.dm": "-1",
    "process_id": "14",
    "http.url_details.host": "instore.internal.example.com",
    "language": "cpp",
    "istio.namespace": "istio-system",
    "http.method": "POST",
    "upstream_cluster.name": "outbound|8080||gateway.instore.svc.cluster.local",
    "peer.address": "10.239.100.136",
    "istio.canonical_service": "istio-ingressgateway-iap",
    "_dd.agent_hostname": "gke-development-pool3-41f15120-m6ns.europe-west4-a.c.puc-d-ngshop-gke-0228.internal",
    "_dd.agent_version": "7.49.0",
    "http.url_details.path": "/api/servicepoint/stock-info/click-and-reserve",
    "http.url": "https://instore.internal.example.com/api/servicepoint/stock-info/click-and-reserve?tenant=PUC_DE",
    "http.path_group": "/api/servicepoint/stock-info/click-and-reserve",
    "request_size": "139",
    "http.host": "instore.internal.example.com",
    "node_id": "router~10.239.66.6~istio-ingressgateway-iap-867b5b4c79-mzht6.istio-system~istio-system.svc.cluster.local",
    "_dd.agent_rare_sampler.enabled": "false",
    "istio.canonical_revision": "latest",
    "_dd.tracer_version": "v0.1.8",
    "component": "proxy",
    "http.url_details.scheme": "https",
    "downstream_cluster": "-",
    "guid:x-request-id": "52d096ff-0b0c-9df6-a33e-fbd156869bf4",
    "ddtags": "ingestion_reason:auto",
    "http.url_details.queryString.tenant": "PUC_DE"
  },
  "metrics": {
    "_dd.agent_errors_sampler.target_tps": 10.0,
    "_dd.agent_psr": 1.0,
    "_trace_root": 1.0,
    "response_size": 74.0,
    "_sampling_priority_v1": 1.0,
    "_top_level": 1.0,
    "_dd.agent_priority_sampler.target_tps": 10.0
  },
  "ingestion_reason": "auto",
  "metadata": {
    "sds_info": []
  },
  "children_ids": [
    "2310811855767820061"
  ]
}

Istio sidecar span

{
  "trace_id": "6214739122545778382",
  "span_id": "1767944568465430022",
  "parent_id": "5261554865300988373",
  "start": 1701244677.057804,
  "end": 1701244677.179114,
  "duration": 0.121309735,
  "type": "web",
  "service": "fulfill.order",
  "name": "order.fulfill.svc.cluster.local_50051",
  "resource": "ingress",
  "resource_hash": "9498ccf48d93a1bf",
  "host_id": 563547440,
  "hostname": "gke-development-pool3-cd9ee151-7dsz.europe-west4-c.c.puc-d-ngshop-gke-0228.internal",
  "env": "development",
  "host_groups": [
    "env:development"
  ],
  "meta": {
    "upstream_cluster.name": "inbound|50051||",
    "zone": "europe-west4",
    "http.status_code": "200",
    "downstream_cluster": "-",
    "http.useragent": "grpc-java-netty/1.59.0",
    "istio.mesh_id": "cluster.local",
    "http.url_details.host": "order.fulfill",
    "language": "cpp",
    "http.url_details.scheme": "http",
    "_dd.agent_version": "7.49.0",
    "_dd.tracer_version": "v0.1.8",
    "component": "proxy",
    "istio.namespace": "fulfill",
    "http.url_details.path": "/fulfill.order.OrderService/CreateNewRetailOrder",
    "grpc.authority": "order.fulfill:50051",
    "grpc.path": "/fulfill.order.OrderService/CreateNewRetailOrder",
    "version": "rev-d0eb43b93f93612ce20b31bb7ce4e45d83975145",
    "guid:x-request-id": "b10e2e1f-4d3b-939e-bd6a-974d35c31184",
    "istio.canonical_service": "order",
    "node_id": "sidecar~10.239.67.58~order-6cdf99b645-lb582.fulfill~fulfill.svc.cluster.local",
    "grpc.status_code": "0",
    "peer.address": "10.239.67.87",
    "grpc.content_type": "application/grpc",
    "http.protocol": "HTTP/2",
    "http.url": "http://order.fulfill:50051/fulfill.order.OrderService/CreateNewRetailOrder",
    "ddtags": "ingestion_reason:auto",
    "_dd.p.dm": "-1",
    "env": "development",
    "http.url_details.port": "50051",
    "runtime-id": "5193c159-e218-44f1-8750-34d1d454fcd9",
    "http.path_group": "/fulfill.order.OrderService/CreateNewRetailOrder",
    "response_flags": "-",
    "request_size": "347",
    "_dd.agent_hostname": "gke-development-pool3-cd9ee151-7dsz.europe-west4-c.c.puc-d-ngshop-gke-0228.internal",
    "istio.canonical_revision": "latest",
    "process_id": "23",
    "_dd.agent_rare_sampler.enabled": "false",
    "http.host": "order.fulfill",
    "http.method": "POST"
  },
  "metrics": {
    "_top_level": 1.0,
    "_dd.agent_priority_sampler.target_tps": 10.0,
    "response_size": 39.0,
    "_dd.agent_errors_sampler.target_tps": 10.0,
    "_sampling_priority_v1": 1.0
  },
  "ingestion_reason": "auto",
  "metadata": {
    "sds_info": []
  },
  "children_ids": [
    "6764890222525977566"
  ]
}

@phlax phlax removed the backport/review Request to backport to stable releases label Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants