
gRPC client + botocore + datadog exporter instrumentation error #262

Closed
alertedsnake opened this issue Dec 17, 2020 · 18 comments · Fixed by #368
Labels
bug Something isn't working

Comments

@alertedsnake
Contributor

This one is a bit complex - when using the gRPC client instrumentation to make a call to an instrumented gRPC service that then makes a call via boto to the DynamoDB API, the combination of instrumentation leads to this error:

2020-12-17 14:19:39,666 DEBUG    [botocore.httpsession:298] Exception received when sending urllib3 HTTP request
Traceback (most recent call last):
  File "/home/michael/work/myapp/.venv/lib/python3.8/site-packages/botocore/httpsession.py", line 254, in send
    urllib_response = conn.urlopen(
  File "/home/michael/work/myapp/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/home/michael/work/myapp/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/michael/work/myapp/.venv/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib64/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/michael/work/myapp/.venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 91, in _send_request
    rval = super(AWSConnection, self)._send_request(
  File "/usr/lib64/python3.8/http/client.py", line 1296, in _send_request
    self.putheader(hdr, value)
  File "/home/michael/work/myapp/.venv/lib/python3.8/site-packages/urllib3/connection.py", line 219, in putheader
    _HTTPConnection.putheader(self, header, *values)
  File "/usr/lib64/python3.8/http/client.py", line 1232, in putheader
    if _is_illegal_header_value(values[i]):
TypeError: expected string or bytes-like object

Without either client instrumentation, this call works perfectly.
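
For reference, the `TypeError` itself comes from CPython's `http/client.py`: `putheader()` latin-1-encodes str values, converts ints, and then runs a regex search over each value, so anything that is neither str nor bytes reaches the regex unchanged. A minimal, botocore-free sketch of that behavior (assuming CPython's `http.client` internals; the regex is copied here purely for illustration):

```python
# Minimal illustration (assumes CPython's http.client internals) of the
# TypeError above: a None header value reaches re.search() unchanged, which
# raises the same "expected string or bytes-like object" error.
import re

_is_illegal_header_value = re.compile(rb'\n(?![ \t])|\r(?![ \t\n])').search

print(_is_illegal_header_value(b"ok"))                   # None -> value is legal
print(_is_illegal_header_value("ok".encode("latin-1")))  # what putheader does to str values
_is_illegal_header_value(None)                           # TypeError: expected string or bytes-like object
```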

Steps to reproduce
As described above - I don't have a simplified sample, but I'll try to produce one.

What is the expected behavior?
No exception :)

What is the actual behavior?
The above exception.

Additional context

I'm definitely working on this, but any suggestions would certainly be welcome; I'm not quite sure what the invalid header is or where it might get set.

@alertedsnake alertedsnake added the bug Something isn't working label Dec 17, 2020
alertedsnake pushed a commit to alertedsnake/opentelemetry-python-contrib that referenced this issue Dec 18, 2020
Occasionally not all of the headers available to botocore are strings. botocore
does allow non-string values, but only when they are set via the `add_header`
method rather than `__setitem__`. According to the docs for
`botocore.compat.HTTPHeaders`:

```
If a parameter value contains non-ASCII characters it can be specified
as a three-tuple of (charset, language, value), in which case it will be
encoded according to RFC2231 rules.  Otherwise it will be encoded using
the utf-8 charset and a language of ''
```

Whereas using `__setitem__` does not do this conversion.

Fixes open-telemetry#262
@toumorokoshi
Member

Hi! I think some more information would be helpful here. Specifically:

  1. What is the value of `values[i]` in the last executed statement? It might be good to see precisely what type it is.
  2. What is the header key (the `header` variable being passed into `putheader`)? This would confirm whether the header field is indeed one that is added by opentelemetry (see the debugging sketch below for one way to capture both).
  3. What propagator are you using? E.g., is this the X-Ray propagator, the B3 propagator, or some other format? Each one injects different values into the headers, so that would be a huge hint as to what's happening here.
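
One low-effort way to capture 1 and 2 without editing anything under site-packages is to wrap `putheader` at startup and log its arguments before delegating; this is purely a debugging sketch, not part of any instrumentation:

```python
# Debugging sketch: log every header name, its values, and their types as they
# pass through http.client.HTTPConnection.putheader, then call the real method.
# Run this once at startup of the service that makes the DynamoDB call.
import http.client
import logging

_original_putheader = http.client.HTTPConnection.putheader

def _logging_putheader(self, header, *values):
    logging.warning("putheader header=%r values=%r types=%r",
                    header, values, [type(v).__name__ for v in values])
    return _original_putheader(self, header, *values)

http.client.HTTPConnection.putheader = _logging_putheader
```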

@alertedsnake
Contributor Author

Those are all things I was wondering too. All the tests I've tried so far ended up showing only strings here, yet an exception was still being thrown at some point, and just printing type(var) before the line I fixed in the associated PR never showed anything but a string.

I was mostly using the Datadog exporter and propagator, but I've also been testing with Jaeger and had the same result. My guess was that it's not the propagator but the gRPC client adding a bad header - I've been working on that (see #269) but don't see anything there that'd be strange.

Do you have any suggestions on how to get the information you're looking for? Absolutely happy to keep debugging, but I'm not sure where to start.

@toumorokoshi
Member

> Do you have any suggestions on how to get the information you're looking for? Absolutely happy to keep debugging, but I'm not sure where to start.

Can this be reproduced locally? If so, you can add a pdb statement or print statements to the code path in the traceback (usually in site-packages/...).

If this only reproduces when deployed somewhere, you could potentially add some additional log/print statements in your application code. In your instrumented gRPC service, before you call DynamoDB, you can do:

from opentelemetry import trace

# in your function
span = trace.get_current_span()
print(span.to_json())  # using this vs. printing the span itself, as that calls a more minimal __repr__ function

That will at least print all of the state of the span, which we can use to look through the propagator code to see how that would be converted to a header value.

> I was mostly using the Datadog exporter and propagator, but I've also been testing with Jaeger and had the same result. My guess was that it's not the propagator but the gRPC client adding a bad header - I've been working on that (see #269) but don't see anything there that'd be strange.

Got it. Taking a quick look through those, I don't see a code path that would lead to something invalid. I agree that maybe something funky is being introduced.

Curious: have you also tried this with opentelemetry disabled/removed? I'm wondering if this is an external issue. The traceback is fairly generic with regard to just adding headers.

@alertedsnake
Contributor Author

Oh I didn't think of printing the span out, let me try that. I'm running the two services locally, so I can hack whatever I need to add debugging. I did try a bunch of printing, but couldn't get anything useful before the exception was raised.

> Curious: have you also tried this with opentelemetry disabled/removed? I'm wondering if this is an external issue. The traceback is fairly generic with regard to just adding headers.

The actual code is stuff we use in production; I've just been trying to add opentelemetry to it, but I keep running into parts that aren't working well or are unimplemented (see my other PRs). So I know the code itself works without instrumentation, and as I said in the initial report, leaving off either the gRPC client instrumentation or the botocore instrumentation in the target server makes the bug go away. It's definitely a quirk of this specific combination - I'm still betting on the gRPC client somehow, but I'm just not seeing it yet.

Thanks for the suggestion - I'll keep digging!

@toumorokoshi
Member

toumorokoshi commented Dec 20, 2020

> I'm running the two services locally, so I can hack whatever I need to add debugging. I did try a bunch of printing, but couldn't get anything useful before the exception was raised.

Have you tried adding a pdb statement to this snippet, within a try/except?

  File "/usr/lib64/python3.8/http/client.py", line 1232, in putheader
    if _is_illegal_header_value(values[i]):

try:
    if _is_illegal_header_value(values[i]):
        ...  # existing body of the if-statement stays here
except TypeError:
    import pdb
    pdb.set_trace()

You should be able to extract all the information you need from that pdb session. prints work too, just a bit more time consuming.
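
If editing files under site-packages is awkward, another option is to catch the error at the call site and inspect it post-mortem. A sketch only; `call_dynamodb()` is a hypothetical stand-in for whatever code triggers the instrumented boto request:

```python
# Sketch: reproduce the failing request from a small driver script and drop
# into a post-mortem pdb session, then walk "up" to the putheader frame and
# print `header` and `values`. call_dynamodb() is a hypothetical placeholder.
import pdb
import sys

try:
    call_dynamodb()
except TypeError:
    pdb.post_mortem(sys.exc_info()[2])
```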

@alertedsnake
Contributor Author

So here I am, going back to work on this issue... and I'm having trouble recreating it. I'm working with a branch pulled off master, with my other open PRs (#260, #261, #269) containing gRPC fixes merged in, and thus far I haven't seen it. So maybe it was because I had a bit of a convoluted branch in order to keep making progress while PRs were still open.

I'll do some more testing just to be sure.

@toumorokoshi
Member

Sounds good! Feel free to update whenever you have something (or not).

@alertedsnake
Contributor Author

I figured out the root cause - thanks @toumorokoshi - that pdb clue was what I needed (I rarely use it). I'm using the Datadog exporter and the headers are coming from its propagator. So that's where I'm investigating now - I'll close the PR though, it's clearly not the answer.

@alertedsnake
Contributor Author

Yeah that's the best I've got - the span looks like this:

{
    "name": "blah",
    "context": {
        "trace_id": "some id",
        "span_id": "some id",
        "trace_state": "{'_dd_origin': None}"
    },
    (other normal-lookin' stuff)
}

If I skip setting that value on the propagator's carrier, everything is fine. Which probably makes reasonable sense here - it's clearly None, so there's no reason to set it. I'll open a PR to do this; it's a one-line change.
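
The change I have in mind is roughly the following (a sketch only, not the actual diff; `span_context` and `carrier` stand in for the propagator's inject arguments, and the header name is assumed from Datadog's propagation format):

```python
# Sketch of the one-line guard: only write the origin header when the trace
# state actually carries a value for the Datadog origin key. Names here are
# illustrative; the real code lives in the datadog exporter's propagator.
from opentelemetry.exporter.datadog import constants

origin = span_context.trace_state.get(constants.DD_ORIGIN)
if origin is not None:  # it is None in the gRPC -> gRPC -> DynamoDB case above
    carrier["x-datadog-origin"] = origin  # header name assumed; skipped entirely when origin is absent
```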

alertedsnake pushed a commit to alertedsnake/opentelemetry-python-contrib that referenced this issue Dec 22, 2020
This is often `None`, but tags are always strings, and so things get
broken when spans get passed along to other client calls.

Fixes open-telemetry#262
@alertedsnake
Contributor Author

Been digging a lot. I think I've narrowed down this issue, but I'm not entirely sure what the intent is.
It's here:
https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/exporter/opentelemetry-exporter-datadog/src/opentelemetry/exporter/datadog/propagator.py#L56

My test case (which duplicates how I use this in production) is all gRPC-based: basically a client that makes a call to a service, which then makes its own call to another service. That second call carries trace info from the first, and that's where the issue happens. The value of `origin` on this line is None.

It seems this code comes from this PR: open-telemetry/opentelemetry-python#705, and there was much discussion about it. Can someone explain what this is meant to do?

A better question though might be what we should do on line 73 when this is None, since it triggers this error message:

WARNING:opentelemetry.trace.span:Invalid key/value pair (dd_origin, ) found.

I believe this is the root of the issue I've been having here. Thoughts?

@alertedsnake alertedsnake changed the title gRPC client + botocore instrumentation error gRPC client + botocore + datadog exporter instrumentation error Feb 16, 2021
@alertedsnake
Contributor Author

Ah, one more thing I should point out - the "another service" I mentioned earlier does a DynamoDB get operation and is using the botocore instrumentation. I just realized I should probably see if there's anything weird about that component; perhaps something isn't being set right there instead.

@toumorokoshi
Member

Thanks for the digging! I think you did find an issue.

> A better question though might be what we should do on line 73 when this is None, since it triggers this error message:

I believe that really is the crux here: we just need an if condition to add that field.

TraceState should be a Dict[str, str], so it's a logic mistake to include a None value in the trace state.

I took a look through our AWS instrumentations and I don't see any logic that would imply TraceState gets serialized by boto anyway, but regardless, the Datadog propagator should be fixed.

The best thing would be to get a locals() dump of a pdb statement right before:

  File "/home/michael/work/myapp/.venv/lib/python3.8/site-packages/urllib3/connection.py", line 219, in putheader
    _HTTPConnection.putheader(self, header, *values)

There's something being added to the header there.

@alertedsnake
Contributor Author

alertedsnake commented Feb 23, 2021

Not sure this is very useful:

locals: {'self': <botocore.awsrequest.AWSHTTPConnection object at 0x7f4b79cec5b0>, 'header': 'Host', 'values': ('localhost:8000',)}
locals: {'self': <botocore.awsrequest.AWSHTTPConnection object at 0x7f4b79cec5b0>, 'header': 'Accept-Encoding', 'values': ('identity',)}
locals: {'self': <botocore.awsrequest.AWSHTTPConnection object at 0x7f4b79cec5b0>, 'header': 'X-Amz-Target', 'values': (b'DynamoDB_20120810.GetItem',)}
locals: {'self': <botocore.awsrequest.AWSHTTPConnection object at 0x7f4b79cec5b0>, 'header': 'Content-Type', 'values': (b'application/x-amz-json-1.0',)}
locals: {'self': <botocore.awsrequest.AWSHTTPConnection object at 0x7f4b79cec5b0>, 'header': 'User-Agent', 'values': (b'Botocore/1.20.7 Python/3.9.1 Linux/5.10.16-200.fc33.x86_64',)}
locals: {'self': <botocore.awsrequest.AWSHTTPConnection object at 0x7f4b79cec5b0>, 'header': 'X-Amz-Date', 'values': (b'20210223T015205Z',)}
locals: {'self': <botocore.awsrequest.AWSHTTPConnection object at 0x7f4b79cec5b0>, 'header': 'Authorization', 'values': (b'AWS4-HMAC-SHA256 Credential=AKIAredacted/20210223/us-east-1/dynamodb/aws4_request, SignedHeaders=content-type;host;x-amz-date;x-amz-target, Signature=redacted',)}
locals: {'self': <botocore.awsrequest.AWSHTTPConnection object at 0x7f4b79cec5b0>, 'header': 'Content-Length', 'values': ('114',)}

@toumorokoshi
Member

toumorokoshi commented Mar 2, 2021

> Not sure this is very useful:

Looking through those, it's unlikely that any of those variables caused the exception that you're encountering.

Note the traceback:

  File "/home/michael/work/myapp/.venv/lib/python3.8/site-packages/urllib3/connection.py", line 219, in putheader
    _HTTPConnection.putheader(self, header, *values)
  File "/usr/lib64/python3.8/http/client.py", line 1232, in putheader
    if _is_illegal_header_value(values[i]):
TypeError: expected string or bytes-like object

Something in the "values" list is neither a string, nor a bytes object.

I would just throw a try/except around that line in urllib3/connection.py to print the variables, or add a pdb statement so you can look:

try:
    _HTTPConnection.putheader(self, header, *values)
except TypeError:
    print(locals())
    print(header)
    print(values)

@alertedsnake
Contributor Author

OK, let me catch you up to where things stand now:

At this point, with the latest main code from both repos, there's no exception thrown, but instead there's just a warning message which I described in this previous message: #262 (comment)

WARNING:opentelemetry.trace.span:Invalid key/value pair (dd_origin, ) found.

The code which causes this is from this PR: open-telemetry/opentelemetry-python#705 .

I'm pretty sure the value being set there is actually None, but as I said in an earlier message, I don't know the point of the code in #705, so I'm not sure whether this is okay or not - it seems not, since I wouldn't expect a warning like this to be logged by the library code.

I'm going to close the PR I made for this, since it doesn't fix anything at this point, but there IS still a bug here - I'm just not sure where it is yet. Perhaps @majorgreys might know more?

@toumorokoshi
Member

Hey @alertedsnake! I've filed a PR for the Datadog issue. It isn't what's causing the problem you're seeing with boto, though: TraceState will not accept a None value, so the net result is just that constants.DD_ORIGIN is not in the TraceState.
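
Roughly, the shape of that change is (a sketch, not the actual PR diff; `origin` stands in for whatever the propagator pulled out of the incoming carrier):

```python
# Sketch of the guard on the extract side: dd_origin is only added to the trace
# state when the incoming request actually carried an origin value, since
# TraceState entries must be str -> str.
entries = []
if origin is not None:
    entries.append((constants.DD_ORIGIN, origin))
# `entries` is then used to build the TraceState of the extracted SpanContext.
```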

For understanding the boto error, I'd refer to my suggestion above.

@alertedsnake
Contributor Author

That looks a lot like the PR for this which I closed because it didn't fix the problem at the source; it just prevented it from propagating:

#272

If you guys think this is the best way to fix it, I guess that works for me.
