
Failed to send traces to Datadog Agent - Time out #1413

Closed · weisurya opened this issue May 7, 2020 · 9 comments · Fixed by #2459

Comments

@weisurya

weisurya commented May 7, 2020

Which version of dd-trace-py are you using?

0.37.0

Which version of the libraries are you using?

aiohttp==3.6.2
amqp==2.5.0
aniso8601==4.1.0
astroid==2.2.5
async-timeout==3.0.1
atomicwrites==1.3.0
attrs==18.2.0
billiard==3.6.0.0
blinker==1.4
boto3==1.9.195
botocore==1.12.195
certifi==2019.3.9
cfgv==3.1.0
chardet==3.0.4
Click==7.0
codacy-coverage==1.3.11
colorama==0.4.1
coverage==4.5.2
ddtrace==0.37.0
docutils==0.14
filelock==3.0.12
Flask==1.0.2
Flask-OpenTracing==0.2.0
Flask-RESTful==0.3.7
funcsigs==1.0.2
gunicorn==19.9.0
identify==1.4.15
idna==2.8
importlib-metadata==1.6.0
intervaltree==3.0.2
isort==4.3.21
itsdangerous==1.1.0
jaeger-client==3.13.0
Jinja2==2.10
jmespath==0.9.4
joblib==0.13.1
jsonschema==3.0.1
kombu==4.6.3
lazy-object-proxy==1.4.1
logdna==1.2.8
MarkupSafe==1.1.0
mccabe==0.6.1
more-itertools==5.0.0
msgpack==1.0.0
multidict==4.7.5
nodeenv==1.3.5
numpy==1.16.1
opentracing==1.3.0
pamqp==2.3.0
pandas==0.24.2
pep8==1.7.1
pika==1.0.1
pluggy==0.8.1
pre-commit==2.3.0
protobuf==3.10.0
psutil==5.6.3
psycopg2==2.8.4
py==1.7.0
pycodestyle==2.5.0
pylint==2.3.1
pyrsistent==0.15.3
pytest==4.2.0
pytest-cov==2.6.1
python-dateutil==2.8.0
python-dotenv==0.10.3
pytz==2018.9
PyYAML==5.1.2
rabbitpy==2.0.1
raven==6.10.0
ray==0.7.5
redis==3.3.11
requests==2.21.0
s3transfer==0.2.1
scikit-learn==0.20.3
scipy==1.2.1
sentry-sdk==0.12.0
setproctitle==1.1.10
six==1.12.0
slackclient==2.5.0
sortedcontainers==2.1.0
threadloop==1.0.2
thrift==0.11.0
toml==0.10.0
tornado==4.5.3
typed-ast==1.4.0
urllib3==1.24.1
vine==1.3.0
virtualenv==16.4.3
websocket-client==0.54.0
Werkzeug==0.14.1
wrapt==1.11.2
yarl==1.4.2
zipp==3.1.0


How can we reproduce your problem?

Here is how I start the project with ddtrace:
ddtrace-run gunicorn index:app

and in the system environment I customized these variables:

DATADOG_SERVICE_NAME=<custom service name>
DATADOG_TRACE_AGENT_HOSTNAME=<dedicated hostname>
DATADOG_TRACE_AGENT_PORT=<port number>
DATADOG_ENV=<name of environment>
DATADOG_TRACE_ENABLED=true
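
For reference, a minimal sketch (placeholder values; assumes the ddtrace 0.37.x public API) of setting the same agent options programmatically instead of through environment variables:

```python
# Rough programmatic equivalent of the DATADOG_* environment variables above.
# Placeholder values; in practice ddtrace-run plus env vars is what is used.
from ddtrace import tracer

tracer.configure(
    hostname="dd-agent.internal",  # DATADOG_TRACE_AGENT_HOSTNAME
    port=8126,                     # DATADOG_TRACE_AGENT_PORT
)
# Service name and environment are left to DATADOG_SERVICE_NAME / DATADOG_ENV here.
```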

What is the result that you get?

Failed to send traces to Datadog Agent at <ddtrace.api.API object at 0x7f20a1e96940>: timeout('timed out',)

The interval between one event and the next is quite short, on the order of seconds.

What is the result that you expected?

I expected it to publish events normally, like my other services that use the Go and Node libraries. All of them use the same configuration, so I expected the same behavior.

This issue has been occurring since about two months ago, when I was still using version 0.28.0.

@weisurya (Author)

weisurya commented May 8, 2020

Maybe it's because of this hardcoded timeout limit:

https://github.com/DataDog/dd-trace-py/blob/v0.37.0/ddtrace/api.py#L127
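
For illustration only (standard-library code, not ddtrace internals): a 2-second socket timeout on an HTTP request to the agent surfaces as exactly the socket.timeout('timed out') shown in the logged error.

```python
# Standalone illustration of how a 2-second socket timeout shows up as
# socket.timeout('timed out') when talking to the agent's trace endpoint.
import socket
from http.client import HTTPConnection

conn = HTTPConnection("localhost", 8126, timeout=2)  # 2s, like the hardcoded limit
try:
    # b"\x90" is a msgpack-encoded empty list, i.e. "no traces".
    conn.request("PUT", "/v0.4/traces", body=b"\x90",
                 headers={"Content-Type": "application/msgpack"})
    response = conn.getresponse()
    print(response.status, response.read())
except socket.timeout as exc:
    print("Failed to send traces to Datadog Agent: %r" % exc)
finally:
    conn.close()
```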

@Kyle-Verhoog (Member)

Hi @weisurya, sorry for the delay here.

Thanks for providing your setup.

The tracer attempts to send traces every second, which is probably why you're seeing this message at that interval.

Are any traces coming through at all?

I suspect it's not related to the timeout limit. If the requests are taking longer than 2 seconds to send, then there's probably a networking issue. Is there something unique about the way you deploy your Python app vs the Go or Node apps? Could you provide a little more insight into how you're deploying your app?
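
A quick way to rule out basic connectivity or latency problems from inside the application container (a standalone sketch, not ddtrace code):

```python
# Standalone sketch: time a TCP round trip to the agent from inside the app
# container to see whether plain connectivity is already slow or failing.
import os
import socket
import time

host = os.getenv("DATADOG_TRACE_AGENT_HOSTNAME", "localhost")
port = int(os.getenv("DATADOG_TRACE_AGENT_PORT", "8126"))

start = time.time()
try:
    with socket.create_connection((host, port), timeout=2):
        print("connected to %s:%d in %.3fs" % (host, port, time.time() - start))
except (socket.timeout, OSError) as exc:
    print("could not reach agent at %s:%d: %r" % (host, port, exc))
```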

@weisurya (Author)

@Kyle-Verhoog hey apologies for my late reply, and thank you for the follow-up.

The way I implement Datadog APM in Python is similar to the Go & Node.js projects. I use:

  • custom hostname
  • custom port
  • custom service name
  • custom environment

Besides that, I just use the default configuration from each library.

Yes, I can see some traces on the dashboard, but I also see error reports about this timeout in my logs.

@weisurya (Author)

All of them use the same deployment setup: Docker as the baseline, deployed on AWS Elastic Beanstalk.

Specifically for Python, I use the ddtrace-run CLI command at startup to initialize tracing.
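
For context, a minimal sketch of roughly what ddtrace-run arranges when used this way: supported libraries are patched before the app module is imported (assumes the standard ddtrace patch_all API; the index.py layout is a placeholder):

```python
# index.py (sketch): manual alternative to `ddtrace-run gunicorn index:app`.
# patch_all() instruments supported libraries before Flask is imported,
# which is roughly what ddtrace-run does at interpreter startup.
from ddtrace import patch_all

patch_all()

from flask import Flask  # imported after patching so the integration applies

app = Flask(__name__)

@app.route("/health")
def health():
    return "ok"
```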

@ginni-gidwani

Any updates on this? We are also seeing the same error messages in our logs with version 0.39.0.

@KonstantinSchubert

KonstantinSchubert commented Sep 2, 2020

@Kyle-Verhoog We are seeing these errors intermittently.

  1. Are these send failures fatal, in the sense that the data gets lost and is not re-transmitted?

  2. Does this failure crash the Python server process that is handling the request being traced?

@Kyle-Verhoog (Member)

Hi all,

We're aware that this occurs but haven't gotten to a root cause yet, due to how randomly it seems to happen. Our speculation so far is that the agent becomes overloaded and is unable to handle the request. We're looking to address it on our end by introducing retry logic for sending.

@KonstantinSchubert:

Are these send failures fatal as in that data gets lost and does not get re-transmitted?

Correct; currently there is no retry logic.

Does this failure crash the python server process that is handling the request which is being traced?

No, nothing is crashing. The exception occurs, is caught, and is finally logged in the worker thread that ddtrace spawns.
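
If the repeated log lines are noisy in the meantime, one option is a plain logging filter on the root handlers that drops (but counts) these messages; a sketch, assuming the message text matches the error reported above:

```python
# Sketch: suppress the "Failed to send traces" records from output while
# keeping a counter, using only the standard logging module. The matched
# substring is an assumption based on the error text shown in this issue.
import logging

class FailedSendFilter(logging.Filter):
    def __init__(self):
        super().__init__()
        self.count = 0

    def filter(self, record):
        if "Failed to send traces" in record.getMessage():
            self.count += 1
            return False  # suppress the record
        return True

logging.basicConfig()  # make sure a root handler exists
failed_sends = FailedSendFilter()
for handler in logging.getLogger().handlers:
    handler.addFilter(failed_sends)  # handler filters also see propagated records
```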

@bhardin

bhardin commented Dec 7, 2020

Any follow up on this?

@maurits-funda

We are seeing these errors too, while using version 0.45.0.

P403n1x87 added a commit to P403n1x87/dd-trace-py that referenced this issue on May 20, 2021
This change introduces a Fibonacci retry policy (with jitter) to the
agent writer to mitigate networking issues (e.g. timeouts, broken pipes,
...), similar to what the profiler does already.

Resolves DataDog#1413.
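
For readers curious what such a policy looks like, here is an illustrative sketch only (not the actual dd-trace-py code) of a Fibonacci backoff with jitter wrapped around a send call:

```python
# Illustrative sketch of Fibonacci backoff with jitter, not the actual
# dd-trace-py implementation introduced by the referenced commit.
import random
import socket
import time

def fibonacci_delays(attempts):
    a, b = 1, 1
    for _ in range(attempts):
        yield a
        a, b = b, a + b

def send_with_retry(send, attempts=5):
    last_exc = None
    for base_delay in fibonacci_delays(attempts):
        try:
            return send()
        except (socket.timeout, OSError) as exc:
            last_exc = exc
            time.sleep(base_delay * random.uniform(0.5, 1.5))  # jittered wait
    raise last_exc
```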
mergify bot closed this as completed in #2459 on May 25, 2021
mergify bot added a commit that referenced this issue on May 25, 2021
This change introduces a Fibonacci retry policy (with jitter) to the
agent writer to mitigate networking issues (e.g. timeouts, broken pipes,
...), similar to what the profiler does already.

Resolves #1413.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>