
Failed to send traces to Datadog Agent - Time out #1413

Closed · weisurya opened this issue May 7, 2020 · 9 comments · Fixed by #2459

Comments

@weisurya

weisurya commented May 7, 2020

Which version of dd-trace-py are you using?

0.37.0

Which version of the libraries are you using?

aiohttp==3.6.2
amqp==2.5.0
aniso8601==4.1.0
astroid==2.2.5
async-timeout==3.0.1
atomicwrites==1.3.0
attrs==18.2.0
billiard==3.6.0.0
blinker==1.4
boto3==1.9.195
botocore==1.12.195
certifi==2019.3.9
cfgv==3.1.0
chardet==3.0.4
Click==7.0
codacy-coverage==1.3.11
colorama==0.4.1
coverage==4.5.2
ddtrace==0.37.0
docutils==0.14
filelock==3.0.12
Flask==1.0.2
Flask-OpenTracing==0.2.0
Flask-RESTful==0.3.7
funcsigs==1.0.2
gunicorn==19.9.0
identify==1.4.15
idna==2.8
importlib-metadata==1.6.0
intervaltree==3.0.2
isort==4.3.21
itsdangerous==1.1.0
jaeger-client==3.13.0
Jinja2==2.10
jmespath==0.9.4
joblib==0.13.1
jsonschema==3.0.1
kombu==4.6.3
lazy-object-proxy==1.4.1
logdna==1.2.8
MarkupSafe==1.1.0
mccabe==0.6.1
more-itertools==5.0.0
msgpack==1.0.0
multidict==4.7.5
nodeenv==1.3.5
numpy==1.16.1
opentracing==1.3.0
pamqp==2.3.0
pandas==0.24.2
pep8==1.7.1
pika==1.0.1
pluggy==0.8.1
pre-commit==2.3.0
protobuf==3.10.0
psutil==5.6.3
psycopg2==2.8.4
py==1.7.0
pycodestyle==2.5.0
pylint==2.3.1
pyrsistent==0.15.3
pytest==4.2.0
pytest-cov==2.6.1
python-dateutil==2.8.0
python-dotenv==0.10.3
pytz==2018.9
PyYAML==5.1.2
rabbitpy==2.0.1
raven==6.10.0
ray==0.7.5
redis==3.3.11
requests==2.21.0
s3transfer==0.2.1
scikit-learn==0.20.3
scipy==1.2.1
sentry-sdk==0.12.0
setproctitle==1.1.10
six==1.12.0
slackclient==2.5.0
sortedcontainers==2.1.0
threadloop==1.0.2
thrift==0.11.0
toml==0.10.0
tornado==4.5.3
typed-ast==1.4.0
urllib3==1.24.1
vine==1.3.0
virtualenv==16.4.3
websocket-client==0.54.0
Werkzeug==0.14.1
wrapt==1.11.2
yarl==1.4.2
zipp==3.1.0


How can we reproduce your problem?

Here is how I start the project with ddtrace:
ddtrace-run gunicorn index:app

and in the system environment I customized these variables:

DATADOG_SERVICE_NAME=<custom service name>
DATADOG_TRACE_AGENT_HOSTNAME=<dedicated hostname>
DATADOG_TRACE_AGENT_PORT=<port number>
DATADOG_ENV=<name of environment>
DATADOG_TRACE_ENABLED=true
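
For reference, a minimal sketch (placeholder values; assumes the ddtrace 0.37.x public API) of setting the same agent options programmatically instead of through environment variables:

```python
# Rough programmatic equivalent of the DATADOG_* environment variables above.
# Placeholder values; in practice ddtrace-run plus env vars is what is used.
from ddtrace import tracer

tracer.configure(
    hostname="dd-agent.internal",  # DATADOG_TRACE_AGENT_HOSTNAME
    port=8126,                     # DATADOG_TRACE_AGENT_PORT
)
# Service name and environment are left to DATADOG_SERVICE_NAME / DATADOG_ENV here.
```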

What is the result that you get?

Failed to send traces to Datadog Agent at <ddtrace.api.API object at 0x7f20a1e96940>: timeout('timed out',)

The interval between one event and the next is quite short, on the order of seconds.

What is the result that you expected?

I expected it to publish events normally, like my other services that use the Go and Node libraries. All of them use the same configuration, so I expected the same behavior.

This issue has been occurring since about two months ago, when I was still using version 0.28.0.

@weisurya (Author)

weisurya commented May 8, 2020

Maybe it's because of this hardcoded timeout limit:

https://github.com/DataDog/dd-trace-py/blob/v0.37.0/ddtrace/api.py#L127
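
For illustration only (standard-library code, not ddtrace internals): a 2-second socket timeout on an HTTP request to the agent surfaces as exactly the socket.timeout('timed out') shown in the logged error.

```python
# Standalone illustration of how a 2-second socket timeout shows up as
# socket.timeout('timed out') when talking to the agent's trace endpoint.
import socket
from http.client import HTTPConnection

conn = HTTPConnection("localhost", 8126, timeout=2)  # 2s, like the hardcoded limit
try:
    # b"\x90" is a msgpack-encoded empty list, i.e. "no traces".
    conn.request("PUT", "/v0.4/traces", body=b"\x90",
                 headers={"Content-Type": "application/msgpack"})
    response = conn.getresponse()
    print(response.status, response.read())
except socket.timeout as exc:
    print("Failed to send traces to Datadog Agent: %r" % exc)
finally:
    conn.close()
```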

@Kyle-Verhoog (Member)

Hi @weisurya, sorry for the delay here.

Thanks for providing your setup.

The tracer attempts to send traces every second, which is probably why you're seeing this message at that interval.

Are any traces coming through at all?

I suspect it's not related to the timeout limit. If the requests are taking longer than 2 seconds to send, then there's probably a networking issue. Is there something unique about the way you deploy your Python app vs the Go or Node apps? Could you provide a little more insight into how you're deploying your app?
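
A quick way to rule out basic connectivity or latency problems from inside the application container (a standalone sketch, not ddtrace code):

```python
# Standalone sketch: time a TCP round trip to the agent from inside the app
# container to see whether plain connectivity is already slow or failing.
import os
import socket
import time

host = os.getenv("DATADOG_TRACE_AGENT_HOSTNAME", "localhost")
port = int(os.getenv("DATADOG_TRACE_AGENT_PORT", "8126"))

start = time.time()
try:
    with socket.create_connection((host, port), timeout=2):
        print("connected to %s:%d in %.3fs" % (host, port, time.time() - start))
except (socket.timeout, OSError) as exc:
    print("could not reach agent at %s:%d: %r" % (host, port, exc))
```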

@weisurya (Author)

@Kyle-Verhoog hey apologies for my late reply, and thank you for the follow-up.

The way I implement Datadog APM in Python is similar to the Go & Node.js projects. I use:

  • custom hostname
  • custom port
  • custom service name
  • custom environment

Besides that, I just use the default configuration from each library.

Yes, I can see some traces on the dashboard, but I also see error reports about this timeout in my logs.

@weisurya (Author)

All of them use the same deployment setup: Docker as the baseline, deployed on AWS Elastic Beanstalk.

Specifically for Python, I use the ddtrace-run CLI command at startup to initialize tracing.
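
For context, a minimal sketch of roughly what ddtrace-run arranges when used this way: supported libraries are patched before the app module is imported (assumes the standard ddtrace patch_all API; the index.py layout is a placeholder):

```python
# index.py (sketch): manual alternative to `ddtrace-run gunicorn index:app`.
# patch_all() instruments supported libraries before Flask is imported,
# which is roughly what ddtrace-run does at interpreter startup.
from ddtrace import patch_all

patch_all()

from flask import Flask  # imported after patching so the integration applies

app = Flask(__name__)

@app.route("/health")
def health():
    return "ok"
```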

@ginni-gidwani

Any updates on this? We are also seeing the same error messages in our logs with version 0.39.0.

@KonstantinSchubert

KonstantinSchubert commented Sep 2, 2020

@Kyle-Verhoog We are seeing these errors intermittently.

  1. Are these send failures fatal, in the sense that the data gets lost and is not re-transmitted?

  2. Does this failure crash the Python server process that is handling the request being traced?

@Kyle-Verhoog (Member)

Hi all,

We're aware that this occurs but haven't gotten to a root cause yet, due to how randomly it seems to happen. Our speculation so far is that the agent becomes overloaded and is unable to handle the request. We're looking to address it on our end by introducing retry logic for sending.

@KonstantinSchubert:

Are these send failures fatal as in that data gets lost and does not get re-transmitted?

Correct; currently there is no retry logic.

Does this failure crash the python server process that is handling the request which is being traced?

No, nothing is crashing. The exception occurs, is caught, and is finally logged in the worker thread that ddtrace spawns.
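
If the repeated log lines are noisy in the meantime, one option is a plain logging filter on the root handlers that drops (but counts) these messages; a sketch, assuming the message text matches the error reported above:

```python
# Sketch: suppress the "Failed to send traces" records from output while
# keeping a counter, using only the standard logging module. The matched
# substring is an assumption based on the error text shown in this issue.
import logging

class FailedSendFilter(logging.Filter):
    def __init__(self):
        super().__init__()
        self.count = 0

    def filter(self, record):
        if "Failed to send traces" in record.getMessage():
            self.count += 1
            return False  # suppress the record
        return True

logging.basicConfig()  # make sure a root handler exists
failed_sends = FailedSendFilter()
for handler in logging.getLogger().handlers:
    handler.addFilter(failed_sends)  # handler filters also see propagated records
```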

@bhardin

bhardin commented Dec 7, 2020

Any follow up on this?

@maurits-funda

We are seeing these errors too, while using version 0.45.0.

P403n1x87 added a commit to P403n1x87/dd-trace-py that referenced this issue on May 20, 2021
This change introduces a Fibonacci retry policy (with jitter) to the
agent writer to mitigate networking issues (e.g. timeouts, broken pipes,
...), similar to what the profiler does already.

Resolves DataDog#1413.
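
For readers curious what such a policy looks like, here is an illustrative sketch only (not the actual dd-trace-py code) of a Fibonacci backoff with jitter wrapped around a send call:

```python
# Illustrative sketch of Fibonacci backoff with jitter, not the actual
# dd-trace-py implementation introduced by the referenced commit.
import random
import socket
import time

def fibonacci_delays(attempts):
    a, b = 1, 1
    for _ in range(attempts):
        yield a
        a, b = b, a + b

def send_with_retry(send, attempts=5):
    last_exc = None
    for base_delay in fibonacci_delays(attempts):
        try:
            return send()
        except (socket.timeout, OSError) as exc:
            last_exc = exc
            time.sleep(base_delay * random.uniform(0.5, 1.5))  # jittered wait
    raise last_exc
```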
mergify bot closed this as completed in #2459 on May 25, 2021
mergify bot added a commit that referenced this issue on May 25, 2021
This change introduces a Fibonacci retry policy (with jitter) to the
agent writer to mitigate networking issues (e.g. timeouts, broken pipes,
...), similar to what the profiler does already.

Resolves #1413.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>