Make Open Telemetry the default instead of StatsD for monitoring #40800

kaxil · 2024-07-15T18:38:19Z

Currently, Apache Airflow uses StatsD for metrics collection and monitoring. To modernize our observability stack and align with industry standards, we should adopt OpenTelemetry as the primary metrics collection and monitoring tool for Airflow 3 while keeping Statsd as an option. This is made easier with AIP-49 Open Telemetry Support.

Backward Compatibility / Migration

Most of the enterprise Airflow deployments rely on StatsD for monitoring & alerting so we should make sure that there is a smooth migration path. Ideally, all the StatsD metrics mentioned in https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions should have a like-by-like replacement.

High-level Objectives:

Use OpenTelemetry as the default metrics collection tool. This includes setting up configurations, dependencies, and code integration.
Documentation Update: Update the official documentation to reflect the change from StatsD to OpenTelemetry as the default, including setup, configuration, and usage instructions. Include migration guide for users transitioning from StatsD to OpenTelemetry if there are breaking changes. This includes the Helm chart.

josix · 2024-07-16T16:03:57Z

Hi @kaxil I'm interested in this issue, may I have a try for this, thanks!

kaxil · 2024-07-17T19:39:49Z

Awesome, assigned it to you

dirrao · 2024-07-18T12:57:16Z

Hi @josix
Let me know if you need any help.

raphaelauv · 2024-07-18T13:28:12Z

also provide a simple grafana dashboard in the example stack ( cause the main airflow grafana dashboard open source is based on statsd ) would help a lot

kaxil · 2024-07-25T13:55:14Z

@howardyoo is also interested in working on this. He led AIP-49, so @josix if you need help please let Howard no.

@howardyoo is equally interested in leading this effort too

josix · 2024-07-25T14:27:08Z

Yeah, I just checked out the AIP and the sharing in the Airflow Submit these days, and currently worked on studying the codebase around StatsD and OTel. I believe it would be a better choice for @howardyoo to lead this topic 🙂 Please feel free to unassign me and assign sub-items to me if possible, thanks for the coordination.

dirrao · 2024-07-25T15:25:31Z

I am interested too in this. Please assign me any tasks needed to be done.
I am working on similar lines on this in this PR #39908

kaxil · 2024-07-25T17:06:51Z

@howardyoo Could you add a comment, please? GitHub doesn't allow me to assign you an issue until you have commented on it since you aren't part of the Apache org

howardyoo · 2024-07-25T22:36:09Z

@howardyoo Could you add a comment, please? GitHub doesn't allow me to assign you an issue until you have commented on it since you aren't part of the Apache org

Ok, done!

kaxil · 2024-07-26T08:05:45Z

Awesome, assigned it to you.

howardyoo · 2024-08-28T17:31:46Z

Oh, I guess the above mentioned PR draft got released, which is nice - but looks like it may be on hold for now.

chomipi88 · 2024-09-04T15:14:19Z

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

howardyoo · 2024-09-04T15:16:12Z

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

raphaelauv · 2024-09-04T15:22:38Z

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Ok but what are your arguments in favor of this policy?

howardyoo · 2024-09-04T23:42:49Z

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Ok but what are your arguments in favor of this policy?

My favor to this policy, is that

Opentelemetry has become quite stable and mature, and is vendor neural way of instrumenting softwares
StatsD only covers metrics, while Opentelemetry covers metrics, traces, and logs
StatsD has limitations in terms of number of attributes, or metric names, while Opentelemetry is less limited
Opentelemetry also provides opentelemetry collector which can further be used to control how the telemetry could be transferred.

raphaelauv · 2024-09-05T05:58:08Z

Yes otel is better ,but why keep statsd ?

potiuk · 2024-09-05T08:34:47Z

I was originally promoting the idea of attempting to use otel metrics to provide backwards compatibility with StatsD (as an option). I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional - let's use single "producing" mechanism (OTEL) but let's make it produce statsd-compatible metrics so that existing dashboards continue to work.

I think there are certain limitations (metrics name length for one) that make it either difficult or impossible or at the very least imperfect. So I am in favour of getting OTEL as "default" interface, but leaving (at least for a long while) statsd as an optional legacy mechanism - available to those who already monitor their airflow with statsd.

The thing is that it's not very difficult to keep both. Metrics change rarely, but most importantly - keeping statsd metrics in the code (since those are only metrics and we keep all the data to produce for OTEL anyway) is not a big overhead.

Yes, it introduces some duplication and possibly the OTEL vs. STATSD metrics will - over time - diverge even more, but that's quite fine. If we focus on improving and making OTEL metrics more useful while treat Statsd as legacy, read-only that we do not actively maintain, this will make OTEL even more attractive over time and we could even recommend people - if you want to get better / improved metrics go OTEL if you are stil using statsd. But we should give people more time to do so and let them do it independently from the decision on migration to Airflow 3.

I see statsd as a way how we can improve adoption of Airflow 3. If we expect future Airflow 3 users to migrate quickly, adding them "yet-another-thing-to-learn-and-configure" gives compounding effect on the difficulty of migrateion and might be a factor where they decide to defer migration decision. And since it does not cost us a lot of maintenance - this is a "cheap" overhead.

howardyoo · 2024-09-05T15:36:45Z

I was originally promoting the idea of attempting to use otel metrics to provide backwards compatibility with StatsD (as an option). I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional - let's use single "producing" mechanism (OTEL) but let's make it produce statsd-compatible metrics so that existing dashboards continue to work.

I think there are certain limitations (metrics name length for one) that make it either difficult or impossible or at the very least imperfect. So I am in favour of getting OTEL as "default" interface, but leaving (at least for a long while) statsd as an optional legacy mechanism - available to those who already monitor their airflow with statsd.

The thing is that it's not very difficult to keep both. Metrics change rarely, but most importantly - keeping statsd metrics in the code (since those are only metrics and we keep all the data to produce for OTEL anyway) is not a big overhead.

Yes, it introduces some duplication and possibly the OTEL vs. STATSD metrics will - over time - diverge even more, but that's quite fine. If we focus on improving and making OTEL metrics more useful while treat Statsd as legacy, read-only that we do not actively maintain, this will make OTEL even more attractive over time and we could even recommend people - if you want to get better / improved metrics go OTEL if you are stil using statsd. But we should give people more time to do so and let them do it independently from the decision on migration to Airflow 3.

I see statsd as a way how we can improve adoption of Airflow 3. If we expect future Airflow 3 users to migrate quickly, adding them "yet-another-thing-to-learn-and-configure" gives compounding effect on the difficulty of migrateion and might be a factor where they decide to defer migration decision. And since it does not cost us a lot of maintenance - this is a "cheap" overhead.

@potiuk beat me into responding!
but yes, the overhead is little, statsd had been the default choice for years, so maintaining backward compatibility is still important, yet now users who are using statsd could experiment and adopt otel as their next telemetry while they can still use what they're used to would be my opinion.

ferruzzi · 2024-09-05T17:16:09Z

I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional

My apologies for misrepresenting your stand on that. I shouldn't have spoken for you.

I have mentioned my solution in the past but perhaps I need to codify it, or open an Issue and let someone else tackle it. Currently there are three metrics backends (called loggers) StasD, OTel, and Datadog (which as far as I can tell is a variant on StatsD which supports tagging). The biggest compatibility issue is that OTel has a much shorter max name length (63 characters), but supports tagging. StatsD has a much longer (300 character?) name length limit but does not support tagging.

Solution:

Add a "generate name" abstract method to base metrics logger, something like:

def get_name(name: str, tags:dict) -> str:
    ...

Each logger will then need to implement this and call it before actually emitting the metric. For otel, it would just return name. for StatsD it would return the name with the tags dict concatenated onto it, something more or less like this should work but there's likely a better way

def get_name(name: str, tags:dict) -> str:
    suffix = str(tags)[1:-1] # drop the open and close curly braces
        .replace(": ", "_") # replace the key/value delimiter with an underscore
        .replace(", ", "_") # replace the comma between keys/value pairs with an underscore
        .replace("'", "")   # remove the quotes around the key name 
    
    return f"{name}_{suffix}"

Then we can stop double-emitting, remove the "we have to truncate this name" name length warning, and let each logger handle naming errors however it sees fit.

[EDIT]
shorter but nowhere near as clear statsd implementation

def get_name(name: str, tags:dict) -> str:
    return f'{name}_{("_").join([f"{k}_{v}" for (k, v) in tags.items()])}'

kaxil · 2024-09-05T22:30:25Z

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Yup, I agree with this

dthauvin · 2024-11-18T10:41:32Z

I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional

My apologies for misrepresenting your stand on that. I shouldn't have spoken for you.

I have mentioned my solution in the past but perhaps I need to codify it, or open an Issue and let someone else tackle it. Currently there are three metrics backends (called loggers) StasD, OTel, and Datadog (which as far as I can tell is a variant on StatsD which supports tagging). The biggest compatibility issue is that OTel has a much shorter max name length (63 characters), but supports tagging. StatsD has a much longer (300 character?) name length limit but does not support tagging.

Solution:

Add a "generate name" abstract method to base metrics logger, something like:
def get_name(name: str, tags:dict) -> str:
    ...
Each logger will then need to implement this and call it before actually emitting the metric. For otel, it would just return name. for StatsD it would return the name with the tags dict concatenated onto it, something more or less like this should work but there's likely a better way
def get_name(name: str, tags:dict) -> str:
    suffix = str(tags)[1:-1] # drop the open and close curly braces
        .replace(": ", "_") # replace the key/value delimiter with an underscore
        .replace(", ", "_") # replace the comma between keys/value pairs with an underscore
        .replace("'", "")   # remove the quotes around the key name 
    
    return f"{name}_{suffix}"
Then we can stop double-emitting, remove the "we have to truncate this name" name length warning, and let each logger handle naming errors however it sees fit.

[EDIT] shorter but nowhere near as clear statsd implementation
def get_name(name: str, tags:dict) -> str:
    return f'{name}_{("_").join([f"{k}_{v}" for (k, v) in tags.items()])}'

Hi @ferruzzi ,i don't know if you see this , but opentelemetry specification goes from 63 characters to 255 open-telemetry/opentelemetry-specification#3648 .
With OpenTelemetry configuration i got a lot of truncate metrics . Specially those with the dag name inside . it's very difficult to predict the metrics name and so make observability dashboard automation

exceeds OpenTelemetry's name length limit of 63 characters and will be truncated to

ferruzzi · 2024-11-18T17:37:39Z

Thanks for pointing it out. We have a fix for it coming in #43340

kaxil mentioned this issue Jul 15, 2024

Release Airflow 3.0 #39593

Open

10 tasks

dosubot bot added the telemetry Telemetry-related issues label Jul 15, 2024

kaxil added involves core breaking change airflow3.0:candidate Potential candidates for Airflow 3.0 airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes labels Jul 15, 2024

kaxil assigned josix Jul 17, 2024

kaxil assigned dirrao and unassigned josix Jul 25, 2024

kaxil assigned howardyoo Jul 26, 2024

dirrao mentioned this issue Jul 26, 2024

Airflow 3: Statsd removal and enable OTEL as default metrics agent #41044

Closed

uranusjr removed the involves core breaking change label Aug 15, 2024

ferruzzi mentioned this issue Sep 5, 2024

metric name in otel metric logger will get truncated if they exceed limit of 63 characters #42038

Closed

2 tasks

kaxil changed the title ~~Remove StatsD and replace it with Open Telemetry as first-class citizen~~ Make Open Telemetry the default instead of StatsD for monitoring Sep 5, 2024

ferruzzi mentioned this issue Sep 6, 2024

Align timers and timing metrics (ms) across all metrics loggers #39908

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Open Telemetry the default instead of StatsD for monitoring #40800

Make Open Telemetry the default instead of StatsD for monitoring #40800

kaxil commented Jul 15, 2024 •

edited

Loading

josix commented Jul 16, 2024

kaxil commented Jul 17, 2024

dirrao commented Jul 18, 2024

raphaelauv commented Jul 18, 2024

kaxil commented Jul 25, 2024

josix commented Jul 25, 2024

dirrao commented Jul 25, 2024

kaxil commented Jul 25, 2024

howardyoo commented Jul 25, 2024

kaxil commented Jul 26, 2024

howardyoo commented Aug 28, 2024 •

edited

Loading

chomipi88 commented Sep 4, 2024

howardyoo commented Sep 4, 2024

raphaelauv commented Sep 4, 2024

howardyoo commented Sep 4, 2024

raphaelauv commented Sep 5, 2024 •

edited

Loading

potiuk commented Sep 5, 2024

howardyoo commented Sep 5, 2024

ferruzzi commented Sep 5, 2024 •

edited

Loading

kaxil commented Sep 5, 2024

dthauvin commented Nov 18, 2024 •

edited

Loading

ferruzzi commented Nov 18, 2024

Make Open Telemetry the default instead of StatsD for monitoring #40800

Make Open Telemetry the default instead of StatsD for monitoring #40800

Comments

kaxil commented Jul 15, 2024 • edited Loading

Backward Compatibility / Migration

josix commented Jul 16, 2024

kaxil commented Jul 17, 2024

dirrao commented Jul 18, 2024

raphaelauv commented Jul 18, 2024

kaxil commented Jul 25, 2024

josix commented Jul 25, 2024

dirrao commented Jul 25, 2024

kaxil commented Jul 25, 2024

howardyoo commented Jul 25, 2024

kaxil commented Jul 26, 2024

howardyoo commented Aug 28, 2024 • edited Loading

chomipi88 commented Sep 4, 2024

howardyoo commented Sep 4, 2024

raphaelauv commented Sep 4, 2024

howardyoo commented Sep 4, 2024

raphaelauv commented Sep 5, 2024 • edited Loading

potiuk commented Sep 5, 2024

howardyoo commented Sep 5, 2024

ferruzzi commented Sep 5, 2024 • edited Loading

kaxil commented Sep 5, 2024

dthauvin commented Nov 18, 2024 • edited Loading

ferruzzi commented Nov 18, 2024

kaxil commented Jul 15, 2024 •

edited

Loading

howardyoo commented Aug 28, 2024 •

edited

Loading

raphaelauv commented Sep 5, 2024 •

edited

Loading

ferruzzi commented Sep 5, 2024 •

edited

Loading

dthauvin commented Nov 18, 2024 •

edited

Loading